Imbalance-Aware SSD Failure Prediction with Attention-Gated SMART Modeling and LLM-Guided Feature Semantics
DOI:
https://doi.org/10.69987/JACS.2026.60601
Keywords:
SSD failure prediction, SMART telemetry, extreme class imbalance, XGBoost, LightGBM, BiGRU, multi-head attention, focal loss, probability calibration, feature semantics
Abstract
Solid-state drive (SSD) failure prediction is a rare-event reliability problem in which missed failures can lead to service disruption, whereas excessive alarms consume replacement capacity and engineering time. This study evaluates imbalance-aware SSD failure prediction on a deterministic 30,000-drive benchmark constructed to follow the public Alibaba SSD data schema. The benchmark contains 105 SMART snapshot fields, 39 failure-tag fields, 11 anonymized drive models, and 280 failures, corresponding to a 0.933% positive rate. Balanced logistic regression, random forest, XGBoost with scale-sensitive weighting, LightGBM with class weighting, and a bidirectional gated recurrent unit with multi-head attention (BiGRU-MHA) trained by focal loss are compared under a common stratified protocol. A fixed language-model-guided semantics layer groups SMART counters into wear, program/erase, media-error, interface, thermal, reallocation, and power-cycle concepts. Balanced logistic regression provides the highest PR-AUC (0.603), LightGBM provides the strongest thresholded recall (0.667) and the highest non-linear-model F1 score (0.596), and BiGRU-MHA delivers the best precision at the strictest 0.5% alert budget while remaining less consistent over broader budgets. The results indicate that imbalance treatment and semantically coherent aggregation can be more valuable than architectural complexity when the available telemetry is an endpoint snapshot rather than a complete daily history.
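The quantities named in the abstract can be made concrete with a short sketch. It reproduces the 0.933% positive rate from the stated counts (280 failures among 30,000 drives), computes the negative-to-positive ratio that is commonly used as a positive-class weight in gradient-boosted models, evaluates a binary focal loss for a single prediction, and measures precision and recall under an alert budget. The function names are hypothetical, the focal-loss defaults (gamma = 2, alpha = 0.25) are widely used values rather than settings reported in this study, and treating the class ratio as the weight is an assumption about what "scale-sensitive weighting" means here.

```python
import math


def positive_rate(n_failures, n_drives):
    """Fraction of drives labeled as failed."""
    return n_failures / n_drives


def neg_pos_ratio(n_failures, n_drives):
    """Negatives per positive; a common heuristic for a positive-class weight.

    Assumption: the study's exact weighting scheme is not stated in the abstract.
    """
    return (n_drives - n_failures) / n_failures


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one predicted probability p and label y in {0, 1}.

    gamma and alpha are illustrative defaults, not values from the paper.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def precision_recall_at_budget(scores, labels, budget):
    """Alert on the top `budget` fraction of drives ranked by score."""
    n_alerts = max(1, round(len(scores) * budget))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_alerts])
    total_pos = sum(labels)
    precision = hits / n_alerts
    recall = hits / total_pos if total_pos else 0.0
    return precision, recall


# Counts taken from the abstract.
rate = positive_rate(280, 30_000)
ratio = neg_pos_ratio(280, 30_000)
print(f"positive rate: {rate:.3%}, negatives per positive: {ratio:.1f}")
```

On a 30,000-drive benchmark, a 0.5% alert budget corresponds to 150 alerts. With 280 true failures, recall at that budget cannot exceed 150/280 (about 0.536) even for a perfect ranking, which is why precision is the more informative metric at the strictest budget.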







