Topology-Aware SSD Health Risk Prediction with SMART Signals, Location Features, and LLM-Based Failure Explanation

Ethan Ng

doi:10.69987/JACS.2026.60604

Authors

Ethan Ng, Computer Engineering, Monash University, Melbourne, VIC, Australia

Keywords:

SSD failure prediction, SMART attributes, topology-aware machine learning, rack-level risk, imbalanced classification, interpretable AI, large language model explanations

Abstract

Solid-state drive (SSD) failure prediction is commonly treated as a device-level classification problem based on Self-Monitoring, Analysis and Reporting Technology (SMART) counters. Production fleets, however, also vary by drive model, application, and physical placement, and failures can cluster within nodes and racks. This study presents TopoHealth-ET, a topology-aware risk model that combines SMART indicators, deployment metadata, train-only topology priors, and an evidence-constrained language explanation layer. The empirical evaluation uses a deterministic, schema-aligned benchmark that follows the documented Alibaba SSD table structures and contains 30,000 disks, 11 drive models, 200 racks, and a 1.89% overall failure rate. On the held-out test split, TopoHealth-ET achieved a ROC-AUC of 0.775, a PR-AUC of 0.203, precision of 0.345, recall of 0.175, and an F1 score of 0.233. At a 1% maintenance review budget, the ranked worklist captured 18.42% of test-split failures with 35.00% precision, compared with a 1.90% failure rate in that split, which corresponds to roughly an 18-fold lift over random selection. The results indicate that topology context can materially improve rare-event maintenance triage, while structured evidence objects allow failure explanations to remain concise, auditable, and aligned with the classifier evidence.
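
The review-budget figures in the abstract can be read as precision, recall, and lift over the top-ranked fraction of disks. The following is a minimal sketch of that calculation, not the study's evaluation code. The function name `worklist_metrics` and the toy split (2,000 disks, 38 failures, 7 of them ranked in the top 20) are hypothetical, chosen only because they reproduce the reported percentages; the actual size of the test split is not stated here.

```python
def worklist_metrics(scores, labels, budget=0.01):
    """Score the top `budget` fraction of disks when ranked by predicted risk."""
    n = len(scores)
    if n == 0 or n != len(labels):
        raise ValueError("scores and labels must be non-empty and of equal length")
    k = max(1, round(n * budget))  # number of disks that fit the review budget
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits = sum(labels[i] for i in ranked[:k])  # true failures inside the worklist
    failures = sum(labels)
    base_rate = failures / n
    precision = hits / k
    return {
        "reviewed": k,
        "precision": precision,
        "recall": hits / failures if failures else 0.0,
        "base_rate": base_rate,
        "lift": precision / base_rate if base_rate else 0.0,
    }


# Hypothetical toy split (an assumption, not the paper's data).
N_DISKS = 2000
toy_scores = [1.0 - i / N_DISKS for i in range(N_DISKS)]  # disk 0 has the highest risk
failed = set(range(7)) | set(range(100, 131))  # 7 failures in the top 20, 31 elsewhere
toy_labels = [1 if i in failed else 0 for i in range(N_DISKS)]

metrics = worklist_metrics(toy_scores, toy_labels, budget=0.01)
print(
    f"reviewed={metrics['reviewed']} "
    f"precision={metrics['precision']:.2%} "
    f"recall={metrics['recall']:.2%} "
    f"base_rate={metrics['base_rate']:.2%} "
    f"lift={metrics['lift']:.2f}x"
)
```

Under these assumptions, lift can be computed either as precision divided by the base failure rate or as recall divided by the budget fraction; the two agree whenever the worklist holds exactly the budgeted share of disks.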