Risk-Bounded GPU Resource Oversubscription via Conformal Demand Envelopes in Production AI Clusters
DOI:
https://doi.org/10.69987/JACS.2024.40509
Keywords:
GPU clusters, oversubscription, conformal prediction, demand envelopes, resource scheduling, Alibaba cluster trace, risk-bounded admission control, AI infrastructure, explainable policy generation
Abstract
GPU inventory is expensive and hard to expand quickly, and demand for it is bursty, so operators often want to sell or admit more work than a strict reservation policy allows. Unbounded oversubscription, however, turns short-term efficiency into SLO risk. This paper presents a risk-bounded oversubscription method based on conformal demand envelopes. We evaluate the method empirically on the Alibaba cluster-trace-gpu-v2023 CSV data. The evaluation used 1,523 production nodes, 1,213 GPU nodes, 6,212 GPUs, 8,152 default pod records, and 3,585 reconstructed hourly demand bins. Each pod request was converted into normalized GPU, CPU, and memory demand using only released fields: cpu_milli, memory_mib, num_gpu, gpu_milli, QoS, pod phase, creation time, scheduled time, and deletion time. A chronological train/calibration/test split of 2,050/683/684 bins was used. At α = 0.05, the conformal envelope achieved a measured joint violation rate of 0.000000 and a false-safe rate of 0.000000, while increasing protected inventory utilization from 0.003983 under conservative reservation to 0.332877. Average-demand oversubscription and linear quantile regression had measured joint violation rates of 0.555556 and 0.510234, respectively. The results show that a conformal controller can tighten capacity envelopes without replacing cluster schedulers or relying on unverifiable utilization counters. The paper also includes a deterministic LLM-facing explanation layer that translates the calibrated policy state into operator-readable admission guidance without changing numerical decisions.
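To make the idea of a conformal demand envelope concrete, the following is a minimal sketch of textbook split conformal calibration for a one-sided upper bound on a single normalized resource. It is an illustration under stated assumptions, not the paper's implementation: the function names `conformal_margin` and `admit`, the use of a point forecast plus an additive margin, and the single-resource admission check are all hypothetical. The joint GPU/CPU/memory treatment described in the abstract is not reproduced here.

```python
import math


def conformal_margin(cal_actual, cal_forecast, alpha=0.05):
    """Split-conformal margin for a one-sided upper demand envelope.

    Hypothetical helper: cal_actual and cal_forecast are demand values
    and point forecasts on a held-out calibration split. Under the usual
    exchangeability assumption, forecast + margin covers the next
    demand value with probability at least 1 - alpha.
    """
    # Nonconformity score: how far actual demand exceeded the forecast.
    scores = sorted(a - f for a, f in zip(cal_actual, cal_forecast))
    n = len(scores)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: no finite bound exists.
        return float("inf")
    return scores[k - 1]


def admit(forecast, margin, request, capacity=1.0):
    """Admit a request only if the enveloped demand plus the request fits.

    Hypothetical single-resource check; all quantities are normalized
    to the same capacity scale.
    """
    return forecast + margin + request <= capacity
```

With ten calibration bins and α = 0.25, the rank is ceil(11 × 0.75) = 9, so the margin is the ninth smallest residual. With the same ten bins and α = 0.05 the rank would be 11, which exceeds the sample size, so the sketch returns an infinite margin and admits nothing. This is one reason a calibration split of several hundred bins matters at small α.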







