GPU Memory Usage Prediction for Generative AI Serving Pipelines with Queue, Latency, and Utilization Signals

Lee Ji-su

doi:10.69987/JACS.2026.60701

Authors

Lee Ji-su, Computer Science, Postech, Pohang, GB, South Korea

DOI:

https://doi.org/10.69987/JACS.2026.60701

Keywords:

generative AI serving, GPU memory prediction, diffusion pipelines, queueing signals, latency-aware forecasting, GPU utilization, GenTD26, serverless inference

Abstract

Generative AI serving pipelines combine prompt processing, model and adapter selection, iterative generation, queueing, and GPU-resident state, so their memory footprint depends on more than request volume alone. This study investigates next-window GPU memory prediction with cross-layer signals from Alibaba GenTD26. The public request trace is transformed into a deterministic pod-window benchmark containing 12,253 active 10-minute samples constructed from 26,823 request records and 143 reproducible pseudo-pods. The feature space combines request complexity, queue pressure, execution and pipeline latency, GPU duty-cycle proxy, pod-memory-utilization proxy, and recent resident-memory lags. Six predictors are evaluated with chronological train, validation, and test partitions. Stochastic-gradient regression provides the lowest test RMSE at 2.588 GiB, with MAE 1.903 GiB and R² 0.608; Ridge and ExtraTrees follow closely. A feature ablation reduces DecisionTree RMSE from 3.469 GiB with request-only variables to 2.889 GiB with the full cross-layer signal set, a 16.7% reduction. Error increases from the 10-minute to the 30-minute horizon, emphasizing the value of near-term forecasts for admission control, pod selection, and warm-state management. The results show that queue, latency, utilization, and recent memory state jointly provide a practical basis for lightweight GPU-memory forecasting in generative serving pipelines.
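
The chronological partitioning and the reported ablation figure can be illustrated with a minimal Python sketch. The 70/15/15 split fractions, the function names, and the toy window list are assumptions made for illustration only; the abstract does not state the proportions actually used. The only values taken from the text are the two DecisionTree RMSE figures.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(
    windows: Sequence[T], train_frac: float = 0.70, val_frac: float = 0.15
) -> Tuple[List[T], List[T], List[T]]:
    """Split time-ordered samples without shuffling, so validation and test
    windows always come after the training windows.

    The 70/15/15 fractions are hypothetical defaults, not the paper's values.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty test partition")
    n = len(windows)
    # round() avoids off-by-one cut points caused by float representation.
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return (
        list(windows[:train_end]),
        list(windows[train_end:val_end]),
        list(windows[val_end:]),
    )


def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage reduction of `improved` relative to `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Toy stand-in for time-ordered 10-minute window indices (not real data).
toy_windows = list(range(20))
train, val, test = chronological_split(toy_windows)

# RMSE values quoted in the abstract (GiB): request-only vs. full signal set.
ablation_gain = relative_reduction(3.469, 2.889)
print(f"{ablation_gain:.1f}% RMSE reduction")  # prints "16.7% RMSE reduction"
```

Keeping the windows in time order means every test window is later than every training window, which matches the next-window forecasting setting described above; a shuffled split would let future memory state leak into training.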