Cross-Cloud Transfer Learning for AI Training Capacity Forecasting under Workload and Topology Distribution Shift

Shilu He; Xiaohan Chang; Eddin Sun

doi:10.69987/JACS.2024.40108

Authors

Shilu He, Mathematics, UW-Madison, WI, USA
Xiaohan Chang, Computer Science, University of Connecticut, CT, USA
Eddin Sun, Data Science, Columbia University, NY, USA

Keywords:

AI training traces, GPU capacity forecasting, transfer learning, domain adaptation, distribution shift, cloud resource management, topology-aware scheduling, spot GPU workloads, inference serving

Abstract

AI cloud operators plan GPU capacity from traces that age quickly as training, inference, and opportunistic spot workloads change their resource mix and topology constraints. This paper presents a reproducible cross-domain study of six-hour-ahead GPU capacity forecasting that uses Alibaba Lingjun 2023 training jobs as the source domain and two newer Alibaba GPU traces as target domains: a 2025 GPU-disaggregated DLRM inference trace and a 2026 spot-GPU job trace. The experiment uses the released CSV files and reports only measured values. It parses Lingjun job, worker, and Clos topology records; normalizes inference instances and spot-GPU jobs into the same interval representation; aggregates them into one-hour windows; and forecasts active GPU demand six hours ahead. We compare source-only transfer, target-only few-shot learning, pooled few-shot transfer, residual adaptation, importance weighting, CORAL feature alignment, and domain-adversarial neural training. The measured drift is large: a logistic domain classifier separates source from each target with AUC 1.00, and the largest standardized feature shifts come from RDMA/topology proxies for the 2025 inference target and from the high-priority/spot mix plus GPU-model capacity proxies for the 2026 target. Relative to the target few-shot reference, source-only transfer exhibits negative transfer on 99.5% of 2025 inference test windows and 63.2% of 2026 spot-GPU windows. Pooled few-shot transfer is the best method on the 2026 target, reducing MAE from 811.36 GPUs for source-only ExtraTrees to 491.44 GPUs (a 39.4% reduction). For the long-running 2025 inference target, six-hour persistence is best with an MAE of 22.85 GPUs, while pooled transfer trails at 40.12 GPUs. The results show that transfer learning is useful only when target support windows anchor the model to the shifted workload and topology.
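
To make the forecasting setup concrete, the following is a minimal sketch of the interval-to-hourly aggregation and the six-hour persistence baseline described above. It is illustrative only and is not the code used in the study: the function names, the `(start, end, gpus)` tuple layout, and the choice of a time-weighted hourly average as the definition of active GPU demand are all assumptions.

```python
from datetime import datetime, timedelta

HORIZON_HOURS = 6  # forecast horizon stated in the abstract


def hourly_active_gpus(jobs, start, n_hours):
    """Aggregate (job_start, job_end, gpus) intervals into hourly demand.

    Assumption: demand in an hour is the time-weighted average number of
    GPUs in use during that hour. The paper may define it differently.
    """
    demand = [0.0] * n_hours
    for job_start, job_end, gpus in jobs:
        for h in range(n_hours):
            win_start = start + timedelta(hours=h)
            win_end = win_start + timedelta(hours=1)
            # Seconds of this job that fall inside the one-hour window.
            overlap = (min(job_end, win_end) - max(job_start, win_start)).total_seconds()
            if overlap > 0:
                demand[h] += gpus * overlap / 3600.0
    return demand


def persistence_forecast(series, horizon=HORIZON_HOURS):
    """Predict y[t + horizon] with y[t]; returns (predictions, targets)."""
    if len(series) <= horizon:
        raise ValueError("series must be longer than the forecast horizon")
    return series[:-horizon], series[horizon:]


def mae(pred, true):
    """Mean absolute error, in GPUs."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


# Toy data (hypothetical, not taken from any Alibaba trace).
t0 = datetime(2023, 1, 1)
toy_jobs = [
    (t0, t0 + timedelta(hours=2), 8),                          # 8 GPUs for 2 h
    (t0 + timedelta(minutes=30), t0 + timedelta(hours=1), 4),  # 4 GPUs for 30 min
]
toy_demand = hourly_active_gpus(toy_jobs, t0, 3)  # [10.0, 8.0, 0.0]
```

On a real trace, `persistence_forecast(series)` followed by `mae(pred, true)` gives the kind of baseline error against which the transfer methods are compared. The toy series above is too short for a six-hour horizon, so a shorter `horizon` would be needed to try it out.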