Machine Learning-Based Network Performance Monitoring and Prediction for Distributed AI Training Workloads

Juan Li; Wenkun Ren

doi:10.69987/

Authors

Juan Li, Shanghai Jiao Tong University, Master of Science in Communication and Information Systems
Wenkun Ren, Information Technology and Management, Illinois Institute of Technology, Chicago, IL

Keywords:

Network Performance Monitoring, Machine Learning Prediction, Distributed AI Training, Performance Optimization

Abstract

The exponential growth of distributed artificial intelligence training workloads has created unprecedented challenges in network performance management and optimization. Traditional monitoring approaches fail to adequately address the dynamic and complex communication patterns inherent in distributed AI systems. This paper presents a comprehensive machine learning-based framework for network performance monitoring and prediction specifically designed for distributed AI training environments. Our approach leverages advanced feature engineering techniques to capture multi-dimensional network performance metrics, including latency variations, bandwidth utilization patterns, and communication bottlenecks. We develop and evaluate multiple machine learning models, including gradient boosting machines, neural networks, and ensemble methods, to predict network performance degradation before it impacts training efficiency. Experimental evaluation on real-world distributed training scenarios demonstrates that our framework achieves 94.7% prediction accuracy for network latency spikes and reduces training time by 23.4% through proactive performance optimization. The proposed monitoring architecture provides real-time insights into network behavior patterns, enabling adaptive resource allocation and communication scheduling optimization. Our contributions include a novel feature extraction methodology, a comprehensive performance prediction model, and a scalable monitoring framework that can be deployed across heterogeneous distributed computing environments.