Risk-Sensitive Offline Reinforcement Learning for Stable ABR QoE Improvements on Real HSDPA and LTE Traces
DOI: https://doi.org/10.69987/JACS.2023.30401

Keywords: Adaptive bitrate streaming, Quality of Experience, Offline reinforcement learning, Risk-sensitive RL

Abstract
Adaptive bitrate (ABR) algorithms must select per-chunk video quality under substantial network uncertainty. While reinforcement learning (RL) improves average Quality of Experience (QoE), trace-driven evaluations often reveal heavy-tailed stall events and brittle behavior on high-variance cellular links. This paper presents a risk-sensitive offline-RL ABR design that optimizes the lower tail of the return distribution via Conditional Value-at-Risk (CVaR) computed from a distributional Q-function. We conduct a full empirical evaluation on two public real-trace datasets: (i) 12 3G/HSDPA throughput logs from Norwegian mobile streaming sessions (UMass MMSys trace archive), and (ii) 20 4G/LTE bandwidth logs collected along routes in Ghent, Belgium (UGent/IDLab dataset). Using a Pensieve-style chunked streaming simulator and a standard QoE function (bitrate reward, rebuffer penalty, and smoothness penalty), we compare a buffer-based rule (BBA), robust model predictive control (RobustMPC), an online tabular actor–critic (A2C), and an offline distributional RL method (Quantile Regression Conservative Q-Learning, QR-CQL) with a CVaR decision rule. Across 400 fixed test episodes on held-out traces, the risk-sensitive policy OfflineQR-CQL(CVaR@0.25) achieves a mean QoE of 104.91, within 17.6% of the best policy (RobustMPC). Relative to online A2C, mean QoE changes by −8.3% and mean rebuffer time by −224.2%; relative to RobustMPC, mean QoE changes by −17.6% and mean rebuffer time by −79.6%. Bucketed analysis by trace coefficient of variation shows the largest relative QoE gain in the highest-variability quartile (Q4), where the QoE difference between OfflineQR-CQL(CVaR@0.25) and RobustMPC is −27.59 points. A CVaR sensitivity sweep confirms a controllable risk–reward trade-off governed by the tail level α.
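The QoE objective named in the abstract (bitrate reward, rebuffer penalty, smoothness penalty) follows the standard Pensieve-style linear form. A minimal sketch below; the function name, the rebuffer weight `mu`, and the use of raw bitrate as the quality term are illustrative assumptions, since the abstract does not fix these details:

```python
def qoe(bitrates, rebuffer_times, mu=4.3):
    """Pensieve-style linear QoE over one episode: sum of per-chunk
    bitrates, minus a weighted rebuffering penalty, minus a smoothness
    penalty on consecutive bitrate switches.

    bitrates: per-chunk selected bitrates (e.g. Mbps).
    rebuffer_times: per-chunk stall durations in seconds.
    mu: rebuffer penalty weight (4.3 is a common choice; assumed here).
    """
    reward = sum(bitrates)
    rebuf_penalty = mu * sum(rebuffer_times)
    smooth_penalty = sum(abs(b1 - b0) for b0, b1 in zip(bitrates, bitrates[1:]))
    return reward - rebuf_penalty - smooth_penalty
```

For example, an episode with bitrates [1.0, 2.0] Mbps, a 0.5 s stall, and mu=2.0 scores 3.0 − 1.0 − 1.0 = 1.0.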

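To make the CVaR decision rule concrete: with a distributional Q-function that outputs N equally weighted quantile estimates per action (as in QR-style methods), CVaR@α can be approximated by averaging the lowest ⌈αN⌉ quantiles of each action's return distribution and acting greedily on that tail mean. A minimal sketch under those assumptions; names are illustrative, not from the paper's code:

```python
import numpy as np

def cvar_action(quantile_q: np.ndarray, alpha: float) -> int:
    """Select the action maximizing CVaR@alpha of its return distribution.

    quantile_q: array of shape (num_actions, num_quantiles) holding
    equally weighted quantile estimates of each action's return.
    alpha: tail level in (0, 1]; alpha=1.0 recovers the risk-neutral mean.
    """
    num_quantiles = quantile_q.shape[1]
    k = max(1, int(np.ceil(alpha * num_quantiles)))  # size of the lower tail
    # Sort each action's quantiles and average the worst k: an estimate of
    # the expected return conditional on the alpha-tail (CVaR@alpha).
    tail = np.sort(quantile_q, axis=1)[:, :k]
    cvar = tail.mean(axis=1)
    return int(np.argmax(cvar))
```

Shrinking α weighs worst-case outcomes more heavily, which is the α-governed risk–reward trade-off the sensitivity sweep reports.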