QoE-Driven Reinforcement Learning for Joint Bitrate, Rebuffering, and TTFF Optimization in HLS/DASH
DOI:
https://doi.org/10.69987/JACS.2023.30204
Keywords:
adaptive bitrate streaming, QoE, reinforcement learning, HLS, MPEG-DASH
Abstract
HTTP adaptive streaming over HLS/DASH must balance delivered visual quality against playback interruptions, bitrate variation, and startup delay. In many deployed players, time-to-first-frame (TTFF) is still handled through startup heuristics rather than being optimized jointly with steady-state adaptive bitrate (ABR) decisions. This paper studies a trace-driven controller family that combines a PPO+GAE actor-critic policy with two deployment-oriented constraints: a safety supervisor that caps bitrate by an online throughput estimate and an optional startup cap that operates only before playback begins. We evaluate the controller family on 40 mobile HSDPA throughput traces from MMSys’13 using a simulator with 2 s segments, a 6-level bitrate ladder, and a unified QoE metric that rewards bitrate and penalizes rebuffering, bitrate changes, and TTFF. In the four-way controller comparison on the held-out 8-trace test split, the 750 kbps startup-cap operating point (SafeRL-TTFF-750) achieves the highest mean QoE (136.125 ± 58.994), improves mean TTFF by 16.6% relative to the throughput-based RB baseline, and keeps mean rebuffering at 0.228 ± 0.556 s. On the full 40-trace set, SafeRL-TTFF-750 and RB are effectively tied in mean QoE, with the former trading slightly higher bitrate and lower TTFF for higher rebuffering. An ablation study shows that the safety supervisor is essential, and that stricter startup caps can reduce TTFF further with only small changes in scalar QoE. The results support a practical conclusion: learned ABR can be useful on mobile traces when RL decisions are wrapped in transparent safety and startup controls.
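The two deployment-oriented constraints described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the ladder values, the harmonic-mean throughput estimator, the `safety_factor` parameter, the behaviour when no throughput samples exist yet, and all function names are assumptions made for illustration. Only the 6-level ladder size, the throughput-based cap, and the 750 kbps startup cap that applies before playback begins come from the text.

```python
from typing import Optional, Sequence

# Hypothetical 6-level ladder in kbps; the abstract does not list the actual values.
LADDER_KBPS = (300, 750, 1200, 1850, 2850, 4300)


def harmonic_mean(samples: Sequence[float]) -> float:
    """Harmonic mean of recent throughput samples in kbps (an assumed estimator)."""
    positive = [s for s in samples if s > 0]
    if not positive:
        return 0.0
    return len(positive) / sum(1.0 / s for s in positive)


def highest_level_at_or_below(cap_kbps: float, ladder: Sequence[float]) -> int:
    """Index of the highest ladder rung not exceeding the cap; falls back to rung 0."""
    level = 0
    for i, rate in enumerate(ladder):
        if rate <= cap_kbps:
            level = i
    return level


def apply_constraints(
    policy_level: int,
    recent_throughput_kbps: Sequence[float],
    playback_started: bool,
    ladder: Sequence[float] = LADDER_KBPS,
    safety_factor: float = 1.0,
    startup_cap_kbps: Optional[float] = 750.0,
) -> int:
    """Wrap the learned policy's choice in a safety cap and an optional startup cap."""
    level = policy_level

    # Safety supervisor: never exceed the online throughput estimate.
    # Assumption: with no samples yet, this cap is skipped.
    if recent_throughput_kbps:
        estimate = harmonic_mean(recent_throughput_kbps)
        level = min(level, highest_level_at_or_below(safety_factor * estimate, ladder))

    # Startup cap: applies only before playback begins, to limit TTFF.
    if not playback_started and startup_cap_kbps is not None:
        level = min(level, highest_level_at_or_below(startup_cap_kbps, ladder))

    return level
```

In this sketch the policy's proposal is only ever lowered, never raised, so the learned controller keeps its freedom below both caps while the wrapper stays easy to inspect. Passing `startup_cap_kbps=None` disables the startup cap, which corresponds to treating it as optional.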







