Risk-Aware Budget-Constrained Auto-Bidding under First-Price RTB: A Distributional Constrained Deep Reinforcement Learning Framework

Hanqi Zhang

doi:10.69987/JACS.2024.40603

Authors

Hanqi Zhang, Computer Science, University of Michigan at Ann Arbor, MI, USA

Keywords:

Real-time bidding, first-price auction, auto-bidding, budget constraint, pacing, risk-aware reinforcement learning, CVaR, distributional RL

Abstract

Real-time bidding (RTB) has become the dominant mechanism for programmatic display advertising, and the industry has migrated from second-price to first-price auctions. First-price auctions simplify settlement but fundamentally change the cost dynamics: the bidder pays its own bid, which amplifies overbidding losses, increases spend volatility, and makes smooth budget delivery (pacing) more difficult. Most of the academic auto-bidding literature assumes second-price payment, and many budget-constrained reinforcement learning (RL) methods optimize only expected performance, without explicit downside-risk control. This paper proposes RA-BCB, a risk-aware budget-constrained auto-bidding framework for first-price RTB. RA-BCB combines (i) a value model (pCTR/pCVR) trained from logged impression–click–conversion chains, (ii) an offline replay auction simulator that re-prices wins using first-price payment, and (iii) a distributional constrained RL agent that optimizes a Conditional Value-at-Risk (CVaR) objective under a daily budget constraint. The agent acts at a time-aggregated granularity (24 hourly slots), selecting a continuous bid multiplier that scales a base bid derived from predicted value. A dual (Lagrangian) update enforces the expected budget constraint, and an explicit pacing deviation penalty reduces intra-day spend variance. Offline replay experiments on the iPinYou RTB benchmark (nine campaigns with bid, impression, click, and conversion logs) and a semi-synthetic first-price evaluation built from the Criteo Display Ads (Kaggle 2014) click logs show that RA-BCB improves ROI and cost-efficiency while maintaining high budget utilization. Compared with the linear bidding baseline b_i = λ·ŷ_i, with λ selected on training logs to fully utilize the budget, RA-BCB increases weighted value by 43.1% at 50% budget, reduces eCPC by 30.6%, and improves the 10%-tail ROI (CVaR_0.1) by 2.42×, while producing near-linear pacing curves.
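To make the quantities named in the abstract concrete, the following is a minimal illustrative sketch, not the paper's implementation. It shows (a) a first-price replay of the linear baseline b_i = λ·ŷ_i under a budget, (b) a lower-tail CVaR estimate, and (c) one generic projected dual step for a budget constraint. The function names (replay_first_price, cvar_lower, dual_update), the hard budget cap, the win rule "bid >= logged market price", the budget-normalized step, and the learning rate are all assumptions made for illustration.

```python
import numpy as np


def replay_first_price(values, market_prices, multiplier, budget):
    """Replay logged auctions with first-price payment (illustrative only).

    values        : predicted value per impression (e.g. pCTR), used as base bid
    market_prices : highest competing price per impression taken from the logs
    multiplier    : bid multiplier (lambda in b_i = lambda * y_hat_i)
    budget        : hard spend cap (an assumption of this sketch)
    Returns (total value won, total spend).
    """
    spend, won_value = 0.0, 0.0
    for v, m in zip(values, market_prices):
        bid = multiplier * v
        # Win only if the bid beats the market price and is still affordable.
        if bid >= m and spend + bid <= budget:
            spend += bid        # first price: the winner pays its own bid
            won_value += v
    return won_value, spend


def cvar_lower(samples, alpha=0.1):
    """Empirical lower-tail CVaR: mean of the worst alpha-fraction of samples."""
    x = np.sort(np.asarray(samples, dtype=float))
    k = max(1, int(np.ceil(alpha * len(x))))
    return float(x[:k].mean())


def dual_update(lmbda, spend, budget, lr=0.1):
    """One projected gradient-ascent step on a Lagrange multiplier for
    the constraint spend <= budget (generic form, assumed here)."""
    return max(0.0, lmbda + lr * (spend - budget) / budget)
```

In this sketch, cvar_lower would be applied to per-episode ROI samples (for example, one ROI per replayed day) with alpha = 0.1 to obtain a 10%-tail ROI, and dual_update would be called once per episode with the realized spend.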