Counterfactual Learning-to-Rank for Ads: Off-Policy Evaluation on the Open Bandit Dataset
DOI: https://doi.org/10.69987/JACS.2025.51201

Keywords: counterfactual learning-to-rank, off-policy evaluation, inverse propensity scoring, self-normalized IPS, doubly robust, slate recommendation, bandit feedback, advertising ranking

Abstract
Reliable offline evaluation is a central bottleneck in ad recommendation and ranking systems: online A/B experiments are expensive, slow, and risky, while naive offline replay is biased when logs are collected by a non-random policy. Counterfactual learning-to-rank (LTR) and off-policy evaluation (OPE) address this bottleneck by leveraging logged bandit feedback with known propensities. This paper presents a reproducible experimental study of inverse propensity scoring (IPS), self-normalized IPS (SNIPS), and doubly robust (DR) estimators, together with counterfactual policy construction, in a multi-position setting on the Open Bandit Dataset (OBD) released by ZOZO. We evaluate estimator behavior in cross-policy settings (Random ↔ Bernoulli Thompson Sampling), characterize heavy-tailed importance weights, and study robustness under propensity clipping. We further construct stochastic ranking policies from a fitted reward model, including a diversity-aware slate policy, and quantify the CTR–diversity trade-off via a Pareto analysis. Finally, we conduct a semi-synthetic evaluation that preserves real OBD covariates but simulates rewards from a learned environment, enabling bias–variance curves under known ground truth. Across experiments, self-normalization and doubly robust corrections improve stability; the dominant failure mode remains limited overlap, which produces heavy-tailed importance weights, and clipping mitigates the resulting variance at the cost of a controlled bias.
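As a minimal, self-contained sketch (not the paper's implementation), the estimators named above can be computed from logged tuples of action, reward, and logging propensity together with the target policy's action probabilities. The array names, the propensity-clipping convention, and the optional reward-model interface below are illustrative assumptions.

```python
import numpy as np

def ope_estimates(actions, rewards, propensities, pi_e, q_hat=None, clip=None):
    """Sketch of IPS, SNIPS, and DR value estimates from logged bandit feedback.

    actions      : (n,) int array, actions chosen by the logging policy
    rewards      : (n,) float array, observed rewards (e.g. clicks)
    propensities : (n,) float array, logging-policy probabilities of the logged actions
    pi_e         : (n, n_actions) float array, target-policy action probabilities per round
    q_hat        : optional (n, n_actions) float array, fitted reward model q(x, a)
    clip         : optional float, lower bound applied to the logged propensities
    """
    n = len(rewards)
    pscore = np.maximum(propensities, clip) if clip is not None else propensities
    # importance weights w_i = pi_e(a_i | x_i) / pi_0(a_i | x_i)
    w = pi_e[np.arange(n), actions] / pscore

    ips = np.mean(w * rewards)                # inverse propensity scoring
    snips = np.sum(w * rewards) / np.sum(w)   # self-normalized IPS

    dr = None
    if q_hat is not None:
        q_logged = q_hat[np.arange(n), actions]
        baseline = np.sum(pi_e * q_hat, axis=1)             # E_{a ~ pi_e}[q(x, a)]
        dr = np.mean(baseline + w * (rewards - q_logged))   # doubly robust
    return {"ips": ips, "snips": snips, "dr": dr}
```

On the Open Bandit Dataset the logged propensities are provided with the data, and the target policy's action probabilities can be obtained by simulating the evaluation policy on the same contexts; the clipping parameter trades variance for bias, as discussed in the abstract.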