Serving-Aware CTR Prediction: Embedding Compression and Interaction Distillation Under Memory and Latency Constraints

Authors

  • Hanqi Zhang, Computer Science, University of Michigan, Ann Arbor, MI, USA

DOI:

https://doi.org/10.69987/AIMLR.2026.70101

Keywords:

CTR prediction, embedding compression, feature hashing, INT8 quantization, knowledge distillation, serving latency, Pareto frontier, Criteo 1TB

Abstract

Click-through rate (CTR) prediction is a core component in modern advertising and recommender systems, yet the dominant industrial bottleneck is often not the lack of offline accuracy but the cost of serving: large embedding tables dominate memory footprint and bandwidth, while sophisticated feature-interaction modules increase tail latency and reduce throughput. This paper proposes a serving-aware experimental framework and a practical method that jointly address these constraints by combining (i) embedding compression and (ii) interaction knowledge distillation. For embeddings, we adopt a serving-first design based on per-field hash buckets with explicit capacity control, followed by post-training INT8 quantization to reduce memory and improve cache locality. For interactions, we train a high-capacity teacher model (DCN-V2 or AutoInt) and distill its interaction knowledge into a low-latency student (shallow interaction + MLP) using a logit-level distillation objective. We evaluate on the Criteo 1TB Click Logs dataset consisting of 24 daily files and more than 4 billion examples. Beyond standard offline metrics (AUC and LogLoss), we report serving metrics (embedding memory, total parameters, QPS, and p95 latency) in a unified benchmark harness and visualize accuracy–cost trade-offs with Pareto frontiers. The resulting analysis clarifies when embedding compression alone is sufficient, when distillation recovers accuracy at fixed cost, and how the combined approach moves the Pareto frontier under realistic memory and latency budgets.
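The three serving-side ingredients named in the abstract (per-field hash buckets with explicit capacity control, post-training INT8 quantization of embedding tables, and logit-level distillation from teacher to student) can be sketched minimally in NumPy. This is an illustrative sketch only: the CRC32 hash, per-row symmetric quantization, bucket capacity, and temperature value are assumptions for the example, not necessarily the paper's exact choices.

```python
import zlib
import numpy as np

def bucket_index(field: str, value: str, capacity: int) -> int:
    """Per-field feature hashing: map a (field, value) pair into a
    fixed-capacity bucket range, so each field's memory is capped."""
    key = f"{field}:{value}".encode("utf-8")
    return zlib.crc32(key) % capacity

def quantize_int8(table: np.ndarray):
    """Symmetric per-row post-training INT8 quantization of an
    embedding table; returns int8 codes plus per-row float scales."""
    scale = np.abs(table).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero rows
    q = np.clip(np.round(table / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate float embeddings at lookup time."""
    return q.astype(np.float32) * scale

def distill_loss(teacher_logit: np.ndarray, student_logit: np.ndarray,
                 temperature: float = 2.0) -> float:
    """Logit-level distillation for binary CTR: cross-entropy between
    temperature-softened teacher and student click probabilities."""
    pt = 1.0 / (1.0 + np.exp(-teacher_logit / temperature))
    ps = 1.0 / (1.0 + np.exp(-student_logit / temperature))
    eps = 1e-7
    ps = np.clip(ps, eps, 1.0 - eps)
    return float(-(pt * np.log(ps) + (1.0 - pt) * np.log(1.0 - ps)).mean())
```

In this sketch, capping `capacity` per field bounds embedding memory directly, quantization shrinks each row four-fold relative to float32, and the distillation term would be added to the student's standard log-loss during training.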

Author Biography

  • Hanqi Zhang, Computer Science, University of Michigan, Ann Arbor, MI, USA


Published

2026-01-05

How to Cite

Hanqi Zhang. (2026). Serving-Aware CTR Prediction: Embedding Compression and Interaction Distillation Under Memory and Latency Constraints. Artificial Intelligence and Machine Learning Review, 7(1), 1–15. https://doi.org/10.69987/AIMLR.2026.70101
