An Empirical Comparison of Discrete Video Tokenization Schemes for Video Question Answering and Video Captioning

Mingzhuo Yu; Zan Li

doi:10.69987/AIMLR.2025.60203

Authors

Mingzhuo Yu, Computer Science, Northeastern University, MA, USA
Zan Li, School of Journalism and Communication, Peking University, Beijing, China

DOI:

https://doi.org/10.69987/AIMLR.2025.60203

Keywords:

Video Tokenization, Discrete Representation Learning, Video Question Answering, Video Captioning

Abstract

Discrete video tokenization has become a central design choice in recent multimodal large language models, shifting visual understanding from continuous patch features toward compact sequences of code indices. While rapid progress has been demonstrated on generative benchmarks, the impact of tokenizer choice on discriminative downstream tasks such as video question answering and video captioning remains under-quantified. This paper presents a controlled empirical comparison of nine discrete video tokenization schemes (VQ-VAE, VQ-GAN, VideoGPT, MAGVIT, MAGVIT-v2, OmniTokenizer, LARP, TiTok, and Cosmos Tokenizer), evaluated under a shared decoder-only language backbone on five open-ended video question answering benchmarks (MSRVTT-QA, MSVD-QA, ActivityNet-QA, NExT-QA, TGIF-QA) and two captioning benchmarks (MSR-VTT, VATEX). Across 63 tokenizer–dataset pairs, modern lookup-free and finite-scalar quantization schemes outperform the classic VQ-VAE baseline by 7.6 to 9.5 accuracy points on short-video question answering and by 10.8 to 12.3 CIDEr points on captioning, while holistic query-based tokens deliver the largest gains on causal and temporal reasoning. A factor analysis separates the effect of codebook size from that of spatial-temporal compression, revealing a sweet spot around 65K codebook entries at an 8×8×4 compression ratio, beyond which downstream accuracy plateaus or regresses. The results supply practical guidance for tokenizer selection in video–language pipelines.

Author Biography

Zan Li, School of Journalism and Communication, Peking University, Beijing, China
