An Empirical Comparison of Discrete Video Tokenization Schemes for Video Question Answering and Video Captioning
DOI: https://doi.org/10.69987/AIMLR.2025.60203
Keywords: Video Tokenization, Discrete Representation Learning, Video Question Answering, Video Captioning
Abstract
Discrete video tokenization has become a central design choice in recent multimodal large language models, shifting visual understanding from continuous patch features toward compact sequences of code indices. While rapid progress has been demonstrated on generative benchmarks, the impact of tokenizer choice on discriminative downstream tasks such as video question answering and video captioning remains under-quantified. This paper presents a controlled empirical comparison of nine discrete video tokenization schemes — VQ-VAE, VQ-GAN, VideoGPT, MAGVIT, MAGVIT-v2, OmniTokenizer, LARP, TiTok, and Cosmos Tokenizer — evaluated under a shared decoder-only language backbone on five open-ended video question answering benchmarks (MSRVTT-QA, MSVD-QA, ActivityNet-QA, NExT-QA, TGIF-QA) and two captioning benchmarks (MSR-VTT, VATEX). Across 63 tokenizer–dataset pairs, modern lookup-free and finite-scalar quantization schemes outperform the classic VQ-VAE baseline by 7.6 to 9.5 accuracy points on short-video question answering and 10.8 to 12.3 CIDEr points on captioning, while holistic query-based tokens deliver the largest gains on causal and temporal reasoning. A factor analysis separates codebook size from spatial-temporal compression, revealing a moderate sweet spot around 65K codebook entries at an 8×8×4 compression ratio, beyond which downstream accuracy plateaus or regresses. The results supply practical guidance for tokenizer selection in video–language pipelines.
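To make the abstract's arithmetic concrete, the following minimal sketch checks the 63 tokenizer–dataset pairs and estimates the token budget at the reported sweet spot. Several details are assumptions rather than facts from the paper: that "8×8×4" means 8×8 spatial and 4× temporal downsampling, that "65K" means 2^16 = 65,536 entries, and that a clip is 16 frames at 256×256. The function name `token_budget` is hypothetical.

```python
import math
from itertools import product

# Tokenizers and benchmarks exactly as listed in the abstract.
TOKENIZERS = ["VQ-VAE", "VQ-GAN", "VideoGPT", "MAGVIT", "MAGVIT-v2",
              "OmniTokenizer", "LARP", "TiTok", "Cosmos Tokenizer"]
BENCHMARKS = ["MSRVTT-QA", "MSVD-QA", "ActivityNet-QA", "NExT-QA", "TGIF-QA",
              "MSR-VTT (captioning)", "VATEX (captioning)"]

# Every tokenizer is paired with every benchmark: 9 * 7 = 63 pairs.
PAIRS = list(product(TOKENIZERS, BENCHMARKS))


def token_budget(frames, height, width, codebook_size, ds_h=8, ds_w=8, ds_t=4):
    """Return (tokens per clip, bits per token, total bits per clip).

    Assumes a grid tokenizer that downsamples by ds_h x ds_w spatially and
    ds_t temporally, with partial blocks rounded up (an assumption; real
    tokenizers may pad or crop differently).
    """
    n_t = math.ceil(frames / ds_t)
    n_h = math.ceil(height / ds_h)
    n_w = math.ceil(width / ds_w)
    tokens = n_t * n_h * n_w
    bits_per_token = math.log2(codebook_size)  # information carried by one index
    return tokens, bits_per_token, tokens * bits_per_token


if __name__ == "__main__":
    print(len(PAIRS))
    # Hypothetical clip: 16 frames at 256x256, codebook of 2**16 entries.
    print(token_budget(frames=16, height=256, width=256, codebook_size=2 ** 16))
```

Under these assumed clip dimensions, the sweet-spot configuration yields a 4×32×32 grid of indices at 16 bits each. Holistic query-based tokenizers such as TiTok or LARP do not emit a spatial grid, so this estimate would not apply to them.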

