A Comparative Evaluation of LLM-Generated Semantic Tags versus Classical Text Features (TF-IDF, LDA, BERT Embeddings) for User-Interest Enrichment in Short-Video Recommendation
DOI:
https://doi.org/10.69987/AIMLR.2024.50111
Keywords:
short-video recommendation, user-interest features, large language models, feature comparison
Abstract
User-interest features derived from textual metadata are central to short-video recommendation, yet the relative value of different semantic signal sources has not been rigorously quantified on modern public benchmarks. This paper presents a controlled comparative evaluation of four families of textual features used to enrich user-interest vectors in short-video feeds: sparse TF-IDF-weighted keywords, Latent Dirichlet Allocation (LDA) topic distributions, sentence-level BERT embeddings, and tag sets generated by an open-source large language model (LLM). Using KuaiRand-Pure, MicroLens-100K, and KuaiRec as primary datasets and MIND-small as a cross-domain probe, we assess each signal along four axes: feature separability; mutual information with downstream behaviour; accuracy on next-item and click-through-rate (CTR) prediction tasks using SASRec, BERT4Rec, DIN, DIEN, and SIM backbones; and offline computation cost. LLM-generated tags attain the best downstream accuracy in most settings, with AUC gains of 0.9 to 1.6 points over TF-IDF on CTR tasks and HR@10 gains of 2.1 to 3.4 points on sequential tasks, while BERT embeddings remain competitive when the LLM inference budget is constrained. LDA is dominated on every axis except interpretability. The observed gains are moderate rather than uniform, and we identify scenarios in which sparse lexical features remain Pareto-optimal once serving latency is accounted for.
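
For readers less familiar with the sequential metric quoted above, the following is a minimal sketch of how a hit ratio at cutoff K (HR@K) and a gain "in points" are commonly computed. It is illustrative only: the function name `hit_ratio_at_k`, the toy rankings, and the cutoff K = 2 are assumptions made for this example, and the sketch does not reproduce the paper's evaluation protocol (for instance, whether candidates are ranked against the full catalogue or against sampled negatives).

```python
from typing import Sequence


def hit_ratio_at_k(
    ranked_items: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int = 10,
) -> float:
    """Fraction of test cases whose held-out target is among the top-k ranked items."""
    if len(ranked_items) != len(targets):
        raise ValueError("ranked_items and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1 for ranking, target in zip(ranked_items, targets) if target in ranking[:k]
    )
    return hits / len(targets)


# Toy data (made up for illustration): one ranked list per user, best item first.
targets = ["b", "f", "g", "x"]
baseline_rankings = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["j", "k", "l"]]
enriched_rankings = [["b", "a", "c"], ["f", "d", "e"], ["h", "g", "i"], ["j", "k", "l"]]

hr_baseline = hit_ratio_at_k(baseline_rankings, targets, k=2)
hr_enriched = hit_ratio_at_k(enriched_rankings, targets, k=2)

# A gain "in points" is the absolute difference on a 0-100 scale.
gain_points = 100 * (hr_enriched - hr_baseline)
print(f"HR@2 baseline={hr_baseline:.2f} enriched={hr_enriched:.2f} gain={gain_points:.1f} points")
```

Under this reading, a gain of 2.1 points on HR@10 would correspond to an absolute increase of 0.021 in the fraction of test interactions whose target item is ranked in the top ten.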

