LLM-Augmented Customer Representation Learning for Next-Purchase Prediction in Online Retail

Shenghan Lu; David Zhou

doi:10.69987/JACS.2023.30305

Authors

Shenghan Lu, Information Technology, Fordham University, NY, USA
David Zhou, Computer Science, UCLA, CA, USA

Keywords:

Customer representation learning, online retail, next-purchase prediction, RFM, LightGBM, Transformer encoder, large language

Abstract

This paper reports a reproducible empirical study of customer representation learning for next-purchase prediction and top-N recommendation in online retail. The experiments use the UCI Online Retail transaction data, whose raw file contains 541,909 transaction lines from a United Kingdom non-store retailer between December 2010 and December 2011. After removing cancellations, missing customer identifiers, non-positive quantities, and non-positive prices, the experimental table contains 397,884 positive known-customer transaction lines, 18,532 invoice identifiers, 4,338 customers, and 3,665 stock codes. We evaluate four customer-representation families: an RFM logistic baseline, an engineered LightGBM model, a compact Transformer encoder over purchase-token histories, and an LLM-persona representation that converts customer states into deterministic English persona text and embeds it through TF-IDF and singular-value decomposition. The binary task predicts whether a customer will make another purchase within 30 days of the current invoice state. The recommendation task ranks stock codes for the first post-cutoff basket. All results in the paper are produced by the attached code and data package. On the chronological test set, the RFM baseline obtains the highest next-purchase AUC of 0.740, the persona-augmented LightGBM obtains 0.724, engineered LightGBM obtains 0.719, and the compact Transformer obtains 0.588. For top-N recommendation, the RFM-popularity recommender obtains Hit@10 of 0.694, LightGBM obtains 0.679, persona augmentation obtains 0.658, and the compact Transformer obtains 0.256. These findings show that language-style personas add a small AUC gain to engineered trees (0.724 versus 0.719) while lowering Hit@10 (0.658 versus 0.679), and that simple recency and repeat-purchase signals remain the strongest predictors on this dataset.
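As an illustration of the filtering rules and the 30-day label described in the abstract, the following is a minimal sketch, not the code from the attached package. The field names (`invoice`, `customer_id`, `quantity`, `unit_price`, `timestamp`) and both function names are hypothetical. The sketch assumes that cancellations are marked by an invoice identifier beginning with "C", that "within 30 days" is inclusive, and that a customer's last observed invoice is labeled 0; the paper may treat censored end-of-window states differently.

```python
from datetime import datetime, timedelta


def clean_rows(rows):
    """Keep positive, known-customer, non-cancelled transaction lines."""
    kept = []
    for r in rows:
        # Assumption: a leading "C" on the invoice identifier marks a cancellation.
        if str(r["invoice"]).upper().startswith("C"):
            continue
        if not r["customer_id"]:  # missing customer identifier
            continue
        if r["quantity"] <= 0 or r["unit_price"] <= 0:
            continue
        kept.append(r)
    return kept


def repurchase_labels(rows, horizon_days=30):
    """Label each (customer, invoice) state with 1 if the same customer's
    next invoice starts within `horizon_days`, else 0."""
    # Collapse transaction lines to one (earliest) timestamp per invoice.
    invoice_time = {}
    for r in rows:
        key = (r["customer_id"], r["invoice"])
        invoice_time[key] = min(invoice_time.get(key, r["timestamp"]), r["timestamp"])

    # Group invoices by customer so they can be ordered in time.
    by_customer = {}
    for (cust, inv), ts in invoice_time.items():
        by_customer.setdefault(cust, []).append((ts, inv))

    horizon = timedelta(days=horizon_days)
    labels = {}
    for cust, items in by_customer.items():
        items.sort()
        for i, (ts, inv) in enumerate(items):
            has_next = i + 1 < len(items)
            # Assumption: the final observed invoice gets label 0 (no later purchase seen).
            labels[(cust, inv)] = int(has_next and items[i + 1][0] - ts <= horizon)
    return labels


# Toy rows (made up for illustration, not taken from the dataset).
rows = [
    {"invoice": "A1", "customer_id": "c1", "quantity": 6, "unit_price": 2.55,
     "timestamp": datetime(2010, 12, 1, 8, 26)},
    {"invoice": "CA2", "customer_id": "c2", "quantity": -1, "unit_price": 27.5,
     "timestamp": datetime(2010, 12, 1, 9, 41)},
    {"invoice": "A3", "customer_id": "", "quantity": 1, "unit_price": 1.0,
     "timestamp": datetime(2010, 12, 2, 10, 0)},
    {"invoice": "A4", "customer_id": "c3", "quantity": 3, "unit_price": 0.0,
     "timestamp": datetime(2010, 12, 3, 11, 0)},
    {"invoice": "A5", "customer_id": "c1", "quantity": 2, "unit_price": 3.0,
     "timestamp": datetime(2010, 12, 20, 10, 0)},
    {"invoice": "A6", "customer_id": "c1", "quantity": 1, "unit_price": 1.0,
     "timestamp": datetime(2011, 3, 1, 12, 0)},
]
cleaned = clean_rows(rows)
labels = repurchase_labels(cleaned)
```

In this toy input only customer `c1` survives cleaning. The first invoice is followed by another 19 days later, so it is labeled 1; the second is followed by a gap of more than 30 days, so it is labeled 0.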