Controllable Long-Term User Memory for Multi-Session Dialogue: Confidence-Gated Writing, Time-Aware Retrieval-Augmented Generation, and Update/Forgetting

Authors

  • Xinzhuo Sun, Computer Engineering, Cornell Tech, NY, USA
  • Yifei Lu, Computer Science, UCSD, CA, USA
  • Jing Chen, Industrial Engineering and Operations Research, UCB, CA, USA

DOI:

https://doi.org/10.69987/JACS.2023.30802

Keywords:

long-term memory, personalization, multi-session dialogue, retrieval-augmented generation (RAG), time-aware retrieval, memory updating, forgetting

Abstract

Large language models (LLMs) are increasingly deployed in applications that span many days or months, where users expect the system to remember stable preferences (e.g., dietary restrictions), respect long‑term constraints (e.g., “do not mention my workplace”), and adapt when preferences change. However, naïvely concatenating all dialogue history is costly and often counterproductive: it inflates context length, amplifies irrelevant or outdated information, and can entrench earlier mistakes. This paper studies a practical long‑term user memory mechanism for multi‑turn and multi‑session dialogue, organized around three research questions: (1) controllable memory writing—how to extract “worth storing” user attributes while minimizing noise and hallucinated writes; (2) time‑aware memory retrieval—how to select a small set of relevant memories during generation rather than dumping all history into the prompt; and (3) memory update/forgetting—how to handle preference drift and explicit deletion without leaving the model stuck with stale beliefs. We propose TimeRAG‑Memory, a modular pipeline with confidence‑gated memory writing, an exponential time‑decay retriever that combines semantic relevance with recency and conflict penalties, and an update/forget module that supports overwrite, decay, and user‑requested deletion. We conduct end-to-end experiments on MSC and Persona-Chat using their official train/validation/test splits. For LaMP-1, we use the official benchmark release and report results on its publicly released labeled subset (N = 50). For each system variant, we generate model outputs and compute ROUGE-L, BERTScore, Recall@k/MRR, and task-specific metrics under identical decoding settings and fixed random seeds.
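To make the proposed scoring concrete, below is a minimal sketch of the confidence‑gated write and the time‑decayed retrieval score described in the abstract. This is not the released TimeRAG‑Memory implementation: the names (MemoryItem, maybe_write, retrieve) and all hyperparameters (threshold, decay_lambda, conflict_penalty) are illustrative assumptions, and the set of conflicting memory indices is assumed to be detected upstream by the update module.

```python
import math
import time
from dataclasses import dataclass

@dataclass
class MemoryItem:
    text: str
    embedding: list        # precomputed sentence embedding
    written_at: float      # Unix timestamp of the write
    confidence: float      # extractor confidence at write time
    deleted: bool = False  # set by user-requested deletion ("forget")

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def maybe_write(memories, text, emb, confidence, threshold=0.8):
    """Confidence-gated write: extractions below the gate are discarded,
    so noisy or hallucinated attributes never enter the store."""
    if confidence >= threshold:
        memories.append(MemoryItem(text, emb, time.time(), confidence))

def retrieve(memories, query_emb, k=5, now=None,
             decay_lambda=0.01, conflict_penalty=0.5, conflicted=frozenset()):
    """Rank memories by semantic similarity discounted exponentially by
    age in days, minus a penalty for items flagged as conflicting with a
    newer memory; user-deleted items never surface."""
    now = time.time() if now is None else now
    scored = []
    for i, m in enumerate(memories):
        if m.deleted:
            continue  # hard forgetting: deletion is a filter, not a score
        age_days = (now - m.written_at) / 86400.0
        recency = math.exp(-decay_lambda * age_days)  # exponential time decay
        penalty = conflict_penalty if i in conflicted else 0.0
        scored.append((cosine(query_emb, m.embedding) * recency - penalty, m))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [m for _, m in scored[:k]]
```

One design reading consistent with the abstract: user‑requested deletion acts as a hard filter, while time decay and conflict act as soft score adjustments, separating explicit forgetting from gradual staleness.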

Author Biography

  • Xinzhuo Sun, Computer Engineering, Cornell Tech, NY, USA

Published

2023-08-06

How to Cite

Xinzhuo Sun, Yifei Lu, & Jing Chen. (2023). Controllable Long-Term User Memory for Multi-Session Dialogue: Confidence-Gated Writing, Time-Aware Retrieval-Augmented Generation, and Update/Forgetting. Journal of Advanced Computing Systems, 3(8), 9-24. https://doi.org/10.69987/JACS.2023.30802