Evidence-Grounded Financial RAG: Reducing Numerical Hallucination in LLM-Generated Corporate Risk Memos

Qiyou Wu; Jingwen Bai; Xiaohan Zhou

doi:10.69987/JACS.2023.30306

Authors

Qiyou Wu, Construction Management, Northeast Forestry University, 150036, Harbin, China
Jingwen Bai, Data Science, Columbia University, NY, USA
Xiaohan Zhou, School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China

Keywords:

financial language models, retrieval-augmented generation, XBRL, SEC Financial Statement Data Sets, numerical reasoning, hallucination, trustworthy AI, corporate risk memo

Abstract

Large language models produce fluent corporate risk summaries, but a risk memo becomes unreliable when the generated text alters financial values, confuses reporting periods, or cites facts that do not support the statement. This paper evaluates an evidence-grounded retrieval-augmented generation (RAG) pipeline for numerical financial memo writing on the SEC Financial Statement Data Sets for 2023 Q1–Q2. The experiment constructs a fact-level evidence base from SEC-style SUB, NUM, TAG, and PRE tables, generates company-quarter risk memos under three settings (No-RAG, Plain RAG, and Verified RAG), and audits every financial claim using executable formulas and source-location checks. The evaluation covers 120 company-quarter memos and 720 audited claims per setting, 2,640 evidence chunks, and ten SIC-level industry groups, with deterministic code provided in the replication package. No-RAG generation had a 21.94% numeric error rate and a 100.00% evidence-grounding failure rate because it did not emit the required source identifiers. Plain RAG reduced numeric errors to 11.67% and citation errors to 18.19%, but still failed 28.61% of audited claims. Verified RAG, which constrained retrieval by filing metadata and recalculated ratios before finalizing text, reduced the audited claim-level error rate to 0.56%, eliminated numeric and citation errors in this run, and achieved a 96.67% memo-level all-claims pass rate. The results show that retrieval alone is insufficient for financial memo generation; a calculator-backed verifier is required to make generated ratios, year-over-year changes, and citations consistent with the evidence base.
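The claim audit described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' replication code: the `RatioClaim` schema, the `audit_ratio_claim` function, the outcome labels, and the 0.5% relative tolerance are all hypothetical. The evidence key reuses the `adsh`, `tag`, and `ddate` column names from the SEC NUM table; the accession number in the usage note is a placeholder.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Evidence key: (adsh, tag, ddate), named after columns of the SEC NUM table.
FactKey = Tuple[str, str, str]


@dataclass(frozen=True)
class RatioClaim:
    """A ratio stated in a generated memo (hypothetical schema)."""
    stated_value: float
    numerator: Optional[FactKey]    # cited source for the numerator, if any
    denominator: Optional[FactKey]  # cited source for the denominator, if any


def audit_ratio_claim(
    claim: RatioClaim,
    evidence: Dict[FactKey, float],
    rel_tol: float = 0.005,  # assumed tolerance, not taken from the paper
) -> str:
    """Recompute a cited ratio from the evidence base and label the claim."""
    # A claim without source identifiers cannot be grounded at all.
    if claim.numerator is None or claim.denominator is None:
        return "ungrounded"
    num = evidence.get(claim.numerator)
    den = evidence.get(claim.denominator)
    # A cited fact that is absent from the evidence base is a citation error.
    if num is None or den is None:
        return "citation_error"
    # A zero denominator cannot support any stated ratio.
    if den == 0:
        return "numeric_error"
    recomputed = num / den
    if not math.isclose(claim.stated_value, recomputed, rel_tol=rel_tol):
        return "numeric_error"
    return "pass"
```

For example, with an evidence base holding `Liabilities = 600.0` and `Assets = 1000.0` for the placeholder filing `"0000000000-23-000001"` at `ddate` `"20230331"`, a memo that states a liabilities-to-assets ratio of 0.60 passes, while one that states 0.65 is labeled a numeric error. Checking for missing source identifiers first mirrors the abstract's observation that No-RAG output fails grounding for every claim regardless of whether its numbers happen to be right.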