Semantic Verifier for Post-hoc Answer Validation in Chat Platforms: Claim Decomposition, Evidence Retrieval, NLI, and Traceable Citations

Xiaofei Luo

doi:10.69987/JACS.2024.40306

Authors

Xiaofei Luo, Information Science, University of Illinois at Urbana-Champaign, IL, US

Keywords:

post-hoc verification, fact checking, FEVER, evidence retrieval, natural language inference, calibrated confidence, traceable citations, hallucination reduction

Abstract

Large language model (LLM) assistants are increasingly deployed in consumer and enterprise chat platforms, yet their fluent outputs can include unsupported statements that reduce user trust. This paper presents a platform-level “semantic verifier” that performs post-hoc answer validation: it decomposes an assistant response into atomic claims, retrieves external evidence, applies natural language inference (NLI) to judge each claim, and returns a traceable set of citations with calibrated confidence estimates. We implement a reproducible end-to-end verifier and evaluate it on the FEVER benchmark, reporting Label Accuracy, Evidence F1, and the FEVER Score (strict correctness, which requires both a correct label and a complete evidence set). Because the publicly distributed FEVER splits provide gold evidence as Wikipedia page titles and sentence IDs, our retriever indexes evidence-page titles from the training split and predicts evidence sentence IDs using page-specific priors; the verifier’s NLI module uses a lightweight log-linear classifier trained on claims. On the FEVER shared-task development set (19,998 claims), the end-to-end system achieves a FEVER Score of 0.1696, Label Accuracy of 0.5246, and Evidence F1 of 0.0514 under the official scorer. We further analyze confidence thresholding, calibration (ECE = 0.143), and the impact of limiting evidence to the top-k sentences. Although the title-only evidence approximation constrains retrieval quality, the experiments quantify practical trade-offs that matter for platform integration: when to abstain, how to surface citations, and how the confidence threshold trades precision against recall. The verifier design generalizes to full-text Wikipedia retrieval, web evidence, and multi-hop reasoning, enabling scalable reductions in hallucination and improved transparency in chat products.
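
To make the relationship between Label Accuracy and the strict FEVER Score concrete, the following is a minimal sketch in Python of how the two metrics can be computed per claim. It is an illustration rather than the paper's implementation or the official scorer: the function names, the dictionary keys (`label`, `evidence`, `evidence_sets`), and the toy data are all assumptions introduced here. It follows the usual FEVER convention that a verifiable claim counts as strictly correct only if the predicted label matches and at least one complete gold evidence set is contained in the first five predicted evidence sentences.

```python
from typing import Dict, List, Tuple

# An evidence item is a (Wikipedia page title, sentence ID) pair.
Evidence = Tuple[str, int]


def is_strictly_correct(
    pred_label: str,
    pred_evidence: List[Evidence],
    gold_label: str,
    gold_evidence_sets: List[List[Evidence]],
    max_evidence: int = 5,
) -> bool:
    """Strict correctness: right label AND one full gold evidence set retrieved."""
    if pred_label != gold_label:
        return False
    # NOT ENOUGH INFO claims have no gold evidence, so the label alone decides.
    if gold_label == "NOT ENOUGH INFO":
        return True
    predicted = set(pred_evidence[:max_evidence])
    return any(set(group) <= predicted for group in gold_evidence_sets)


def fever_metrics(predictions: List[Dict], gold: List[Dict]) -> Dict[str, float]:
    """Return Label Accuracy and strict FEVER Score over aligned lists."""
    n = len(gold)
    label_hits = 0
    strict_hits = 0
    for pred, ref in zip(predictions, gold):
        if pred["label"] == ref["label"]:
            label_hits += 1
        if is_strictly_correct(
            pred["label"], pred["evidence"], ref["label"], ref["evidence_sets"]
        ):
            strict_hits += 1
    return {"label_accuracy": label_hits / n, "fever_score": strict_hits / n}


# Toy data (hypothetical, for illustration only).
gold = [
    {"label": "SUPPORTS", "evidence_sets": [[("A", 0)], [("B", 1), ("B", 2)]]},
    {"label": "REFUTES", "evidence_sets": [[("C", 3)]]},
    {"label": "NOT ENOUGH INFO", "evidence_sets": []},
]
predictions = [
    {"label": "SUPPORTS", "evidence": [("B", 1), ("B", 2)]},  # label and evidence right
    {"label": "REFUTES", "evidence": [("C", 4)]},             # label right, evidence wrong
    {"label": "SUPPORTS", "evidence": []},                    # label wrong
]
metrics = fever_metrics(predictions, gold)
```

In this toy example two of the three labels are correct but only one claim also has a complete evidence set, so the FEVER Score is lower than Label Accuracy. The same effect explains how a system can reach moderate Label Accuracy while its FEVER Score stays low when evidence retrieval is weak.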