ConRAG: Contradiction-Aware Retrieval-Augmented Generation under Multi-Source Conflicting Evidence
DOI: https://doi.org/10.69987/JACS.2024.40705

Keywords: retrieval-augmented generation, contradiction detection, natural language inference, evidence structuring, citation evaluation, hallucination robustness

Abstract
Retrieval-augmented generation (RAG) grounds language-model outputs in external evidence, but it often fails when the retrieved material contains genuine disagreements. In multi-source environments, a retriever can return passages that are all relevant yet mutually inconsistent. A standard generator may then merge incompatible evidence into a single narrative, leading to self-contradictions, unstable stance decisions, and citations that are difficult to verify. We propose ConRAG, a contradiction-aware RAG framework that makes conflict explicit and actionable. ConRAG consists of two coordinated stages. The analysis stage (A-stage) tags each retrieved passage with an NLI-style relation to the query (Support, Refute, or Irrelevant), clusters passages into internally consistent evidence groups, and computes a conflict score that quantifies disagreement strength. The generation stage (G-stage) follows a constrained protocol: it first outputs an evidence table, then adjudicates the stance with calibrated uncertainty, and finally generates an answer where every nontrivial sentence is bound to traceable citations.
We define an evaluation suite spanning stance correctness and evidence quality (FEVER, SciFact), citation precision and recall (ALCE), and hallucination robustness (RAGTruth). We implement ConRAG end-to-end and conduct full empirical evaluations on the official splits of these benchmarks. All tables and figures report measured results obtained from actual system runs under a fixed and reproducible evaluation protocol (consistent preprocessing, identical retrieval/generation budgets across methods, and controlled random seeds).
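To make the A-stage concrete, the following is a minimal illustrative sketch of how passages tagged with NLI-style relations could be grouped into evidence groups and scored for disagreement. The conflict formula 2·min(s, r)/(s + r), the function names, and the label set are assumptions for illustration, not the paper's actual definitions.

```python
from collections import Counter, defaultdict

# Hypothetical label set for NLI-style relations to the query.
LABELS = ("Support", "Refute", "Irrelevant")

def group_evidence(tagged_passages):
    """Cluster passages into internally consistent evidence groups.

    This toy version groups purely by relation label; a real system would
    additionally cluster by content within each stance.
    tagged_passages: list of (passage_text, label) pairs.
    """
    groups = defaultdict(list)
    for text, label in tagged_passages:
        groups[label].append(text)
    return dict(groups)

def conflict_score(tagged_passages):
    """Quantify disagreement strength between Support and Refute evidence.

    Assumed formula (not from the paper): 2 * min(s, r) / (s + r) over
    stance-bearing passages, so 0.0 means no conflict and 1.0 means a
    perfectly balanced conflict. Irrelevant passages are ignored.
    """
    counts = Counter(label for _, label in tagged_passages)
    s, r = counts["Support"], counts["Refute"]
    if s + r == 0:
        return 0.0
    return 2 * min(s, r) / (s + r)
```

For example, two supporting passages against one refuting passage yield a conflict score of 2/3, signaling a real but asymmetric disagreement that the G-stage would surface in its evidence table before adjudicating a stance.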
