Reproducible Evidence-Centric Evaluation of Multi-Hop Retrieval-Augmented QA on MuSiQue

Authors

  • Harry Wilson, Statistics, University of Leeds, Leeds, UK
  • Leo Carter, Data Science, University of Leeds, Leeds, UK

DOI:

https://doi.org/10.69987/AIMLR.2025.60302

Keywords:

Retrieval-augmented generation, multi-hop question answering, evidence retrieval, MuSiQue, BM25, TF-IDF, evaluation

Abstract

Retrieval-augmented generation (RAG) is a natural fit for multi-hop question answering (QA) because it can explicitly retrieve and aggregate evidence across passages. However, the tight coupling between retrieval, evidence selection, and answer prediction complicates evaluation: a high answer score can mask missing evidence, and strong evidence recall can still fail if the reader is distractor-sensitive. This paper presents a fully reproducible, evidence-centric experimental study of multi-hop RAG-style pipelines on MuSiQue-Ans v1.0, a benchmark designed to require multi-hop reasoning. Using the official MuSiQue-Ans development split (2,417 questions; 20 candidate passages per question; 2–4 annotated supporting passages), we measure (i) answer Exact Match (EM) and token-level F1, (ii) retrieval Hit@k and evidence Recall@k, and (iii) answer containment in retrieved contexts. We implement lexical retrievers (BM25 and TF-IDF) and a deterministic lexical reader (LexR) that extracts candidate answer spans from the most query-overlapping sentences. On MuSiQue dev with k=10, BM25 achieves 95.95% Hit@10 and 69.11% evidence Recall@10, while producing 1.78% EM and 3.82% F1. Oracle retrieval that returns only supporting passages raises EM/F1 to 3.68%/8.75%, quantifying a large reader bottleneck even under perfect evidence. Detailed ablations, curves, runtime measurements, and an error taxonomy show that distractor passages degrade the reader as k increases and that retrieval misses explain 36.04% of BM25 failures at k=10. All results reported in this manuscript are empirically measured from the dataset and generated by a fixed-parameter, seeded evaluation pipeline.
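The evaluation metrics named in the abstract (answer EM and token-level F1, retrieval Hit@k, and evidence Recall@k) follow standard definitions. A minimal sketch of how they might be computed is below; the exact answer-normalization used by the paper's pipeline is not specified here, so SQuAD-style normalization (lowercasing, stripping punctuation and articles) is assumed, and all function names are illustrative.

```python
import re
import string
from collections import Counter

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and
    articles (a/an/the), and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    """Answer EM: 1.0 iff the normalized strings are identical."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Token-level F1 over the bag of normalized tokens."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def hit_at_k(retrieved_ids, supporting_ids, k):
    """Hit@k: 1.0 if any annotated supporting passage is in the top-k."""
    return float(any(pid in supporting_ids for pid in retrieved_ids[:k]))

def recall_at_k(retrieved_ids, supporting_ids, k):
    """Evidence Recall@k: fraction of the 2-4 annotated supporting
    passages that appear in the top-k retrieved passages."""
    found = sum(1 for pid in supporting_ids if pid in retrieved_ids[:k])
    return found / len(supporting_ids)
```

Per-question scores such as these would be averaged over the 2,417 dev questions to produce the corpus-level percentages reported above.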

Author Biography

  • Leo Carter, Data Science, University of Leeds, Leeds, UK

Published

2025-07-07

How to Cite

Harry Wilson, & Leo Carter. (2025). Reproducible Evidence-Centric Evaluation of Multi-Hop Retrieval-Augmented QA on MuSiQue. Artificial Intelligence and Machine Learning Review, 6(3), 18-33. https://doi.org/10.69987/AIMLR.2025.60302
