Risk-Calibrated Biomedical Search: Calibrated Selection of LLM-Style Query Expansions on BEIR TREC-COVID
DOI: https://doi.org/10.69987/JACS.2024.40406

Keywords: query expansion, uncertainty calibration, robust retrieval, selective prediction, biomedical information retrieval, TREC-COVID, BEIR, coverage–risk trade-off

Abstract
Query expansion is a long-standing technique for closing vocabulary gaps between short user queries and long biomedical documents. Large language models (LLMs) have recently renewed interest in expansion by generating fluent synonym lists, MeSH-style descriptors, and drug aliases; however, aggressive generation can introduce query drift, causing large per-topic failures that are unacceptable in high-stakes biomedical search. This paper presents Risk-Calibrated Query Expansion (RCQE), a selective expansion framework that treats expansion as a risk-aware decision: for each query we generate multiple plausible expansion candidates and learn a calibrated selector that either (i) chooses a candidate expected to improve retrieval, or (ii) abstains and keeps the original query. We conduct full experiments on BEIR TREC-COVID (171,332 documents; 50 topics; 66,336 judged query-document pairs) using a reproducible BM25 implementation. Across topics, a naive always-expand strategy improves average nDCG@10 from 0.549 to 0.580 but harms 20% of topics, including catastrophic failures. RCQE improves average nDCG@10 to 0.613 and MAP to 0.213 under 5-fold cross-validation while reducing the conditional harm probability among expanded topics from 0.20 to 0.13 at 46% coverage. Coverage–risk curves show that tightening the calibrated acceptance threshold yields monotonic risk reductions with graceful degradation in effectiveness. These results demonstrate that uncertainty calibration is a practical control knob for robust biomedical query expansion.
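The risk-aware decision described in the abstract — score each expansion candidate with a calibrated probability of improving retrieval, and abstain when no candidate clears an acceptance threshold — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the names `predict_improve_prob` and `tau` are hypothetical stand-ins for the learned calibrated selector and its acceptance threshold.

```python
def select_query(original_query, candidates, predict_improve_prob, tau=0.5):
    """Risk-calibrated selective expansion (illustrative sketch).

    predict_improve_prob(q, c) is assumed to return a calibrated
    probability that replacing query q with candidate expansion c
    improves retrieval (e.g., nDCG@10). tau is the acceptance
    threshold that trades coverage against conditional harm risk.
    """
    best, best_p = None, 0.0
    for cand in candidates:
        p = predict_improve_prob(original_query, cand)
        if p > best_p:
            best, best_p = cand, p
    if best is not None and best_p >= tau:
        return best          # expand: candidate expected to help
    return original_query    # abstain: keep the original query
```

Raising `tau` shrinks coverage (fewer topics are expanded) while reducing the probability of harming an expanded topic, which is the coverage–risk trade-off the paper's curves quantify.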