Hallucination Detection and Confidence Calibration for Large Language Model Outputs: Reproducible Experiments on HaluEval
DOI: https://doi.org/10.69987/AIMLR.2025.60401

Keywords: Hallucination detection, confidence calibration, expected calibration error, reliability diagram, large language models

Abstract
Large language models (LLMs) can generate fluent yet unsupported content (“hallucinations”), which undermines trust and complicates downstream decision making. Beyond detection accuracy, practical systems require calibrated confidence so that thresholds, abstention, and verification policies behave predictably. This paper reports fully reproducible experiments on the HaluEval benchmark [1] using the provided snapshot containing 64,507 labeled examples across four domains (QA, dialogue, summarization, and general user queries). We formulate hallucination recognition as binary classification given an input context (e.g., knowledge snippet or source document), a user query (when available), and a model-generated answer.
We evaluate several lightweight detectors based on TF-IDF and linear models, and we study post-hoc calibration using temperature scaling and (global and domain-conditional) Platt scaling. Our best system uses a linear SVM on 30k unigram TF-IDF features augmented with 11 overlap/length features, followed by domain-conditional Platt scaling trained on a held-out validation set. On the test split, it achieves AUROC 0.835 and F1 0.751 (threshold tuned on validation), while attaining an expected calibration error (ECE) of 0.009 with 15 bins. By contrast, applying a simple sigmoid to raw SVM scores attains AUROC 0.822 but yields substantially worse calibration (ECE 0.080). Across domains, we observe clear prior shifts—most notably in the general subset with only 18.1% hallucinated responses—which motivates domain-aware calibration and domain-specific operating points.
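The expected calibration error reported above (15 equal-width bins) can be sketched as follows. This is a minimal NumPy implementation of the standard binned ECE over the detector's predicted positive-class probabilities, not the authors' exact code; the function name and signature are illustrative.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - confidence|.

    probs  -- predicted probability of the positive (hallucinated) class
    labels -- binary ground-truth labels (1 = hallucinated)
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(probs)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # half-open bins (lo, hi], with the first bin closed at 0
        mask = (probs > lo) & (probs <= hi) if lo > 0 else (probs >= lo) & (probs <= hi)
        if not mask.any():
            continue
        conf = probs[mask].mean()   # mean predicted probability in the bin
        acc = labels[mask].mean()   # empirical positive rate in the bin
        ece += (mask.sum() / n) * abs(acc - conf)
    return ece
```

A perfectly calibrated detector (predicted probability equals the empirical positive rate in every bin) attains ECE 0; a detector that always predicts 0.9 on examples that are never hallucinated attains ECE 0.9.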

