A Comparative Empirical Study of Over-Refusal Behavior in Closed-Source Large Language Models on Pseudo-Harmful Prompts

Xuanyi Fu; Danbing Zou

doi:10.69987/AIMLR.2025.60303

Authors

Xuanyi Fu, M.S.E. in Computer Science, Johns Hopkins University, MD, USA
Danbing Zou, Computer Science and Technology, Wuhan University, Wuhan, China

DOI:

https://doi.org/10.69987/AIMLR.2025.60303

Keywords:

Large language models, Over-refusal, Pseudo-harmful prompts, Behavioral evaluation

Abstract

Closed-source large language models increasingly mediate information access across consumer and enterprise applications, yet repeated reports of false refusals on benign questions suggest that alignment procedures may have over-corrected toward exaggerated safety. This paper presents a comparative empirical study of over-refusal behavior across six closed-source API-accessible LLMs — GPT-4o, GPT-4-Turbo, Claude-3.5-Sonnet, Claude-3-Opus, Gemini-1.5-Pro, and Gemini-1.5-Flash — using four publicly released benchmarks: XSTest (450 contrasting prompts), OR-Bench-Hard-1K together with its 600-prompt toxic control, PHTest (3,260 pseudo-harmful prompts), and CoCoNot (1,001 evaluation plus 379 contrast prompts). Refusal decisions are classified by the WildGuard refusal judge and audited against GPT-4o-mini on a stratified 500-sample split, with Cohen's κ agreement of 0.83. Across approximately 50,000 API responses, Claude-3-Opus exhibits the highest False Refusal Rate (FRR = 52.4% on OR-Bench-Hard-1K), while GPT-4o attains the lowest (18.5%); Unsafe Compliance Rate remains below 5.2% for all six models on OR-Bench-Toxic. Category-level analysis reveals that privacy-adjacent and figurative-language prompts dominate over-refusal triggers, and a linguistic mutation study shows that Claude-3-Opus is approximately 2.7 times more sensitive to surface mutations than GPT-4o. The findings offer a reproducible behavioral benchmark for practitioners selecting closed-source APIs.
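The two headline quantities in the abstract, the False Refusal Rate and the Cohen's κ agreement between two refusal judges, can be illustrated with a minimal sketch. The abstract does not spell out the exact definitions used in the study, so this sketch assumes that FRR is the fraction of benign prompts whose response a judge labels as a refusal, and that κ is the standard two-rater statistic. The function names and the toy labels are hypothetical and are not data from the paper.

```python
from collections import Counter


def false_refusal_rate(refused_flags):
    """Fraction of benign prompts whose response was judged a refusal.

    Assumption: every entry corresponds to a benign (pseudo-harmful) prompt,
    and True means the judge classified the response as a refusal.
    """
    if not refused_flags:
        raise ValueError("need at least one judged response")
    return sum(1 for flag in refused_flags if flag) / len(refused_flags)


def cohen_kappa(labels_a, labels_b):
    """Standard Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items on which the two raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (made up for illustration): True = "response is a refusal".
primary_judge = [True, True, False, False, True, False, False, False]
audit_judge = [True, False, False, False, True, False, False, True]

frr = false_refusal_rate(primary_judge)
kappa = cohen_kappa(primary_judge, audit_judge)
print(f"FRR = {frr:.3f}, kappa = {kappa:.3f}")
```

In a setup like the one the abstract describes, the primary labels would come from the refusal judge over a full benchmark, while the audit labels would cover only the stratified subsample, so κ would be computed on that subsample alone.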
