Behavior-Level Jailbreak Resistance via Multi-Stage Refusal + Utility Preservation

Daren Zheng; Chenyu Li

doi:10.69987/JACS.2024.40107

Authors

Daren Zheng, Information Technology, Carnegie Mellon University, PA, USA
Chenyu Li, Applied Analytics, Columbia University, NY, USA

Keywords:

jailbreaks, runtime guardrails, refusal strategies, prompt injection, safety evaluation, behavior taxonomy, explainability, utility preservation

Abstract

Runtime safety guardrails for large language models are routinely evaluated at the prompt level, yet deployment failures often manifest as behavior-level bypasses: adversaries reshape prompts (e.g., via persona prompts, prompt-injection wrappers, or character-level obfuscation) to elicit a prohibited behavior while preserving surface plausibility. This paper presents BehaviorGuard, a behavior-level threat-modeling guardrail that binds each input to a tuple (goal, behavior, category) and then enforces a multi-stage refusal policy that preserves utility through safe alternatives and explicit reason codes. We evaluate BehaviorGuard on two public datasets: JailbreakBench/JBB-Behaviors (100 paired harmful and benign behaviors; 200 prompts) and Do-Not-Answer (939 harmful instructions with a 12-type harm taxonomy) [9]. BehaviorGuard implements (i) normalization that removes instruction-steering wrappers and recovers the goal text, (ii) behavior-pair matching that distinguishes harmful from benign intents that differ only minimally, and (iii) a de-obfuscation path for spaced-character attacks. Across the combined benchmark (1,139 clean prompts), BehaviorGuard reduces jailbreak success (the share of harmful prompts allowed) from 9.4% under a static denylist-retrieval baseline to 0.0% while preserving a 99% benign pass rate on JBB-Behaviors. Under an obfuscation attack set (2,278 prompts), the baseline blocks all benign requests (0% benign pass rate), whereas BehaviorGuard maintains a 100% benign pass rate and 0.0% jailbreak success. We additionally quantify refusal explainability (reason-code accuracy and stability) and show that multi-stage refusals increase safe-alternative coverage from 0% to 100% without leaking actionable harmful details.
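
To make the spaced-character de-obfuscation step and the two headline metrics concrete, the following is a minimal Python sketch, not the BehaviorGuard implementation. The names `collapse_spaced_chars`, `Outcome`, `jailbreak_success_rate`, and `benign_pass_rate` are hypothetical, the three-character minimum run length is an assumed heuristic, and the metric denominators (harmful prompts and benign prompts, respectively) are our reading of the definitions stated in the abstract.

```python
import re
from dataclasses import dataclass
from typing import Iterable

# Assumed heuristic: a run of 3+ single non-space characters, each separated
# by exactly one space (e.g. "p o e m"), is treated as an obfuscated word.
_SPACED_RUN = re.compile(r"(?<!\S)(?:\S ){2,}\S(?!\S)")


def collapse_spaced_chars(text: str) -> str:
    """Join spaced-out characters into words, then normalize whitespace.

    Limitation: if words are also separated by a single space, word
    boundaries cannot be recovered and the whole run is merged.
    """
    collapsed = _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)
    return " ".join(collapsed.split())


@dataclass(frozen=True)
class Outcome:
    is_harmful: bool  # ground-truth label from the benchmark
    allowed: bool     # guardrail decision: True if the prompt was let through


def jailbreak_success_rate(outcomes: Iterable[Outcome]) -> float:
    """Share of harmful prompts that the guardrail allowed."""
    harmful = [o for o in outcomes if o.is_harmful]
    return sum(o.allowed for o in harmful) / len(harmful) if harmful else 0.0


def benign_pass_rate(outcomes: Iterable[Outcome]) -> float:
    """Share of benign prompts that the guardrail allowed."""
    benign = [o for o in outcomes if not o.is_harmful]
    return sum(o.allowed for o in benign) / len(benign) if benign else 0.0
```

In this sketch, an input such as "w r i t e  a  p o e m" (double spaces between words) is restored to "write a poem" before any matching is applied, while ordinary text such as "I am a cat" passes through unchanged. Restoring the text before matching is one plausible way for a guardrail to avoid blocking obfuscated benign requests wholesale, as the baseline is reported to do.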