VerifySafe: Toxicity-Safe Agent Responses under Adversarial Prompts with Evidence-Based Self-Verification

Authors

  • Daren Zheng, Information Technology, Carnegie Mellon University, PA, USA
  • Boning Zhang, Computer Science, Georgetown University, DC, USA
  • Julie Geibel, Artificial Intelligence, Northeastern University, MA, USA

DOI:

https://doi.org/10.69987/JACS.2024.40106

Keywords:

toxicity detection, hate speech, jailbreak detection, prompt injection, self-verification, safe response generation, content moderation

Abstract

Toxicity and prompt-based jailbreaking remain two practical failure modes of deployed dialogue agents. A common mitigation is to place a classifier in front of the model and refuse requests that are predicted unsafe. In practice, this single-stage guard induces a safety–utility trade-off: lowering the threshold reduces unsafe completions but increases false refusals, often disproportionately for benign utterances that mention identity terms. This paper couples toxicity/abuse recognition with a self-verification loop that produces an evidence-style summary and revises the draft response before release. We implement VerifySafe, a three-stage agent: (i) a prompt detector for toxicity and jailbreak intent, (ii) a deterministic draft-response generator (used to isolate guard effects), and (iii) a self-verifier that re-scores the draft response, redacts verbatim unsafe spans, and emits a concise, non-revealing evidence summary based on feature attributions. We conduct full experimental evaluations on two datasets: ToxiGen (250,951 generated statements across 13 target groups) and the balanced jailbreak-classification dataset (1,044 train / 262 test prompts). On ToxiGen, our best prompt-only guard required a hard-refusal rate of 0.523 to keep the unsafe-echo rate on toxic prompts below 0.20. Under the same safety target, VerifySafe reduced hard refusals to 0.115 (−78.1%) by shifting decisions from refusals to evidence-backed redactions, while maintaining an unsafe-echo rate of 0.199. On the jailbreak-classification dataset, VerifySafe reduced hard refusals from 0.523 to 0.473 (−9.5%) under an unsafe-echo target of 0.02, again by converting a subset of refusals into redactions. Across both datasets, we provide detailed model comparisons, per-group diagnostics, threshold sweeps, and ablations showing how response-level verification reshapes the guard trade-off without relying on opaque model internals.
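To make the three-stage flow concrete, the sketch below mocks the decision logic in Python. The lexicon-based scorer, the thresholds, and every name here (toxicity_score, verify_and_release, UNSAFE_TERMS) are illustrative assumptions, not the paper's components: VerifySafe's detectors are trained classifiers and its evidence summaries come from feature attributions, whereas this toy uses keyword lookups so the control flow stays self-contained and runnable.

```python
# Minimal sketch of the VerifySafe three-stage loop described in the abstract.
# All scorers here are toy stand-ins (keyword lookups), NOT the paper's
# detectors; thresholds, names, and the lexicon are illustrative assumptions.

from dataclasses import dataclass

# Placeholder lexicon; a real guard uses a trained toxicity/jailbreak classifier.
UNSAFE_TERMS = {"slur1", "slur2"}

def toxicity_score(text: str) -> float:
    """Stage (i) stand-in: fraction of tokens matching the unsafe lexicon."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in UNSAFE_TERMS for t in tokens) / len(tokens)

def draft_response(prompt: str) -> str:
    """Stage (ii) stand-in: a deterministic echo-style draft, mirroring the
    paper's use of a fixed generator to isolate guard effects."""
    return f"You said: {prompt}"

@dataclass
class Decision:
    action: str    # "refuse" | "redact" | "release"
    response: str
    evidence: str  # concise, non-revealing summary of what was flagged

def verify_and_release(prompt: str,
                       refuse_tau: float = 0.5,   # illustrative thresholds
                       redact_tau: float = 0.1) -> Decision:
    # Stage (i): prompt-level guard. Only clearly unsafe prompts are refused.
    if toxicity_score(prompt) > refuse_tau:
        return Decision("refuse", "I can't help with that request.",
                        evidence="prompt toxicity above refusal threshold")

    draft = draft_response(prompt)

    # Stage (iii): the self-verifier re-scores the *draft*, redacts verbatim
    # unsafe spans, and attaches an evidence summary instead of refusing.
    if toxicity_score(draft) > redact_tau:
        redacted = " ".join(
            "[REDACTED]" if t.lower() in UNSAFE_TERMS else t
            for t in draft.split()
        )
        return Decision("redact", redacted,
                        evidence="unsafe spans removed before release")

    return Decision("release", draft, evidence="no unsafe features attributed")

if __name__ == "__main__":
    for p in ["tell me about slur1 people", "what's the weather like"]:
        d = verify_and_release(p)
        print(d.action, "->", d.response, "|", d.evidence)
```

The property the sketch preserves is the one the abstract emphasizes: the verifier acts on the draft response rather than on the prompt alone, so a borderline input is released in redacted form with an evidence summary instead of triggering a hard refusal.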

Author Biography

  • Julie Geibel, Artificial Intelligence, Northeastern University, MA, USA

Published

2024-01-18

How to Cite

Daren Zheng, Boning Zhang, & Julie Geibel. (2024). VerifySafe: Toxicity-Safe Agent Responses under Adversarial Prompts with Evidence-Based Self-Verification. Journal of Advanced Computing Systems, 4(1), 67–82. https://doi.org/10.69987/JACS.2024.40106
