VerifySafe: Toxicity-Safe Agent Responses under Adversarial Prompts with Evidence-Based Self-Verification

Authors

  • Daren Zheng, Information Technology, Carnegie Mellon University, PA, USA
  • Boning Zhang, Computer Science, Georgetown University, DC, USA
  • Julie Geibel, Artificial Intelligence, Northeastern University, MA, USA

DOI:

https://doi.org/10.69987/JACS.2024.40106

Keywords:

toxicity detection, hate speech, jailbreak detection, prompt injection, self-verification, safe response generation, content moderation

Abstract

Toxicity and prompt-based jailbreaking remain two practical failure modes of deployed dialogue agents. A common mitigation is to place a classifier in front of the model and refuse requests that are predicted unsafe. In practice, this single-stage guard induces a safety–utility trade-off: lowering the threshold reduces unsafe completions but increases false refusals, often disproportionately for benign utterances that mention identity terms. This paper couples toxicity/abuse recognition with a self-verification loop that produces an evidence-style summary and revises the draft response before release. We implement VerifySafe, a three-stage agent: (i) a prompt detector for toxicity and jailbreak intent, (ii) a deterministic draft-response generator (used to isolate guard effects), and (iii) a self-verifier that re-scores the draft response, redacts verbatim unsafe spans, and emits a concise, non-revealing evidence summary based on feature attributions. We conduct full experimental evaluations on two datasets: ToxiGen (250,951 generated statements across 13 target groups) and the balanced jailbreak-classification dataset (1,044 train / 262 test prompts). On ToxiGen, our best prompt-only guard required a hard-refusal rate of 0.523 to keep the unsafe-echo rate on toxic prompts below 0.20. Under the same safety target, VerifySafe reduced hard refusals to 0.115 (−78.1%) by shifting decisions from refusals to evidence-backed redactions, while maintaining an unsafe-echo rate of 0.199. On the jailbreak-classification dataset, VerifySafe reduced hard refusals from 0.523 to 0.473 (−9.5%) under an unsafe-echo target of 0.02, again by converting a subset of refusals into redactions. Across both datasets, we provide detailed model comparisons, per-group diagnostics, threshold sweeps, and ablations showing how response-level verification reshapes the guard trade-off without relying on opaque model internals.
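To make the three-stage flow concrete, the sketch below mocks the decision logic in Python. The lexicon-based scorer, the thresholds, and every name here (toxicity_score, verify_and_release, UNSAFE_TERMS) are illustrative assumptions, not the paper's components: VerifySafe's detectors are trained classifiers and its evidence summaries come from feature attributions, whereas this toy uses keyword lookups so the control flow stays self-contained and runnable.

```python
# Minimal sketch of the VerifySafe three-stage loop described in the abstract.
# All scorers here are toy stand-ins (keyword lookups), NOT the paper's
# detectors; thresholds, names, and the lexicon are illustrative assumptions.

from dataclasses import dataclass

# Placeholder lexicon; a real guard uses a trained toxicity/jailbreak classifier.
UNSAFE_TERMS = {"slur1", "slur2"}

def toxicity_score(text: str) -> float:
    """Stage (i) stand-in: fraction of tokens matching the unsafe lexicon."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in UNSAFE_TERMS for t in tokens) / len(tokens)

def draft_response(prompt: str) -> str:
    """Stage (ii) stand-in: a deterministic echo-style draft, mirroring the
    paper's use of a fixed generator to isolate guard effects."""
    return f"You said: {prompt}"

@dataclass
class Decision:
    action: str    # "refuse" | "redact" | "release"
    response: str
    evidence: str  # concise, non-revealing summary of what was flagged

def verify_and_release(prompt: str,
                       refuse_tau: float = 0.5,   # illustrative thresholds
                       redact_tau: float = 0.1) -> Decision:
    # Stage (i): prompt-level guard. Only clearly unsafe prompts are refused.
    if toxicity_score(prompt) > refuse_tau:
        return Decision("refuse", "I can't help with that request.",
                        evidence="prompt toxicity above refusal threshold")

    draft = draft_response(prompt)

    # Stage (iii): the self-verifier re-scores the *draft*, redacts verbatim
    # unsafe spans, and attaches an evidence summary instead of refusing.
    if toxicity_score(draft) > redact_tau:
        redacted = " ".join(
            "[REDACTED]" if t.lower() in UNSAFE_TERMS else t
            for t in draft.split()
        )
        return Decision("redact", redacted,
                        evidence="unsafe spans removed before release")

    return Decision("release", draft, evidence="no unsafe features attributed")

if __name__ == "__main__":
    for p in ["tell me about slur1 people", "what's the weather like"]:
        d = verify_and_release(p)
        print(d.action, "->", d.response, "|", d.evidence)
```

The property the sketch preserves is the one the abstract emphasizes: the verifier acts on the draft response rather than on the prompt alone, so a borderline input is released in redacted form with an evidence summary instead of triggering a hard refusal.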

Author Biography

  • Julie Geibel, Artificial Intelligence, Northeastern University, MA, USA

Published

2024-01-18

How to Cite

Daren Zheng, Boning Zhang, & Julie Geibel. (2024). VerifySafe: Toxicity-Safe Agent Responses under Adversarial Prompts with Evidence-Based Self-Verification. Journal of Advanced Computing Systems, 4(1), 67–82. https://doi.org/10.69987/JACS.2024.40106
