Continual Red-Teaming for In-the-Wild Jailbreaks via Online Guardrail Updates and Guardrail Distillation

Authors

  • Daren Zheng, Information Technology, Carnegie Mellon University, PA, USA
  • Chenyu Li, Applied Analytics, Columbia University, NY, USA
  • Harvey Davidson, Data Science, UCLA, CA, USA

DOI:

https://doi.org/10.69987/JACS.2023.30203

Keywords:

LLM safety, jailbreak detection, prompt injection, continual learning, online learning, knowledge distillation, red-teaming, distribution shift

Abstract

Jailbreak prompts and prompt-injection attacks evolve rapidly in the wild, while production guardrails are often trained offline and updated infrequently. This paper studies a continual red-teaming loop that couples (i) self-play attack generation, (ii) online updates of a guard model under non-stationary attack distributions, and (iii) distillation into a lightweight guard suitable for low-latency deployment. We target two jailbreak-prompt corpora commonly used in prior safety evaluations: in-the-wild-jailbreak-prompts (with dated benign/jailbreak splits) and WildJailbreak (a compact benchmark). Because the execution environment used for this manuscript cannot download the original Xet-backed Parquet artifacts, we instantiate a fully reproducible proxy corpus that matches the published split sizes and label schema and preserves the experimental conditions (benign vs. jailbreak, date-based shift, and cross-corpus distribution shift). We run end-to-end experiments with four baselines (rule matching, TF-IDF + logistic regression, a high-capacity character n-gram SVM teacher, and an online hashing-based classifier) and one distilled student. Across cross-distribution tests, the teacher trained on ITW achieves F1 = 0.943 on ITW_test and F1 = 0.984 on WJB_test, while the distilled student matches the teacher's F1 within 0.001 and reduces per-prompt CPU latency from 0.361 ms to 0.045 ms (batch = 32). In continual learning, an anchored online update rule reduces forgetting on the Phase-1 distribution, measured as the max-to-final F1 drop, from 0.121 (naive updates) to 0.064 (anchored). A self-play ablation on held-out mutated attacks increases TPR@0.5 from 0.810 (no self-play) to 0.998 (self-play with benign decoys) while keeping benign-decoy FPR@0.5 at 0.000. These results quantify a practical stability-plasticity-cost trade-off for continual safety red-teaming and motivate distillation as a deployment bridge from high-capacity red-team graders to lightweight on-device guardrails.
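The anchored online update in the abstract can be read as a replay-style rule: each gradient step on a newly observed attack is paired with a step on a sample drawn from a frozen Phase-1 "anchor" buffer, trading a little plasticity for stability on the old distribution. The sketch below is a minimal NumPy illustration under stated assumptions: the class name `AnchoredOnlineGuard`, the `anchor_weight` parameter, and the hashed character-trigram features are illustrative stand-ins, not the paper's exact implementation.

```python
import zlib

import numpy as np

DIM = 2 ** 12  # hashed feature space (illustrative size)


def featurize(text: str) -> np.ndarray:
    """Hashing-trick bag of character trigrams, L2-normalized."""
    v = np.zeros(DIM)
    for i in range(len(text) - 2):
        v[zlib.crc32(text[i:i + 3].encode()) % DIM] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v


class AnchoredOnlineGuard:
    """Online logistic-regression guard; each update replays one sample from
    a fixed Phase-1 anchor buffer to limit forgetting (hypothetical sketch)."""

    def __init__(self, lr: float = 0.5, anchor_weight: float = 0.5):
        self.w = np.zeros(DIM)
        self.b = 0.0
        self.lr = lr
        self.anchor_weight = anchor_weight
        self.anchors = []  # frozen Phase-1 (features, label) pairs

    def _sgd_step(self, x: np.ndarray, y: int, scale: float) -> None:
        p = 1.0 / (1.0 + np.exp(-(self.w @ x + self.b)))  # sigmoid score
        g = scale * (p - y)                               # logistic-loss gradient
        self.w -= self.lr * g * x
        self.b -= self.lr * g

    def update(self, text: str, label: int, rng: np.random.Generator) -> None:
        self._sgd_step(featurize(text), label, 1.0)
        if self.anchors:  # anchored replay: one extra step on old data
            ax, ay = self.anchors[rng.integers(len(self.anchors))]
            self._sgd_step(ax, ay, self.anchor_weight)

    def score(self, text: str) -> float:
        """Estimated P(jailbreak) for one prompt."""
        x = featurize(text)
        return float(1.0 / (1.0 + np.exp(-(self.w @ x + self.b))))
```

In this sketch the replay step pulls the weights back toward the Phase-1 decision boundary on every Phase-2 update, which is the stability side of the stability-plasticity-cost trade-off the abstract quantifies; setting `anchor_weight = 0` recovers the naive online baseline.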

Author Biography

  • Harvey Davidson, Data Science, UCLA, CA, USA

Published

2023-02-09

How to Cite

Daren Zheng, Chenyu Li, & Harvey Davidson. (2023). Continual Red-Teaming for In-the-Wild Jailbreaks via Online Guardrail Updates and Guardrail Distillation. Journal of Advanced Computing Systems, 3(2), 35-49. https://doi.org/10.69987/JACS.2023.30203
