TB-Free RTL Anomaly Detection for Early Chip Verification: A Reproducible Binary Benchmark and Baseline Study on CVDP

Chenyao Zhu; Jingyi Chen; Yibang Liu

doi:10.69987/JACS.2024.41206

Authors

Chenyao Zhu, Industrial Engineering & Operations Research, UC Berkeley, CA, USA
Jingyi Chen, Electrical and Computer Engineering, Carnegie Mellon University, PA, USA
Yibang Liu, Financial Engineering, Baruch College, NY, USA

Keywords:

RTL anomaly detection, early verification, testbench-free analysis, SystemVerilog, mutation testing, code review, bug-risk ranking, static machine learning

Abstract

Early digital verification is dominated by manual RTL review because simulation testbenches and assertions are typically incomplete or unavailable during pre-verification. This work studies testbench-free (TB-free) anomaly detection for SystemVerilog RTL as a practical surrogate for early bug-risk spotting: given an RTL snippet, a model assigns a probability that the snippet contains a defect-like anomaly, without executing the design. We construct a fully reproducible binary benchmark from the Comprehensive Verilog Design Problems (CVDP) corpus. From the non-agentic code comprehension tasks (v1.0.2), we extract 114 RTL files across 113 distinct problems, window them into 546 snippets (30 lines, stride 15), and create paired clean/buggy samples by injecting exactly one mutation per snippet, drawn from five mutation operators (constant flips, edge-sensitivity flips, equality/inequality flips, boolean operator flips, and nonblocking-to-blocking assignment changes). The resulting benchmark contains 1,092 labeled samples (546 clean, 546 buggy), a 50/50 class balance. We evaluate seven TB-free baselines under a strict group split by problem ID, so that no design contributes snippets to both training and test data: a heuristic risk score, a character 5-gram language model, a structural-feature logistic regressor, word-level TF-IDF linear models, and character-level TF-IDF models with and without structural features. The best-performing baseline, a character TF-IDF linear SVM, achieves AUROC 0.816, AUPRC 0.838, accuracy 0.791, and F1 0.772 on a held-out test split, with a bootstrap 95% confidence interval of [0.743, 0.880] for AUROC. These results quantify how much risk-ranking signal is available from RTL text alone and establish a transparent evaluation scaffold for LLM-assisted RTL review that does not depend on a testbench.
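
The construction described in the abstract (30-line windows with stride 15, one injected mutation per snippet, and a split grouped by problem) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The regular expressions, the tail-window handling, the random choice of operator and mutation site, the `test_frac` value, and every function name (`window_lines`, `inject_one`, `build_pairs`, `group_split`) are assumptions made for illustration; only the window size, the stride, and the five operator families come from the abstract.

```python
import random
import re

WINDOW = 30  # lines per snippet (stated in the abstract)
STRIDE = 15  # offset between consecutive window starts (stated in the abstract)


def window_lines(lines, size=WINDOW, stride=STRIDE):
    """Cut a file into overlapping windows of `size` lines.

    Assumption: files shorter than `size` give one short window, and a
    trailing partial window is dropped. The abstract does not specify this.
    """
    last_start = max(len(lines) - size, 0)
    return [lines[s:s + size] for s in range(0, last_start + 1, stride)]


# Hypothetical regex approximations of the five operator families.
MUTATIONS = {
    "const_flip": (re.compile(r"1'b([01])"),
                   lambda m: "1'b" + ("1" if m.group(1) == "0" else "0")),
    "edge_flip": (re.compile(r"\b(posedge|negedge)\b"),
                  lambda m: "negedge" if m.group(1) == "posedge" else "posedge"),
    "eq_flip": (re.compile(r"(==|!=)"),
                lambda m: "!=" if m.group(1) == "==" else "=="),
    "bool_flip": (re.compile(r"(&&|\|\|)"),
                  lambda m: "||" if m.group(1) == "&&" else "&&"),
    # Heuristic: treat `<=` as nonblocking only when it follows a bare
    # left-hand side at the start of a line, so `if (a <= b)` is skipped.
    "nb_to_blocking": (re.compile(r"^([ \t]*[\w\[\]:]+[ \t]*)<=", re.M),
                       lambda m: m.group(1) + "="),
}


def inject_one(snippet, rng):
    """Apply exactly one mutation; return (buggy, operator) or None."""
    names = [n for n, (pat, _) in MUTATIONS.items() if pat.search(snippet)]
    if not names:
        return None
    name = rng.choice(names)
    pat, repl = MUTATIONS[name]
    site = rng.choice(list(pat.finditer(snippet)))
    return snippet[:site.start()] + repl(site) + snippet[site.end():], name


def build_pairs(rtl_text, seed=0):
    """Return (snippet, label, operator) triples: label 0 clean, 1 buggy."""
    rng = random.Random(seed)
    samples = []
    for win in window_lines(rtl_text.splitlines()):
        clean = "\n".join(win)
        mutated = inject_one(clean, rng)
        if mutated is None:  # no operator applies, so skip this window
            continue
        buggy, op = mutated
        samples.append((clean, 0, None))
        samples.append((buggy, 1, op))
    return samples


def group_split(problem_ids, test_frac=0.2, seed=0):
    """Return a per-sample test mask in which each problem is all-train or all-test.

    `test_frac` is an illustrative value, not taken from the paper.
    """
    groups = sorted(set(problem_ids))
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    return [pid in test_groups for pid in problem_ids]
```

Calling `build_pairs` on the text of one RTL file yields clean and buggy snippets in equal numbers, which is what produces the 50/50 class balance. Passing the problem ID of every sample to `group_split` keeps all windows from one design on the same side of the split.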