StimulusRL: A Universal Deep Reinforcement Learning Stimulus Agent for Coverage-Driven Chip Design Verification

Jingyi Chen; Chenyao Zhu

doi:10.69987/JACS.2026.60104

Authors

Jingyi Chen, Electrical and Computer Engineering, Carnegie Mellon University, PA, USA
Chenyao Zhu, Industrial Engineering & Operations Research, UC Berkeley, CA, USA

Keywords:

design verification, stimulus generation, functional coverage, deep reinforcement learning, DQN, coverage-guided fuzzing, differential testing, cocotb, Verilator, UVM

Abstract

Modern chip design verification (DV) relies heavily on constrained-random simulation and manual testcase engineering to close functional coverage. This workflow is effective but grows increasingly expensive as designs scale and corner cases require long, protocol-valid stimulus sequences. We present StimulusRL, a universal deep reinforcement learning (RL) stimulus agent that learns to generate cycle-accurate stimuli from coverage feedback and differential bug oracles. StimulusRL formalizes stimulus generation as a Markov decision process (MDP) and trains a Deep Q-Network (DQN) policy that maps partial signal observations to legal stimulus actions. To support reproducible evaluation, we introduce DVSBench, a compact benchmark suite of five representative design-under-test (DUT) families (FIFO, ALU, cache, arbiter, and SPI controller), each with an explicit functional coverage model and three injected bug variants. We run every experiment with three independent seeds and a fixed 2,000-cycle budget, and compare StimulusRL against three baselines: uniform random, constrained-random verification (CRV), and coverage-guided mutation fuzzing (CGM-Fuzz). Across DVSBench, StimulusRL matches baseline final coverage on four DUTs and achieves comparable coverage area under the curve (AUC) on three DUTs, while providing a learnable interface that can be integrated into cocotb/Verilator/UVM flows. In differential bug-finding, StimulusRL reliably detects cache and arbiter defects and, when it succeeds, discovers SPI waveform mismatches faster than the baselines; however, it exhibits a lower success rate on the SPI controller, which motivates improved reward shaping and hierarchical action modeling. All numbers, tables, and figures in this paper are generated from deterministic scripts with released seeds.
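
To make the MDP framing concrete, the following is a minimal sketch of one coverage-driven episode, not the paper's implementation. Everything in it is an illustrative assumption: the `ToyFifo` model, its three coverage bins, the reward (number of newly hit bins per cycle), and the definition of coverage AUC as the mean of the per-cycle coverage fraction. A uniform random policy over legal actions stands in for the DQN, so no learning takes place here.

```python
import random


class ToyFifo:
    """Hypothetical depth-4 FIFO model with three coverage bins (assumed, not from the paper)."""

    DEPTH = 4
    NUM_BINS = 3

    def __init__(self):
        self.count = 0      # number of entries currently stored
        self.hit = set()    # coverage bins hit so far

    def legal_actions(self):
        # Action mask: never push when full or pop when empty,
        # so every generated stimulus stays protocol-valid.
        actions = ["idle"]
        if self.count < self.DEPTH:
            actions.append("push")
        if self.count > 0:
            actions.append("pop")
        return actions

    def step(self, action):
        before = len(self.hit)
        if action == "push":
            self.count += 1
        elif action == "pop":
            self.count -= 1
        if self.count == self.DEPTH:
            self.hit.add("full")
        if action == "pop" and self.count == 0:
            self.hit.add("drained")
        if action == "idle" and self.count > 0:
            self.hit.add("idle_nonempty")
        reward = len(self.hit) - before  # reward = newly hit bins this cycle
        return self.count, reward


def random_policy(obs, legal, rng):
    # Stand-in for a learned policy: uniform choice among legal actions.
    return rng.choice(legal)


def run_episode(policy, cycles, seed):
    rng = random.Random(seed)  # fixed seed for reproducibility
    dut = ToyFifo()
    obs, total_reward, curve = 0, 0, []
    for _ in range(cycles):
        action = policy(obs, dut.legal_actions(), rng)
        obs, reward = dut.step(action)
        total_reward += reward
        curve.append(len(dut.hit) / ToyFifo.NUM_BINS)
    return curve, total_reward


def coverage_auc(curve):
    # Assumed definition: mean coverage fraction over the cycle budget.
    return sum(curve) / len(curve)


if __name__ == "__main__":
    curve, total = run_episode(random_policy, cycles=2000, seed=0)
    print("final coverage:", curve[-1], "coverage AUC:", coverage_auc(curve))
```

Under these assumptions, final coverage is the last point of the curve, while coverage AUC rewards reaching the same bins earlier in the budget. This is why two methods can tie on final coverage yet differ on AUC.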