Test-in-the-loop LLM Repair: Verifiable Automated Program Repair on QuixBugs with a “Failing Test → Patch → Regression Test” Loop

Authors

  • Yunhe Li, Computer and Information Technology, University of Pennsylvania, PA, USA

DOI:

https://doi.org/10.69987/JACS.2024.40206

Keywords:

Automated program repair, test-driven repair, test-in-the-loop, large language models, QuixBugs, patch overfitting, regression testing, generate-and-validate

Abstract

Automated Program Repair (APR) has long promised to reduce debugging cost by synthesizing patches that satisfy a test oracle. Recent large language models (LLMs) have revived interest in APR because they can propose semantically rich edits, yet their outputs often remain unverifiable unless integrated with an execution-based validation loop. In this paper we study a concrete and fully reproducible variant of test-driven LLM repair, which we call Test-in-the-loop LLM Repair (TiLLR). TiLLR explicitly closes the loop “failing tests → candidate patch → regression tests → next patch” and treats tests as the ground-truth accept/reject criterion. We evaluate TiLLR on QuixBugs, a 40-program bilingual benchmark with failing and passing tests for Python and Java [1]. To make the end-to-end evaluation deterministic and runnable without external services, we instantiate the “LLM” component as a lightweight code language model used solely for ranking and selecting candidate edits from a template-driven operator space. The repair operators implement common one-line fixes (operator substitution, variable swap, off-by-one adjustment, and a constrained statement insertion), while localization uses failing-test stack traces to prioritize suspicious lines. Across 40 Python tasks, TiLLR produces 6 plausible patches (15.0% success rate) within a budget of 10 test-and-repair iterations. A single-shot ranking baseline (LM-1shot) achieves 3/40 (7.5%) and a pure rule/template baseline (RuleRepair) achieves 4/40 (10.0%) only when allowed a much larger search budget (50 candidates). TiLLR’s main advantage is sample efficiency: it reaches 10.0% after two attempts and 15.0% by six attempts, while RuleRepair does not succeed until attempt 10. Using a held-out regression split of passing tests, we observe zero overfitting among the plausible patches that are validated, and the successful patches are compact (mean unified-diff size: 2 changed lines). 
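The "failing tests → candidate patch → regression tests → next patch" loop described above can be sketched as follows. This is a toy, self-contained illustration, not the paper's implementation: "programs" are one-line expressions, the two template operators stand in for the full operator space, and a simple fail-count score stands in for the lightweight code LM ranker. All names (`run_tests`, `OPERATORS`, `rank`, `tillr`) are ours.

```python
def run_tests(src, tests):
    """Return the subset of (input, expected) tests that fail on the candidate."""
    f = eval("lambda x: " + src)  # toy 'program' = a one-line expression
    failed = []
    for inp, expected in tests:
        try:
            ok = f(inp) == expected
        except Exception:
            ok = False
        if not ok:
            failed.append((inp, expected))
    return failed

# Template-driven repair operators: common one-line fixes.
OPERATORS = [
    lambda s: s.replace("-", "+"),  # operator substitution
    lambda s: s.replace("1", "2"),  # off-by-one adjustment
]

def rank(candidates, score):
    """Stand-in for the LM ranker: select the highest-scoring candidate edit."""
    return max(candidates, key=score)

def tillr(src, failing, regression, budget=10):
    """Test-in-the-loop repair: tests are the accept/reject criterion."""
    for _ in range(budget):
        if not run_tests(src, failing):
            # Plausible patch: accept only if the held-out regression split
            # of passing tests still passes (guards against overfitting).
            return src if not run_tests(src, regression) else None
        candidates = [op(src) for op in OPERATORS]
        # Toy 'LM' score: prefer candidates that fail fewer of the failing tests.
        src = rank(candidates, lambda c: -len(run_tests(c, failing)))
    return None

# Buggy program: should compute x + 1 but computes x - 1.
patched = tillr("x - 1", failing=[(3, 4), (0, 1)], regression=[(10, 11)])
```

Here the first iteration proposes "x + 1" and "x - 2", the ranker prefers "x + 1" (zero remaining failures), and the regression check confirms it, so the loop returns the patched expression after one attempt.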
We further analyze performance by defect category, show success-vs-budget curves, and provide a detailed audit table of the repaired tasks. Overall, our findings support a pragmatic view: when tightly coupled with a test-in-the-loop validation protocol, even modest generative models can yield measurable, verifiable APR gains on small algorithmic programs, and the loop structure provides a clean axis to trade compute budget for reliability.
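The patch-compactness metric above (changed lines in a unified diff) can be computed with Python's standard `difflib`; the helper name is an illustrative assumption:

```python
import difflib

def changed_lines(before: str, after: str) -> int:
    """Count added/removed lines in a unified diff of two sources,
    excluding the '---'/'+++' file-header lines."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(), lineterm="")
    return sum(1 for ln in diff
               if ln.startswith(("+", "-")) and not ln.startswith(("+++", "---")))

# A one-line fix touches two diff lines: the removed original and the added fix.
size = changed_lines("a = x - 1\nreturn a", "a = x + 1\nreturn a")
```

Under this counting, the mean patch size of 2 changed lines reported above corresponds to a typical single-line edit (one removed line plus one added line).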

Author Biography

  • Yunhe Li, Computer and Information Technology, University of Pennsylvania, PA, USA



Published

2024-02-18

How to Cite

Yunhe Li. (2024). Test-in-the-loop LLM Repair: Verifiable Automated Program Repair on QuixBugs with a “Failing Test → Patch → Regression Test” Loop. Journal of Advanced Computing Systems, 4(2), 62-75. https://doi.org/10.69987/JACS.2024.40206
