A Comparative Empirical Study of Prompting Strategies for Code Generation with Large Language Models

Fanyi Zhao; Mingzhuo Yu; Chuankai Luo

doi:10.69987/JACS.2025.51203

Authors

Fanyi Zhao, Computer Science, Stevens Institute of Technology, NJ, USA
Mingzhuo Yu, Computer Science, Northeastern University, MA, USA
Chuankai Luo, Electronic Information Engineering, Tsinghua University, Beijing, China

Keywords:

large language models, code generation, prompting strategies, empirical evaluation

Abstract

Large language models have demonstrated strong capabilities in automated code generation, yet the influence of prompting strategies on generation quality remains insufficiently characterized under controlled experimental conditions. This study presents a systematic comparative evaluation of five prompting strategies (direct instruction, few-shot, zero-shot chain-of-thought, few-shot chain-of-thought, and self-consistency) across seven code-oriented large language models spanning both open-source and proprietary families. Experiments are conducted on four established benchmarks: HumanEval, MBPP, HumanEval+, and MBPP+. Results measured by pass@1 indicate that self-consistency with ten sampling paths yields the highest average accuracy, reaching 82.3% on HumanEval with DeepSeek-Coder-33B-Instruct, while few-shot chain-of-thought offers the strongest single-pass performance. Smaller-parameter models exhibit larger relative gains from structured prompting: Code Llama-13B-Instruct improves by 8.5 percentage points from direct instruction to self-consistency on HumanEval, whereas a larger model such as GPT-4 gains a comparatively modest 3.7 percentage points under the same comparison. Evaluation on the more rigorous EvalPlus benchmarks (HumanEval+ and MBPP+) reveals consistent pass@1 reductions averaging 8.3 percentage points, indicating that the standard benchmarks overestimate functional correctness. A cost-effectiveness analysis shows that zero-shot chain-of-thought provides a favorable accuracy-to-cost trade-off for latency-sensitive deployments, while self-consistency is preferable when accuracy is prioritized over computational budget. These findings offer actionable guidance for selecting prompting strategies in practical code generation workflows.
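
For readers unfamiliar with the two quantities the abstract relies on, the following minimal Python sketch illustrates them. The `pass_at_k` function uses the widely used unbiased pass@k estimator introduced with HumanEval; for k = 1 it reduces to the fraction of correct samples. The `self_consistency_select` function is only one plausible reading of self-consistency for code (majority vote over a behavioral signature of each sampled candidate). The abstract does not describe the authors' aggregation procedure, so the function name, the `signature` callback, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from math import comb
from typing import Callable, Hashable, Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated, c: number that pass all tests,
    k: evaluation budget. Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def self_consistency_select(
    candidates: Sequence[str],
    signature: Callable[[str], Hashable],
) -> str:
    """Pick the candidate whose signature is most common among the samples.

    `signature` is a hypothetical callback (not from the paper) that maps a
    candidate program to something hashable, for example the tuple of
    outputs it produces on a few probe inputs. Ties go to the signature
    that appears first.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    signatures = [signature(candidate) for candidate in candidates]
    winner, _count = Counter(signatures).most_common(1)[0]
    return candidates[signatures.index(winner)]


# Toy usage: three sampled programs, two of which behave identically.
toy_outputs = {"prog_a": (1, 4), "prog_b": (1, 5), "prog_c": (1, 5)}
chosen = self_consistency_select(list(toy_outputs), signature=toy_outputs.get)
print(chosen)                    # the first member of the majority group
print(pass_at_k(n=10, c=3, k=1))  # fraction of correct samples when k = 1
```

Under this reading, a ten-path self-consistency run calls the model ten times per problem and submits the single selected program for scoring, which is why it costs roughly ten times as much as a single-pass strategy.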