How Prompt Specificity Affects Edge Case Handling in LLM-Generated Code: An Empirical Evaluation
DOI:
https://doi.org/10.69987/AIMLR.2024.50411
Keywords:
large language models, code generation, prompt specificity, edge case evaluation
Abstract
Large language models perform strongly on code generation benchmarks, yet standard evaluations may overestimate their robustness because their test suites are too sparse to exercise edge cases. This study investigates how the level of prompt specificity influences the ability of LLMs to generate code that handles edge cases correctly. We define four incremental specificity levels, ranging from minimal function signatures to prompts containing explicit edge case hints, and evaluate four LLMs (GPT-4o, Claude 3.5 Sonnet, DeepSeek-Coder-V2, and Qwen2.5-Coder-32B) on the HumanEval+ benchmark, which augments the 164 HumanEval problems with approximately 80 times more test cases targeting boundary conditions. We introduce the Edge Pass Rate (EPR) metric to isolate edge case handling from general functional correctness. Our results show that increasing prompt specificity from minimal to edge-explicit yields a mean EPR improvement of 15.9 percentage points across all models, roughly 1.8 times the corresponding pass@1 gain of 8.9 points. Boundary value and negative number categories benefit most, while type coercion edge cases remain resistant to added specificity. Weaker models are more sensitive to prompt specificity, suggesting that investment in prompt design yields disproportionate returns when computational resources constrain the choice of LLM.
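The abstract does not give a formal definition of EPR, so the following Python sketch is only one plausible reading: it assumes EPR is the fraction of edge-case-labelled tests that pass, while pass@1 counts a problem as solved only when its single sampled solution passes every test. The `CaseResult` record, the `is_edge_case` label, and both function names are hypothetical and are not taken from the paper. The last line checks the ratio reported above.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One test outcome for one generated solution (hypothetical record)."""
    problem_id: str
    is_edge_case: bool  # assumed label; the paper's labelling scheme is not given here
    passed: bool


def edge_pass_rate(results):
    """Assumed EPR: share of edge-case tests that pass, pooled over all problems."""
    edge = [r for r in results if r.is_edge_case]
    if not edge:
        raise ValueError("no edge-case tests in results")
    return sum(r.passed for r in edge) / len(edge)


def pass_at_1(results):
    """pass@1 with one sample per problem: solved only if every test passes."""
    if not results:
        raise ValueError("no results")
    solved = {}
    for r in results:
        solved[r.problem_id] = solved.get(r.problem_id, True) and r.passed
    return sum(solved.values()) / len(solved)


# Toy data: problem "p1" passes everything, "p2" fails one edge-case test.
toy = [
    CaseResult("p1", False, True),
    CaseResult("p1", True, True),
    CaseResult("p2", False, True),
    CaseResult("p2", True, False),
    CaseResult("p2", True, True),
]
epr = edge_pass_rate(toy)
p1 = pass_at_1(toy)

# Ratio of the reported gains: 15.9 EPR points versus 8.9 pass@1 points.
gain_ratio = 15.9 / 8.9
```

Under this reading the two metrics can diverge: a single failing edge test zeroes a problem for pass@1 but lowers EPR only in proportion to the number of edge tests, which is why the two are worth reporting separately.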

