Findable then Explainable: Retrieval–Summary Integration for Code Intelligence on a Lightweight CodeSearchNet Subset
DOI: https://doi.org/10.69987/JACS.2024.40706

Keywords: code search, semantic retrieval, code summarization, retrieval-augmented generation, CodeSearchNet, BM25, TF-IDF, ROUGE, BLEU

Abstract
Code intelligence systems are often built as separate components for semantic code retrieval and code summarization. In practice, developers need both capabilities in a single interaction: first locate the right function (“findable”), then quickly understand it (“explainable”). This paper studies a retrieval–summary integrated pipeline that couples lexical code search with a lightweight retrieval-augmented summarizer. We conduct a full experimental evaluation on a Python function subset derived from CodeSearchNet with official train/validation/test splits (389,224/24,327/43,910 functions). Task A evaluates natural-language-to-function retrieval using MRR@k and NDCG@k under the common paired-pool protocol (2,000 query–code pairs sampled from the test split; one relevant function per query). Task B evaluates function-to-docstring summarization using ROUGE and BLEU on the same 2,000 functions, where the target is the first sentence of the reference docstring. Our retriever uses BM25 over identifier-aware code tokens and is optionally re-ranked by a tuned score fusion of BM25 and TF-IDF. On Task A, BM25 reaches MRR@10=0.6697 and NDCG@10=0.6998, while fusion re-ranking (α=0.9) yields NDCG@10=0.7000 and Recall@10=0.7985. For Task B, a name-based template summarizer attains ROUGE-1/2/L=0.2563/0.0640/0.2440, and a kNN retrieval-augmented summarizer with token filtering improves ROUGE-1 and ROUGE-L to 0.2795 and 0.2525, respectively. End-to-end, combining retrieval with RAG-filtered summaries produces ROUGE-1=0.2695 against the query docstrings. We further analyze precision–recall trade-offs, runtime efficiency, and qualitative failure modes, showing that lightweight RAG can improve surface-level summary faithfulness without requiring neural pretraining.
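The abstract describes re-ranking by an α-weighted score fusion of BM25 and TF-IDF with α=0.9. A minimal sketch of such a fusion step is shown below; the normalization scheme and function names (`minmax`, `fuse`, `rerank`) are assumptions for illustration, since the paper's abstract does not specify implementation details.

```python
def minmax(scores):
    """Scale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(bm25_scores, tfidf_scores, alpha=0.9):
    """Convex combination alpha*BM25 + (1 - alpha)*TF-IDF of normalized scores."""
    b = minmax(bm25_scores)
    t = minmax(tfidf_scores)
    return [alpha * bi + (1 - alpha) * ti for bi, ti in zip(b, t)]

def rerank(candidate_ids, bm25_scores, tfidf_scores, alpha=0.9):
    """Return candidate ids sorted by fused score, best first."""
    fused = fuse(bm25_scores, tfidf_scores, alpha)
    return [cid for _, cid in sorted(zip(fused, candidate_ids), reverse=True)]
```

With α=0.9, BM25 dominates and the TF-IDF component mainly breaks ties among lexically similar candidates: for BM25 scores `[1.0, 1.0, 0.0]` and TF-IDF scores `[0.0, 2.0, 1.0]`, the fused ranking places the second candidate first.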
