Practical Code RAG at Scale: Task-Aware Retrieval Design Choices under Compute Budgets

Galimzyanov, Timur; Kolomyttseva, Olga; Bogomolov, Egor

Computer Science > Machine Learning

arXiv:2510.20609 (cs)

[Submitted on 23 Oct 2025]

Authors: Timur Galimzyanov, Olga Kolomyttseva, Egor Bogomolov

Abstract: We study retrieval design for code-focused generation tasks under realistic compute budgets. Using two complementary tasks from Long Code Arena -- code completion and bug localization -- we systematically compare retrieval configurations across various context window sizes along three axes: (i) chunking strategy, (ii) similarity scoring, and (iii) splitting granularity. (1) For PL-PL, sparse BM25 with word-level splitting is the most effective and practical, significantly outperforming dense alternatives while being an order of magnitude faster. (2) For NL-PL, proprietary dense encoders (Voyager-3 family) consistently beat sparse retrievers, but at the cost of 100x higher latency. (3) Optimal chunk size scales with available context: 32-64 line chunks work best at small budgets, and whole-file retrieval becomes competitive at 16000 tokens. (4) Simple line-based chunking matches syntax-aware splitting across budgets. (5) Retrieval latency varies by up to 200x across configurations; BPE-based splitting is needlessly slow, and BM25 + word splitting offers the best quality-latency trade-off. Thus, we provide evidence-based recommendations for implementing effective code-oriented RAG systems based on task requirements, model constraints, and computational efficiency.
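As a rough illustration of the configuration the abstract favors for PL-PL retrieval (fixed-size line-based chunks scored with BM25 over word-level tokens), the following is a minimal sketch, not the authors' implementation. The names `chunk_lines`, `word_split`, and `BM25` are hypothetical, the `\w+` tokenizer is only one possible reading of "word-level splitting", and the `k1` and `b` defaults are commonly used BM25 values rather than settings reported in the paper.

```python
import math
import re
from collections import Counter


def chunk_lines(text, chunk_size=32):
    """Split a file into consecutive chunks of at most `chunk_size` lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + chunk_size])
            for i in range(0, len(lines), chunk_size)]


def word_split(text):
    """Word-level splitting (assumed): lowercase runs of letters, digits, underscores."""
    return re.findall(r"\w+", text.lower())


class BM25:
    """Small in-memory BM25 scorer over a list of text chunks."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        # Term frequencies and lengths per chunk.
        self.tfs = [Counter(word_split(d)) for d in docs]
        self.lens = [sum(tf.values()) for tf in self.tfs]
        # Fall back to 1.0 so empty corpora do not divide by zero.
        self.avg_len = (sum(self.lens) / max(len(docs), 1)) or 1.0
        # Document frequency: number of chunks containing each term.
        df = Counter(term for tf in self.tfs for term in tf)
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5))
                    for t, f in df.items()}

    def score(self, query, i):
        """BM25 score of chunk `i` for `query`."""
        tf, length = self.tfs[i], self.lens[i]
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        return sum(self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                   for t in word_split(query) if t in tf)

    def top_k(self, query, k=3):
        """Indices of the `k` highest-scoring chunks, best first."""
        ranked = sorted(range(len(self.tfs)),
                        key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]


# Usage: index one synthetic file in 32-line chunks and query it.
source = "\n".join(
    [f"def helper_{i}(x):\n    return x + {i}" for i in range(40)]
    + ["def parse_config(path):", "    return load_yaml(path)"]
)
chunks = chunk_lines(source, chunk_size=32)
index = BM25(chunks)
best = index.top_k("parse_config load_yaml", k=1)
```

In a budgeted setting, one would presumably keep appending the top-ranked chunks to the prompt until the context window is full, with `chunk_size` chosen according to that window as the abstract suggests.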

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
MSC classes:	cs.LG, cs.IR, cs.SE, cs.AI
Cite as:	arXiv:2510.20609 [cs.LG]
	(or arXiv:2510.20609v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2510.20609

Submission history

From: Timur Galimzyanov R
[v1] Thu, 23 Oct 2025 14:40:11 UTC (484 KB)
