Semantic Source Code Segmentation using Small and Large Language Models

Dahou, Abdelhalim; Scherp, Ansgar; Kurten, Sebastian; Mathiak, Brigitte; Chauhan, Madhu

arXiv:2507.08992v1 (cs)

[Submitted on 11 July 2025]

Title: Semantic Source Code Segmentation using Small and Large Language Models

Authors: Abdelhalim Dahou, Ansgar Scherp, Sebastian Kurten, Brigitte Mathiak, Madhu Chauhan

Abstract: Source code segmentation, dividing code into functionally coherent segments, is crucial for knowledge retrieval and maintenance in software development. Although segmentation enables efficient navigation and comprehension of large codebases, manual and syntactic analysis approaches have become impractical as repositories grow, especially for low-resource languages like R and their research domains (e.g., social sciences, psychology). This paper introduces an automated, domain-specific approach for research R code segmentation using Large and Small Language Models (LLMs/SLMs). It presents two novel approaches and a human-annotated dataset, StatCodeSeg. We explore two distinct approaches: line-by-line analysis with context and range-based segment determination. We experiment with LLMs and fine-tuned SLMs. To support the generalizability of our approaches, we also include experiments on Python code from the computer science domain. Our results show that context-based line-by-line analysis is superior to range-based segmentation. Smaller language models like CodeBERT and an encoder-only version of CodeT5+ perform better than their LLM counterparts. Most notably, these two best-performing models, unlike the LLMs, did not see R code during pre-training and were only fine-tuned on 4,130 lines of manually annotated code.
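
The abstract names two output formulations, line-by-line labels and line ranges, without giving their details. The following is only an illustrative sketch of how the two representations could relate to each other, not the authors' implementation. All function names (`context_window`, `labels_to_ranges`, `ranges_to_labels`), the window size `k`, and the R snippet are hypothetical and chosen purely for illustration.

```python
from typing import List, Tuple


def context_window(lines: List[str], index: int, k: int = 2):
    """Return (lines before, target line, lines after) for the line at `index`.

    Assumption: a line-by-line classifier would see the target line together
    with up to `k` neighbouring lines on each side.
    """
    lo = max(0, index - k)
    hi = min(len(lines), index + k + 1)
    return lines[lo:index], lines[index], lines[index + 1:hi]


def labels_to_ranges(starts: List[bool]) -> List[Tuple[int, int]]:
    """Convert per-line 'starts a new segment' flags into inclusive,
    1-based (start, end) line ranges. The first line always opens a segment."""
    ranges = []
    seg_start = 1
    for line_no, is_start in enumerate(starts, start=1):
        if is_start and line_no > 1:
            ranges.append((seg_start, line_no - 1))
            seg_start = line_no
    if starts:
        ranges.append((seg_start, len(starts)))
    return ranges


def ranges_to_labels(ranges: List[Tuple[int, int]], n_lines: int) -> List[bool]:
    """Convert inclusive, 1-based (start, end) ranges back into per-line flags."""
    labels = [False] * n_lines
    for start, _end in ranges:
        labels[start - 1] = True
    return labels


# Hypothetical R script, one string per line.
r_lines = [
    "library(dplyr)",
    'df <- read.csv("data.csv")',
    "df <- filter(df, age > 18)",
    "model <- lm(y ~ x, data = df)",
    "summary(model)",
]
# Hypothetical per-line predictions: True marks the first line of a segment.
predicted_starts = [True, True, False, True, False]

segments = labels_to_ranges(predicted_starts)
before, target, after = context_window(r_lines, 2, k=1)
```

Under these assumptions, both formulations describe the same segmentation, so predictions from either style of model can be compared on a common footing by converting one representation into the other.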

Comments:	18 pages, 4 figures
Subjects:	Software Engineering (cs.SE); Computation and Language (cs.CL); Programming Languages (cs.PL)
Cite as:	arXiv:2507.08992 [cs.SE]
	(or arXiv:2507.08992v1 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2507.08992

Submission history

From: Abdelhalim Hafedh Dahou
[v1] Fri, 11 Jul 2025 19:49:59 UTC (804 KB)
