Investigating the Performance of Small Language Models in Detecting Test Smells in Manual Test Cases

Lucas, Keila; Gheyi, Rohit; Ribeiro, Márcio; Palomba, Fabio; Martins, Luana; Soares, Elvys

arXiv:2507.13035 (cs)

[Submitted on 17 Jul 2025]

Authors: Keila Lucas, Rohit Gheyi, Márcio Ribeiro, Fabio Palomba, Luana Martins, Elvys Soares

Abstract: Manual testing, in which testers follow natural language instructions to validate system behavior, remains crucial for uncovering issues not easily captured by automation. However, these test cases often suffer from test smells, quality issues such as ambiguity, redundancy, or missing checks that reduce test reliability and maintainability. While detection tools exist, they typically require manual rule definition and lack scalability. This study investigates the potential of Small Language Models (SLMs) for automatically detecting test smells. We evaluate Gemma3, Llama3.2, and Phi-4 on 143 real-world Ubuntu test cases, covering seven types of test smells. Phi-4 achieved the best results, reaching a pass@2 of 97% in detecting sentences with test smells, while Gemma3 and Llama3.2 reached approximately 91%. Beyond detection, SLMs autonomously explained issues and suggested improvements, even without explicit prompt instructions. They enabled low-cost, concept-driven identification of diverse test smells without relying on extensive rule definitions or syntactic analysis. These findings highlight the potential of SLMs as efficient tools that preserve data privacy and can improve test quality in real-world scenarios.
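
The abstract reports pass@2 scores but does not define the metric on this page. As a rough illustration only, the sketch below uses the pass@k estimator that is common in code-generation evaluation: given n attempts of which c are correct, it estimates the chance that at least one of k sampled attempts is correct. Whether the authors computed pass@2 this way is an assumption, and the names `pass_at_k` and `mean_pass_at_k` are hypothetical helpers, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one item.

    n: total attempts made for the item
    c: how many of those attempts were correct
    k: number of attempts sampled (without replacement)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k attempts are incorrect, any sample of k contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k sampled attempts are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over items, where each item is an (n, c) pair."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this reading, a sentence with a test smell would count as detected at pass@2 if at least one of two model attempts flags it correctly, and the reported percentage would be the average over all evaluated sentences.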

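Comments:	7 pages, accepted at the Insightful Ideas and Emerging Results (IIER) Track of the Brazilian Symposium on Software Engineering (SBES 2025)
Subjects:	Software Engineering (cs.SE)
Cite as:	arXiv:2507.13035 [cs.SE]
	(or arXiv:2507.13035v1 [cs.SE] for this version)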
	https://doi.org/10.48550/arXiv.2507.13035

Submission history

From: Keila Lucas
[v1] Thu, 17 Jul 2025 12:06:29 UTC (523 KB)