LLMShot: Reducing snapshot testing maintenance via LLMs

Kaynak, Ergün Batuhan; Lami, Mayasah; Moslemi, Sahand; Koyuncu, Anil


arXiv:2507.10062 (cs)

[Submitted on 14 Jul 2025]

Title: LLMShot: Reducing snapshot testing maintenance via LLMs

Authors: Ergün Batuhan Kaynak, Mayasah Lami, Sahand Moslemi, Anil Koyuncu


Abstract: Snapshot testing has emerged as a critical technique for UI validation in modern software development, yet it suffers from substantial maintenance overhead due to frequent UI changes causing test failures that require manual inspection to distinguish between genuine regressions and intentional design changes. This manual triage process becomes increasingly burdensome as applications evolve, creating a need for automated analysis solutions. This paper introduces LLMShot, a novel framework that leverages vision-based Large Language Models to automatically analyze snapshot test failures through hierarchical classification of UI changes. To evaluate LLMShot's effectiveness, we developed a comprehensive dataset using a feature-rich iOS application with configurable feature flags, creating realistic scenarios that produce authentic snapshot differences representative of real development workflows. Our evaluation using Gemma3 models demonstrates strong classification performance, with the 12B variant achieving over 84% recall in identifying failure root causes while the 4B model offers practical deployment advantages with acceptable performance for continuous integration environments. However, our exploration of selective ignore mechanisms revealed significant limitations in current prompting-based approaches for controllable visual reasoning. LLMShot represents the first automated approach to semantic snapshot test analysis, offering developers structured insights that can substantially reduce manual triage effort and advance toward more intelligent UI testing paradigms.

Comments:	Accepted at ICSME 2025
Subjects:	Software Engineering (cs.SE)
Cite as:	arXiv:2507.10062 [cs.SE]
 	(or arXiv:2507.10062v1 [cs.SE] for this version)
 	https://doi.org/10.48550/arXiv.2507.10062

Submission history

From: Anil Koyuncu
[v1] Mon, 14 Jul 2025 08:47:19 UTC (1,041 KB)
