Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis

Borse, Nikhil Sanjay; Subramaniam, Ravishankar Chatta; Rebello, N. Sanjay


arXiv:2508.14764 (physics)

[Submitted on 20 Aug 2025 (v1), last revised 31 Aug 2025 (this version, v2)]

Authors: Nikhil Sanjay Borse, Ravishankar Chatta Subramaniam, N. Sanjay Rebello

Abstract: Qualitative analysis is typically limited to small datasets because it is time-intensive. Moreover, a second human rater is required to ensure reliable findings. Artificial intelligence tools may replace human raters if we demonstrate high reliability compared to human ratings. We investigated the inter-rater reliability of state-of-the-art Large Language Models (LLMs), ChatGPT-4o and ChatGPT-4.5-preview, in rating audio transcripts coded manually. We explored prompts and hyperparameters to optimize model performance. The participants were 14 undergraduate student groups from a university in the midwestern United States who discussed problem-solving strategies for a project. We prompted an LLM to replicate manual coding, and calculated Cohen's Kappa for inter-rater reliability. After optimizing model hyperparameters and prompts, the results showed substantial agreement ($\kappa > 0.6$) for three themes and moderate agreement on one. Our findings demonstrate the potential of GPT-4o and GPT-4.5 for efficient, scalable qualitative analysis in physics education and identify their limitations in rating domain-general constructs.
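The abstract reports agreement as Cohen's Kappa, which compares the observed agreement between two raters, $p_o$, with the agreement expected by chance from each rater's label frequencies, $p_e$, as $\kappa = (p_o - p_e)/(1 - p_e)$. The following is a minimal sketch of that calculation, not the authors' code. The function name `cohens_kappa`, the binary present/absent coding scheme, and the example ratings are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same code by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the two raters' marginal frequencies, summed over labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical label throughout; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical codes for one theme (assumed scheme): 1 = present in a segment, 0 = absent.
human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
llm = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]

kappa = cohens_kappa(human, llm)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the raters agree on 9 of 10 segments ($p_o = 0.9$) and chance agreement is $p_e = 0.5$, so $\kappa$ is about 0.8. Under the commonly used Landis and Koch convention, values from 0.61 to 0.80 are read as substantial agreement and 0.41 to 0.60 as moderate, which matches the wording used in the abstract.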

Comments:	7 pages, 4 figures, Physics Education Research Conference 2025
Subjects:	Physics Education (physics.ed-ph)
Cite as:	arXiv:2508.14764 [physics.ed-ph]
 	(or arXiv:2508.14764v2 [physics.ed-ph] for this version)
	https://doi.org/10.48550/arXiv.2508.14764

Submission history

From: Nikhil Borse
[v1] Wed, 20 Aug 2025 15:12:52 UTC (361 KB)
[v2] Sun, 31 Aug 2025 00:33:38 UTC (361 KB)
