When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation

Badawi, Abeer; Rahimi, Elahe; Laskar, Md Tahmid Rahman; Grach, Sheri; Bertrand, Lindsay; Danok, Lames; Huang, Jimmy; Rudzicz, Frank; Dolatabadi, Elham

arXiv:2510.19032 (cs)

[Submitted on 21 Oct 2025]

Authors: Abeer Badawi, Elahe Rahimi, Md Tahmid Rahman Laskar, Sheri Grach, Lindsay Bertrand, Lames Danok, Jimmy Huang, Frank Rudzicz, Elham Dolatabadi

Abstract: Evaluating Large Language Models (LLMs) for mental health support is challenging due to the emotionally and cognitively complex nature of therapeutic dialogue. Existing benchmarks are limited in scale and reliability, often rely on synthetic or social media data, and lack frameworks to assess when automated judges can be trusted. To address the need for large-scale dialogue datasets and judge reliability assessment, we introduce two benchmarks that provide a framework for generation and evaluation. MentalBench-100k consolidates 10,000 one-turn conversations from three real-scenario datasets, each paired with nine LLM-generated responses, yielding 100,000 response pairs. MentalAlign-70k reframes evaluation by comparing four high-performing LLM judges with human experts across 70,000 ratings on seven attributes, grouped into Cognitive Support Score (CSS) and Affective Resonance Score (ARS). We then employ the Affective Cognitive Agreement Framework, a statistical methodology using intraclass correlation coefficients (ICC) with confidence intervals to quantify agreement, consistency, and bias between LLM judges and human experts. Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance. Our contributions establish new methodological and empirical foundations for reliable, large-scale evaluation of LLMs in mental health. We release the benchmarks and code at: https://github.com/abeerbadawi/MentalBench/
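
The distinction the abstract draws between agreement, consistency, and bias can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which ICC form is used, so the two-way single-rater forms (absolute agreement and consistency, in the Shrout and Fleiss style) are an assumption here, and the function name `icc_two_way` and the toy ratings are hypothetical. Confidence intervals are omitted for brevity.

```python
import numpy as np

def icc_two_way(ratings):
    """Two-way, single-rater ICCs for an (items x raters) matrix.

    Returns (absolute_agreement, consistency). Assumed forms, not
    necessarily the ones used in the paper.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape                      # n rated items, k raters
    grand = x.mean()
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)   # between items
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)   # between raters
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols  # residual
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    # Consistency ignores a constant offset between raters.
    consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    # Absolute agreement penalises that offset via the rater mean square.
    agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    return agreement, consistency

# Hypothetical toy data: an LLM judge that rates every response one
# point higher than the human expert (systematic inflation).
human = np.array([1, 2, 3, 4, 3])
judge = human + 1

icc_a, icc_c = icc_two_way(np.column_stack([human, judge]))
bias = float(np.mean(judge - human))   # mean judge-minus-human offset
print(round(icc_a, 3), round(icc_c, 3), bias)
```

In this toy case the judge ranks responses exactly as the human does, so consistency is 1, while the constant one-point inflation pulls absolute agreement down to about 0.72. Reporting both values alongside the mean offset is one way to separate a judge that is merely generous from one that is unreliable.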

Subjects:	Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Cite as:	arXiv:2510.19032 [cs.CL]
	(or arXiv:2510.19032v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2510.19032

Submission history

From: Elham Dolatabadi
[v1] Tue, 21 Oct 2025 19:21:21 UTC (2,807 KB)
