
Computer Science > Computation and Language

arXiv:2510.19032 (cs)
[Submitted on 21 Oct 2025]

Title: When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation


Authors: Abeer Badawi, Elahe Rahimi, Md Tahmid Rahman Laskar, Sheri Grach, Lindsay Bertrand, Lames Danok, Jimmy Huang, Frank Rudzicz, Elham Dolatabadi
Abstract: Evaluating Large Language Models (LLMs) for mental health support is challenging due to the emotionally and cognitively complex nature of therapeutic dialogue. Existing benchmarks are limited in scale and reliability, often relying on synthetic or social media data, and lack frameworks to assess when automated judges can be trusted. To address the need for large-scale dialogue datasets and judge reliability assessment, we introduce two benchmarks that provide a framework for generation and evaluation. MentalBench-100k consolidates 10,000 one-turn conversations from three real-scenario datasets, each paired with nine LLM-generated responses, yielding 100,000 response pairs. MentalAlign-70k reframes evaluation by comparing four high-performing LLM judges with human experts across 70,000 ratings on seven attributes, grouped into Cognitive Support Score (CSS) and Affective Resonance Score (ARS). We then employ the Affective Cognitive Agreement Framework, a statistical methodology using intraclass correlation coefficients (ICC) with confidence intervals to quantify agreement, consistency, and bias between LLM judges and human experts. Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance. Our contributions establish new methodological and empirical foundations for reliable, large-scale evaluation of LLMs in mental health. We release the benchmarks and code at: https://github.com/abeerbadawi/MentalBench/
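The Affective Cognitive Agreement Framework described in the abstract quantifies judge-expert agreement, consistency, and bias with intraclass correlation coefficients (ICC) and confidence intervals. As a rough, hypothetical sketch of that style of analysis (not the authors' released code; the ICC variants, the bootstrap CI, and the helper names below are illustrative assumptions):

```python
# Illustrative sketch only: Shrout & Fleiss intraclass correlations on a
# targets x raters matrix of ratings, with a percentile bootstrap CI over
# targets. Function names and the demo data are hypothetical.
import numpy as np


def icc_point_estimates(Y: np.ndarray) -> dict:
    """Y has shape (n_targets, k_raters); returns ICC(2,1) and ICC(3,1)."""
    n, k = Y.shape
    grand = Y.mean()
    row_means = Y.mean(axis=1)
    col_means = Y.mean(axis=0)

    # Two-way ANOVA mean squares (targets, raters, residual).
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
    ss_err = np.sum((Y - row_means[:, None] - col_means[None, :] + grand) ** 2)
    ms_err = ss_err / ((n - 1) * (k - 1))

    # ICC(2,1): two-way random effects, absolute agreement, single rater.
    icc_agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    # ICC(3,1): two-way mixed effects, consistency, single rater.
    icc_consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    return {"agreement": icc_agreement, "consistency": icc_consistency}


def bootstrap_ci(Y, key="agreement", n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI by resampling targets (rated responses)."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    stats = [
        icc_point_estimates(Y[rng.integers(0, n, size=n)])[key]
        for _ in range(n_boot)
    ]
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])


if __name__ == "__main__":
    # Hypothetical example: 8 responses rated 1-5 by a human expert (col 0)
    # and one LLM judge (col 1) on a single attribute, e.g. empathy.
    ratings = np.array(
        [[3, 4], [2, 3], [4, 5], [5, 5], [1, 2], [3, 4], [4, 4], [2, 3]],
        dtype=float,
    )
    est = icc_point_estimates(ratings)
    lo, hi = bootstrap_ci(ratings, "agreement")
    bias = ratings[:, 1].mean() - ratings[:, 0].mean()  # positive => judge inflation
    print(f"ICC(2,1) agreement   = {est['agreement']:.3f}  95% CI [{lo:.3f}, {hi:.3f}]")
    print(f"ICC(3,1) consistency = {est['consistency']:.3f}")
    print(f"Mean judge - human difference (bias) = {bias:+.2f}")
```

The paper's actual implementation and confidence interval procedure are in the repository linked in the abstract.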
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Cite as: arXiv:2510.19032 [cs.CL]
  (or arXiv:2510.19032v1 [cs.CL] for this version)
  https://doi.org/10.48550/arXiv.2510.19032
arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Elham Dolatabadi
[v1] Tue, 21 Oct 2025 19:21:21 UTC (2,807 KB)
Full-text links:

Access Paper:

  • View Chinese PDF
  • View PDF
  • HTML (experimental)
  • TeX Source
  • View license