OpenAIs HealthBench in Action: Evaluating an LLM-Based Medical Assistant on Realistic Clinical Queries

Ravichandran, Sandhanakrishnan; Kumar, Shivesh; Da Silva, Rogerio Corga; Romano, Miguel; Berkels, Reinhard; van der Heijden, Michiel; Fail, Olivier; Gnanapragasam, Valentine Emmanuel

arXiv:2509.02594 (q-bio)

[Submitted on 29 Aug 2025]

Abstract: Evaluating large language models (LLMs) on their ability to generate high-quality, accurate, situationally aware answers to clinical questions requires going beyond conventional benchmarks to assess how these systems behave in complex, high-stakes clinical scenarios. Traditional evaluations are often limited to multiple-choice questions, which fail to capture essential competencies such as contextual reasoning, context awareness, and uncertainty handling. To address these limitations, we evaluate our agentic, RAG-based clinical support assistant, DR.INFO, using HealthBench, a rubric-driven benchmark composed of open-ended, expert-annotated health conversations. On the Hard subset of 1,000 challenging examples, DR.INFO achieves a HealthBench score of 0.51, substantially outperforming leading frontier LLMs (GPT-5, o3, Grok 3, GPT-4, Gemini 2.5, etc.) across all behavioral axes (accuracy, completeness, instruction following, etc.). In a separate 100-sample evaluation against similar agentic RAG assistants (OpenEvidence, Pathway.md), it maintains its performance lead with a HealthBench score of 0.54. These results highlight DR.INFO's strengths in communication, instruction following, and accuracy, while also revealing room for improvement in context awareness and response completeness. Overall, the findings underscore the utility of behavior-level, rubric-based evaluation for building a reliable and trustworthy AI-enabled clinical support assistant.
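The abstract reports HealthBench scores without spelling out how a rubric-driven score is formed. The following is a minimal sketch of one common reading of the publicly described HealthBench scheme, not the authors' evaluation code: each rubric criterion carries positive or negative points, a grader marks it met or not met, a conversation's score is the earned points divided by the maximum achievable positive points, and the benchmark score is the mean over conversations clipped to [0, 1]. The names `RubricItem`, `example_score`, and `benchmark_score` are hypothetical, and the clipping and normalisation details are assumptions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RubricItem:
    """One rubric criterion (hypothetical structure for illustration)."""
    points: int  # positive for desired behaviour, negative for undesired
    met: bool    # grader's judgment: did the response meet this criterion?


def example_score(items):
    """Earned points on met criteria divided by the maximum positive points."""
    possible = sum(i.points for i in items if i.points > 0)
    if possible == 0:
        raise ValueError("rubric has no positively weighted criteria")
    earned = sum(i.points for i in items if i.met)
    return earned / possible  # may be negative if penalties dominate


def benchmark_score(examples):
    """Mean of per-conversation scores, clipped to [0, 1] (assumed)."""
    raw = mean(example_score(e) for e in examples)
    return min(1.0, max(0.0, raw))


if __name__ == "__main__":
    conversations = [
        [RubricItem(5, True), RubricItem(5, False), RubricItem(-4, False)],
        [RubricItem(8, True)],
    ]
    print(benchmark_score(conversations))
```

Under this reading, a score such as 0.51 would mean that, averaged over conversations, the assistant earned roughly half of the achievable rubric points after penalties.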

Comments:	13 pages, two graphs
Subjects:	Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
Cite as:	arXiv:2509.02594 [q-bio.QM]
	(or arXiv:2509.02594v1 [q-bio.QM] for this version)
	https://doi.org/10.48550/arXiv.2509.02594

Submission history

From: Valentine Emmanuel Gnanapragasam VmeG
[v1] Fri, 29 Aug 2025 09:51:41 UTC (917 KB)
