ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding

Pal, Ankit; Lee, Jung-Oh; Zhang, Xiaoman; Sankarasubbu, Malaikannan; Roh, Seunghyeon; Kim, Won Jung; Lee, Meesun; Rajpurkar, Pranav

arXiv:2506.04353 (cs)

[Submitted on 4 Jun 2025]

Abstract: We present ReXVQA, the largest and most comprehensive benchmark for visual question answering (VQA) in chest radiology, comprising approximately 696,000 questions paired with 160,000 chest X-ray studies across training, validation, and test sets. Unlike prior efforts that rely heavily on template-based queries, ReXVQA introduces a diverse and clinically authentic task suite reflecting five core radiological reasoning skills: presence assessment, location analysis, negation detection, differential diagnosis, and geometric reasoning. We evaluate eight state-of-the-art multimodal large language models, including MedGemma-4B-it, Qwen2.5-VL, Janus-Pro-7B, and Eagle2-9B. The best-performing model (MedGemma) achieves 83.24% overall accuracy. To bridge the gap between AI performance and clinical expertise, we conducted a comprehensive human reader study involving 3 radiology residents on 200 randomly sampled cases. Our evaluation demonstrates that MedGemma achieved superior performance (83.84% accuracy) compared to human readers (best radiology resident: 77.27%), representing a significant milestone where AI performance exceeds expert human evaluation on chest X-ray interpretation. The reader study reveals distinct performance patterns between AI models and human experts, with strong inter-reader agreement among radiologists but more variable agreement between human readers and AI models. ReXVQA establishes a new standard for evaluating generalist radiological AI systems, offering public leaderboards, fine-grained evaluation splits, structured explanations, and category-level breakdowns. This benchmark lays the foundation for next-generation AI systems capable of mimicking expert-level clinical reasoning beyond narrow pathology classification. Our dataset will be open-sourced at https://huggingface.co/datasets/rajpurkarlab/ReXVQA

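Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2506.04353 [cs.CV]
	(or arXiv:2506.04353v1 [cs.CV] for this version)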
	https://doi.org/10.48550/arXiv.2506.04353

Submission history

From: Ankit Pal
[v1] Wed, 4 Jun 2025 18:11:59 UTC (20,688 KB)
