
Computer Science > Artificial Intelligence

arXiv:2507.17539 (cs)
[Submitted on 23 Jul 2025]

Title: Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning


Authors: Xinyao Liu, Diping Song
Abstract: Multimodal large language models (MLLMs) demonstrate significant potential in the field of medical diagnosis. However, they face critical challenges in specialized domains such as ophthalmology, particularly the fragmentation of annotation granularity and inconsistencies in clinical reasoning logic, which hinder precise cross-modal understanding. This paper introduces FundusExpert, an ophthalmology-specific MLLM with integrated positioning-diagnosis reasoning capabilities, along with FundusGen, a dataset constructed through the intelligent Fundus-Engine system. Fundus-Engine automates localization and leverages MLLM-based semantic expansion to integrate global disease classification, local object detection, and fine-grained feature analysis within a single fundus image. Additionally, by constructing a clinically aligned cognitive chain, it guides the model to generate interpretable reasoning paths. FundusExpert, fine-tuned with instruction data from FundusGen, achieves the best performance in ophthalmic question-answering tasks, surpassing the average accuracy of the 40B MedRegA by 26.6%. It also excels in zero-shot report generation tasks, achieving a clinical consistency of 77.0%, significantly outperforming GPT-4o's 47.6%. Furthermore, we reveal a scaling law between data quality and model capability ($L \propto N^{0.068}$), demonstrating that the cognitive alignment annotations in FundusGen enhance data utilization efficiency. By integrating region-level localization with diagnostic reasoning chains, our work develops a scalable, clinically aligned MLLM and explores a pathway toward bridging the visual-language gap in domain-specific MLLMs. Our project can be found at https://github.com/MeteorElf/FundusExpert.
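The reported scaling law $L \propto N^{0.068}$ relates data scale $N$ to a capability metric $L$. As an illustration only, the minimal Python sketch below shows the standard way such a power-law exponent is estimated: a linear fit in log-log space. The data points and constants here are synthetic assumptions generated to follow the abstract's exponent; they are not the paper's measurements or fitting procedure.

```python
import numpy as np

# Minimal sketch (not the paper's procedure): estimate the exponent k
# of a power law L = c * N^k via linear regression in log-log space,
# since log L = log c + k * log N.

rng = np.random.default_rng(0)
N = np.array([1e4, 3e4, 1e5, 3e5, 1e6])  # hypothetical dataset sizes
# Synthetic capability metric following the abstract's exponent (0.068),
# perturbed with small multiplicative noise.
L = 2.0 * N**0.068 * np.exp(rng.normal(0.0, 0.01, N.size))

k, log_c = np.polyfit(np.log(N), np.log(L), 1)  # slope = exponent k
print(f"estimated exponent k ≈ {k:.3f}")  # ≈ 0.068 on this synthetic data
```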
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Cite as: arXiv:2507.17539 [cs.AI]
  (or arXiv:2507.17539v1 [cs.AI] for this version)
  https://doi.org/10.48550/arXiv.2507.17539
arXiv-issued DOI via DataCite

Submission history

From: Xinyao Liu
[v1] Wed, 23 Jul 2025 14:19:30 UTC (1,540 KB)