MM-Retinal V2: Transfer an Elite Knowledge Spark into Fundus Vision-Language Pretraining

Wu, Ruiqi; Su, Na; Zhang, Chenran; Ma, Tengfei; Zhou, Tao; Cui, Zhiting; Tang, Nianfeng; Mao, Tianyu; Zhou, Yi; Fan, Wen; Wu, Tianxing; Jing, Shenqi; Fu, Huazhu

arXiv:2501.15798 (cs)

[Submitted on 27 January 2025]

Authors: Ruiqi Wu, Na Su, Chenran Zhang, Tengfei Ma, Tao Zhou, Zhiting Cui, Nianfeng Tang, Tianyu Mao, Yi Zhou, Wen Fan, Tianxing Wu, Shenqi Jing, Huazhu Fu

Abstract: Vision-language pretraining (VLP) has been investigated as a way to generalize across diverse downstream tasks in fundus image analysis. Although recent methods show promising results, they rely heavily on large-scale private image-text data and pay less attention to the pretraining strategy, which limits further progress. In this work, we introduce MM-Retinal V2, a high-quality image-text paired dataset comprising CFP, FFA, and OCT image modalities. We then propose a novel fundus vision-language pretraining model, KeepFIT V2, which is pretrained by integrating knowledge from the elite data spark into categorical public datasets. Specifically, a preliminary textual pretraining stage equips the text encoder with primary ophthalmic textual knowledge. Moreover, a hybrid image-text knowledge injection module is designed for knowledge transfer; it combines global semantic concepts from contrastive learning with local appearance details from generative learning. Extensive experiments across zero-shot, few-shot, and linear probing settings highlight the generalization and transferability of KeepFIT V2, which delivers performance competitive with state-of-the-art fundus VLP models trained on large-scale private image-text datasets. Our dataset and model are publicly available via https://github.com/lxirich/MM-Retinal.
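The abstract describes the knowledge injection module only at a high level, as a combination of a contrastive objective and a generative objective. The following is a minimal, generic sketch of how two such terms can be summed into one training loss; it is not the KeepFIT V2 implementation. The function names, the mean-squared-error stand-in for the generative term, the temperature of 0.07, and the weights `alpha` and `beta` are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row k of each input is assumed to be a matched pair."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine-similarity logits: rows index images, columns index texts.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


def reconstruction_loss(pred, target):
    """Mean squared error, a hypothetical stand-in for a generative objective."""
    sq = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(sq) / len(sq)


def hybrid_loss(image_embs, text_embs, pred, target, alpha=1.0, beta=1.0):
    """Weighted sum of both terms; alpha and beta are illustrative weights."""
    return (alpha * contrastive_loss(image_embs, text_embs)
            + beta * reconstruction_loss(pred, target))
```

In this sketch the contrastive term rewards matched image-text pairs for being closer than mismatched pairs in the batch (the global, semantic signal), while the reconstruction term penalizes element-wise differences (a rough proxy for local detail). How the actual model balances and implements the two objectives is specified in the paper and repository, not here.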

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.15798 [cs.CV]
	(or arXiv:2501.15798v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.15798

Submission history

From: Ruiqi Wu
[v1] Mon, 27 Jan 2025 05:49:06 UTC (3,743 KB)
