STORM: Benchmarking Visual Rating of MLLMs with a Comprehensive Ordinal Regression Dataset

Wang, Jinhong; Tong, Shuo; Liu, Jian; Tang, Dongqi; Chen, Jintai; Ying, Haochao; Xu, Hongxia; Chen, Danny; Wu, Jian

Computer Science > Computer Vision and Pattern Recognition

arXiv:2506.01738 (cs)

[Submitted on 2 Jun 2025]

Title: STORM: Benchmarking Visual Rating of MLLMs with a Comprehensive Ordinal Regression Dataset

Authors: Jinhong Wang, Shuo Tong, Jian Liu, Dongqi Tang, Jintai Chen, Haochao Ying, Hongxia Xu, Danny Chen, Jian Wu

Abstract: Visual rating is an essential capability of artificial intelligence (AI) for multi-dimensional quantification of visual content, primarily applied in ordinal regression (OR) tasks such as image quality assessment, facial age estimation, and medical image grading. However, current multi-modal large language models (MLLMs) underperform at visual rating and also suffer from a lack of relevant datasets and benchmarks. In this work, we collect and present STORM, a data collection and benchmark for Stimulating Trustworthy Ordinal Regression Ability of MLLMs for universal visual rating. STORM encompasses 14 ordinal regression datasets across five common visual rating domains, comprising 655K image-level pairs and the corresponding carefully curated VQAs. Importantly, we also propose a coarse-to-fine processing pipeline that dynamically considers label candidates and provides interpretable thoughts, giving MLLMs a general and trustworthy ordinal thinking paradigm. This benchmark aims to evaluate the all-in-one and zero-shot performance of MLLMs in scenarios requiring understanding of the essential common ordinal relationships of rating labels. Extensive experiments demonstrate the effectiveness of our framework and shed light on better fine-tuning strategies. The STORM dataset, benchmark, code, and pre-trained models are released on the project page to support further research in this area: https://storm-bench.github.io/.
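As a rough illustration of why the ordinal relationships between rating labels matter, the sketch below compares plain accuracy with mean absolute error over label ranks. It is not taken from STORM: the five-level scale, the function names, and the toy predictions are all hypothetical and serve only to show the general idea behind ordinal regression scoring.

```python
from typing import Sequence

# Hypothetical ordered rating scale (an assumption, not a STORM label set).
# The position of a label in this list is its rank.
LEVELS = ["bad", "poor", "fair", "good", "excellent"]


def rank(label: str) -> int:
    """Map a rating label to its position on the ordered scale."""
    return LEVELS.index(label)


def accuracy(preds: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of exact matches; ignores how far off a wrong label is."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def mean_absolute_error(preds: Sequence[str], targets: Sequence[str]) -> float:
    """Average rank distance between predicted and true labels."""
    return sum(abs(rank(p) - rank(t)) for p, t in zip(preds, targets)) / len(targets)


targets = ["good", "fair", "excellent", "poor"]
near_miss = ["excellent", "good", "good", "fair"]   # each prediction is one level off
far_miss = ["bad", "excellent", "bad", "excellent"]  # each prediction is far off

# Both sets of predictions have zero exact matches, so accuracy cannot
# tell them apart; the rank-based error can.
print(accuracy(near_miss, targets), mean_absolute_error(near_miss, targets))
print(accuracy(far_miss, targets), mean_absolute_error(far_miss, targets))
```

Under these toy inputs both prediction sets score the same accuracy, while the rank-based error separates the near misses from the far ones, which is the kind of distinction an ordinal benchmark needs to capture.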

Comments:	Under review at NIPS2025 D&B track
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2506.01738 [cs.CV]
	(or arXiv:2506.01738v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2506.01738

Submission history

From: Jinhong Wang
[v1] Mon, 2 Jun 2025 14:48:15 UTC (3,033 KB)
