Benchmarking Large and Small MLLMs

Feng, Xuelu; Li, Yunsheng; Chen, Dongdong; Gao, Mei; Liu, Mengchen; Yuan, Junsong; Qiao, Chunming

arXiv:2501.04150 (cs)

[Submitted on 4 Jan 2025]

Title: Benchmarking Large and Small MLLMs

Authors: Xuelu Feng, Yunsheng Li, Dongdong Chen, Mei Gao, Mengchen Liu, Junsong Yuan, Chunming Qiao

Abstract: Large multimodal language models (MLLMs) such as GPT-4V and GPT-4o have achieved remarkable advancements in understanding and generating multimodal content, showcasing superior quality and capabilities across diverse tasks. However, their deployment faces significant challenges, including slow inference, high computational cost, and impracticality for on-device applications. In contrast, the emergence of small MLLMs, exemplified by the LLaVA-series models and Phi-3-Vision, offers promising alternatives with faster inference, reduced deployment costs, and the ability to handle domain-specific scenarios. Despite their growing presence, the capability boundaries between large and small MLLMs remain underexplored. In this work, we conduct a systematic and comprehensive evaluation to benchmark both small and large MLLMs, spanning general capabilities such as object recognition, temporal reasoning, and multimodal comprehension, as well as real-world applications in domains like industry and automotive. Our evaluation reveals that small MLLMs can achieve comparable performance to large models in specific scenarios but lag significantly in complex tasks requiring deeper reasoning or nuanced understanding. Furthermore, we identify common failure cases in both small and large MLLMs, highlighting domains where even state-of-the-art models struggle. We hope our findings will guide the research community in pushing the quality boundaries of MLLMs, advancing their usability and effectiveness across diverse applications.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.04150 [cs.CV]
	(or arXiv:2501.04150v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.04150

Submission history

From: Xuelu Feng
[v1] Sat, 4 Jan 2025 07:44:49 UTC (82,019 KB)
