Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning

Benhammou, Yassir; Tiberio, Alessandro; Trautmann, Gabriel; Kalyan, Suman

arXiv:2504.15199 (cs)

[Submitted on 21 Apr 2025]

Authors: Yassir Benhammou, Alessandro Tiberio, Gabriel Trautmann, Suman Kalyan

Abstract: MILS (Multimodal Iterative LLM Solver) is a recently published framework that claims "LLMs can see and hear without any training" by leveraging an iterative, LLM-CLIP based approach for zero-shot image captioning. While MILS demonstrates good performance, our investigation reveals that this success comes at a substantial, hidden computational cost due to its expensive multi-step refinement process. In contrast, alternative models such as BLIP-2 and GPT-4V achieve competitive results through a streamlined, single-pass approach. We hypothesize that the significant overhead inherent in MILS's iterative process may undermine its practical benefits, challenging the narrative that zero-shot performance can be attained without heavy resource demands. This work is the first to expose and quantify the trade-offs between output quality and computational cost in MILS, providing critical insights for the design of more efficient multimodal models.

Comments:	9 pages, 2 tables, 1 figure
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Cite as:	arXiv:2504.15199 [cs.CV]
	(or arXiv:2504.15199v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.15199

Submission history

From: Yassir Benhammou
[v1] Mon, 21 Apr 2025 16:16:19 UTC (514 KB)
