ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism

Liu, Zedong; Cheng, Shenggan; Tan, Guangming; You, Yang; Tao, Dingwen

Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2507.10069 (cs)

[Submitted on 14 Jul 2025]

Title: ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism

Authors: Zedong Liu, Shenggan Cheng, Guangming Tan, Yang You, Dingwen Tao

Abstract: Multimodal large language models (MLLMs) extend LLMs to handle images, videos, and audio by incorporating feature extractors and projection modules. However, these additional components, combined with complex inference pipelines and heterogeneous workloads, introduce significant inference overhead. Therefore, efficiently serving MLLMs remains a major challenge. Current tightly coupled serving architectures struggle to distinguish between mixed request types or adapt parallelism strategies to different inference stages, leading to increased time-to-first-token (TTFT) latency and poor resource utilization. To address this, we propose Elastic Multimodal Parallelism (EMP), a new serving paradigm that elastically adapts to resource heterogeneity across request types and inference stages. Building upon EMP, we develop ElasticMM, an MLLM serving system that (1) separates requests into independent modality groups with dynamic resource allocation via a modality-aware load balancer; (2) decouples inference stages and enables parallelism adjustment and adaptive scaling via elastic partition scheduling; and (3) improves inference efficiency through unified multimodal prefix caching and non-blocking encoding. Experiments on diverse real-world datasets show that ElasticMM outperforms state-of-the-art (SOTA) serving systems, reducing TTFT by up to 4.2x and achieving 3.2-4.5x higher throughput while meeting service-level objectives (SLOs).

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2507.10069 [cs.DC]
	(or arXiv:2507.10069v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2507.10069

Submission history

From: Zedong Liu
[v1] Mon, 14 Jul 2025 08:53:48 UTC (1,700 KB)
