Computer Science > Computer Vision and Pattern Recognition

arXiv:2507.17088 (cs)
[Submitted on 23 Jul 2025]

Title: FedVLM: Scalable Personalized Vision-Language Models through Federated Learning


Authors:Arkajyoti Mitra (1), Afia Anjum (1), Paul Agbaje (1), Mert Pesé (2), Habeeb Olufowobi (1) ((1) University of Texas at Arlington, (2) Clemson University)
Abstract: Vision-language models (VLMs) demonstrate impressive zero-shot and few-shot learning capabilities, making them essential for several downstream tasks. However, fine-tuning these models at scale remains challenging, particularly in federated environments where data is decentralized and non-iid across clients. Existing parameter-efficient tuning methods like LoRA (Low-Rank Adaptation) reduce computational overhead but struggle with heterogeneous client data, leading to suboptimal generalization. To address these challenges, we propose FedVLM, a federated LoRA fine-tuning framework that enables decentralized adaptation of VLMs while preserving model privacy and reducing reliance on centralized training. To further tackle data heterogeneity, we introduce personalized LoRA (pLoRA), which dynamically adapts LoRA parameters to each client's unique data distribution, significantly improving local adaptation while maintaining global model aggregation. Experiments on the RLAIF-V dataset show that pLoRA improves client-specific performance by 24.5% over standard LoRA, demonstrating superior adaptation in non-iid settings. FedVLM provides a scalable and efficient solution for fine-tuning VLMs in federated settings, advancing personalized adaptation in distributed learning scenarios.
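To make the idea of combining federated aggregation with client-specific LoRA parameters concrete, the following is a minimal sketch in Python/NumPy. It assumes a common pattern in which each client's low-rank update is factored as B·A, only one factor is averaged across clients (FedAvg-style), and the other is kept local as the personalized component. The factor split, the aggregation rule, and all names (ClientAdapter, aggregate_shared, RANK, etc.) are illustrative assumptions, not the paper's actual pLoRA algorithm or implementation.

    # Sketch: federated LoRA with a personalized factor (illustrative, not the paper's code).
    import numpy as np

    RANK, D_IN, D_OUT, NUM_CLIENTS = 4, 16, 16, 3
    rng = np.random.default_rng(0)

    class ClientAdapter:
        """Per-client LoRA adapter: delta_W = B @ A, a rank-RANK update to a frozen weight."""
        def __init__(self):
            self.A = rng.normal(scale=0.01, size=(RANK, D_IN))  # shared factor, aggregated globally
            self.B = np.zeros((D_OUT, RANK))                     # personalized factor, kept local

        def local_update(self):
            # Placeholder for local fine-tuning on the client's non-iid data.
            self.A += rng.normal(scale=0.001, size=self.A.shape)
            self.B += rng.normal(scale=0.001, size=self.B.shape)

    def aggregate_shared(clients):
        """FedAvg-style mean over the shared LoRA factor only; B never leaves the client."""
        mean_A = np.mean([c.A for c in clients], axis=0)
        for c in clients:
            c.A = mean_A.copy()

    clients = [ClientAdapter() for _ in range(NUM_CLIENTS)]
    for round_idx in range(5):          # communication rounds
        for c in clients:
            c.local_update()
        aggregate_shared(clients)

    # Each client's effective update combines the global A with its personal B.
    delta_W = clients[0].B @ clients[0].A
    print(delta_W.shape)  # (16, 16)

Under these assumptions, only the low-rank factors are ever communicated, which is what keeps the per-round cost small relative to exchanging full VLM weights; how pLoRA actually partitions or adapts the factors is described in the paper itself.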
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2507.17088 [cs.CV]
  (or arXiv:2507.17088v1 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2507.17088
arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Arkajyoti Mitra
[v1] Wed, 23 Jul 2025 00:05:02 UTC (2,660 KB)