Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents

Xing, Fuyu; Wang, Zimu; Wang, Wei; Zhang, Haiyang

arXiv:2509.12876v1 (cs)

[Submitted on 16 Sep 2025]

Authors: Fuyu Xing, Zimu Wang, Wei Wang, Haiyang Zhang

Abstract: The proliferation of multimedia content necessitates the development of effective Multimedia Event Extraction (M2E2) systems. Though Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility in the M2E2 task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M2E2 dataset. Our evaluations cover text-only, image-only, and cross-media subtasks, assessed under both few-shot prompting and fine-tuning settings. Our key findings highlight the following valuable insights: (1) Few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks; (2) Fine-tuning LVLMs with LoRA substantially enhances model performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis to reveal persistent challenges in areas such as semantic precision, localization, and cross-modal grounding, which remain critical obstacles for advancing M2E2 capabilities.

Comments:	Accepted at INLG 2025. Camera-ready version
Subjects:	Computation and Language (cs.CL); Multimedia (cs.MM)
Cite as:	arXiv:2509.12876 [cs.CL]
	(or arXiv:2509.12876v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2509.12876

Submission history

From: Zimu Wang
[v1] Tue, 16 Sep 2025 09:29:02 UTC (3,029 KB)
