"Humor, Art, or Misinformation?": A Multimodal Dataset for Intent-Aware Synthetic Image Detection

Skoularikis, Anastasios; Papadopoulos, Stefanos-Iordanis; Papadopoulos, Symeon; Petrantonakis, Panagiotis C.

doi:10.1145/3746275.3762215

arXiv:2508.20670 (cs)

[Submitted on 28 Aug 2025]

Abstract: Recent advances in multimodal AI have enabled progress in detecting synthetic and out-of-context content. However, existing efforts largely overlook the intent behind AI-generated images. To fill this gap, we introduce S-HArM, a multimodal dataset for intent-aware classification, comprising 9,576 "in the wild" image-text pairs from Twitter/X and Reddit, labeled as Humor/Satire, Art, or Misinformation. Additionally, we explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to construct a large-scale synthetic training dataset with Stable Diffusion. We conduct an extensive comparative study including modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Our results show that models trained on image- and multimodally-guided data generalize better to "in the wild" content, due to preserved visual context. However, overall performance remains limited, highlighting the complexity of inferring intent and the need for specialized architectures.
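Modality fusion is one of the approaches compared above. As a rough illustration only, and not the authors' architecture, the following sketch shows early fusion for the three S-HArM labels: an image embedding and a text embedding are concatenated and scored by a linear head with a softmax. The embeddings are assumed to be precomputed by some pretrained encoder (not specified in the abstract), and the weights, bias, and function names are hypothetical placeholders.

```python
import math
from typing import List, Sequence, Tuple

# Label set from the S-HArM abstract; the ordering is an arbitrary choice here.
LABELS = ["Humor/Satire", "Art", "Misinformation"]


def fuse(image_emb: Sequence[float], text_emb: Sequence[float]) -> List[float]:
    """Early fusion (assumed scheme): concatenate image and text embeddings."""
    return list(image_emb) + list(text_emb)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(
    image_emb: Sequence[float],
    text_emb: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> Tuple[str, List[float]]:
    """Score the fused vector with a linear head (one weight row per label).

    `weights` and `bias` are hypothetical parameters; a real model would
    learn them from labeled image-text pairs.
    """
    x = fuse(image_emb, text_emb)
    if len(weights) != len(LABELS) or len(bias) != len(LABELS):
        raise ValueError("need one weight row and one bias per label")
    if any(len(row) != len(x) for row in weights):
        raise ValueError("each weight row must match the fused dimension")
    logits = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(LABELS)), key=lambda i: probs[i])
    return LABELS[best], probs


if __name__ == "__main__":
    # Toy 2-d embeddings and hand-picked weights, purely for illustration.
    label, probs = classify(
        [0.2, -0.1],
        [0.5, 0.3],
        [[1, 0, 1, 0], [0, 1, 0, 1], [-1, -1, -1, -1]],
        [0.0, 0.0, 0.0],
    )
    print(label, [round(p, 3) for p in probs])
```

Concatenation is the simplest fusion choice; the attention-based and contrastive approaches named in the abstract would replace the plain concatenation and linear head with learned cross-modal interactions.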

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2508.20670 [cs.CV]
	(or arXiv:2508.20670v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2508.20670
Related DOI:	https://doi.org/10.1145/3746275.3762215

Submission history

From: Stefanos-Iordanis Papadopoulos
[v1] Thu, 28 Aug 2025 11:22:15 UTC (12,011 KB)
