MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models

Govindarajan, Vijay; Patel, Pratik; Tripathi, Sahil; Hoque, Md Azizul; Kashyap, Gautam Siddharth

arXiv:2509.12591 (cs)

[Submitted on 16 Sep 2025]

Abstract: Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets compared to image captioning. To overcome this, we propose a zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Our approach uses a pre-trained audio CLIP model to extract auditory features and generate a structured prompt, which guides a Large Language Model (LLM) in caption generation. Unlike traditional greedy decoding, our method refines token selection through the audio CLIP model, ensuring alignment with the audio content. Experimental results demonstrate a 35% improvement in NLG mean score (from 4.7 to 7.3) using MAGIC search with the WavCaps model. Performance is heavily influenced by the audio-text matching model and keyword selection: the best results are achieved with a single-keyword prompt, and performance drops by 50% when no keyword list is used.
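
The pipeline described in the abstract (keyword selection by audio-text matching, a structured prompt, and audio-guided token selection in place of plain greedy decoding) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names, the prompt template, the `beta` weight, and the toy embeddings are hypothetical, and the scoring is only loosely modelled on MAGIC-style search (language-model probability plus a softmax-normalised audio-text similarity), with no degeneration penalty. The abstract does not give the paper's actual formula or prompt.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_keywords(audio_emb, keyword_embs, k=1):
    """Return the k keywords whose text embeddings best match the audio.

    keyword_embs maps each keyword to its (hypothetical) text embedding.
    """
    ranked = sorted(
        keyword_embs,
        key=lambda word: cosine(audio_emb, keyword_embs[word]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(keywords):
    """Build a structured prompt; this template is an assumption."""
    return "Objects: " + ", ".join(keywords) + ". This is a sound of"


def magic_step(candidates, audio_emb, embed_text, prefix, beta=2.0):
    """Pick the next token from the LM's top candidates.

    candidates: list of (token, lm_probability) pairs.
    embed_text: callable mapping a string to a text embedding
                (stands in for the audio CLIP text encoder).
    Score = LM probability + beta * softmax-normalised audio-text similarity.
    """
    sims = [cosine(audio_emb, embed_text(prefix + tok)) for tok, _ in candidates]
    match = softmax(sims)
    scores = [prob + beta * m for (_, prob), m in zip(candidates, match)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best][0]


if __name__ == "__main__":
    # Toy 2-D embeddings; a real system would use an audio CLIP encoder.
    audio = [1.0, 0.0]
    keyword_embs = {"dog": [0.9, 0.1], "rain": [0.0, 1.0]}
    prompt = build_prompt(select_keywords(audio, keyword_embs, k=1))

    def toy_embed(text):
        return [1.0, 0.0] if "barking" in text else [0.0, 1.0]

    candidates = [(" raining", 0.6), (" barking", 0.4)]
    print(prompt + magic_step(candidates, audio, toy_embed, prompt))
```

With `beta=0` the step reduces to greedy decoding and follows the language model alone; a larger `beta` lets the audio-text match override a candidate that is more probable but less consistent with the audio.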

Comments:	Accepted in The 26th International Conference on Web Information Systems Engineering (WISE), scheduled for 15-17 December 2025 in Marrakech, Morocco
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2509.12591 [cs.CL]
	(or arXiv:2509.12591v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2509.12591

Submission history

From: Gautam Siddharth Kashyap
[v1] Tue, 16 Sep 2025 02:36:00 UTC (37 KB)
