Computer Science > Sound

arXiv:2509.11976 (cs)
[Submitted on 15 Sep 2025 (v1), last revised 23 Sep 2025 (this version, v3)]

Title: PoolingVQ: A VQVAE Variant for Reducing Audio Redundancy and Boosting Multi-Modal Fusion in Music Emotion Analysis


Authors:Dinghao Zou, Yicheng Gong, Xiaokang Li, Xin Cao, Sunbowen Lee
Abstract: Multimodal music emotion analysis leverages both audio and MIDI modalities to enhance performance. While mainstream approaches focus on complex feature extraction networks, we propose that shortening the length of audio sequence features to mitigate redundancy, especially in contrast to MIDI's compact representation, may effectively boost task performance. To achieve this, we developed PoolingVQ by combining Vector Quantized Variational Autoencoder (VQVAE) with spatial pooling, which directly compresses audio feature sequences through codebook-guided local aggregation to reduce redundancy, then devised a two-stage co-attention approach to fuse audio and MIDI information. Experimental results on the public datasets EMOPIA and VGMIDI demonstrate that our multimodal framework achieves state-of-the-art performance, with PoolingVQ yielding effective improvement. Our proposed metho's code is available at Anonymous GitHub
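The core idea the abstract describes — pooling adjacent audio frames to shorten the sequence, then snapping each pooled vector to its nearest codebook entry — can be illustrated with a minimal sketch. This is a hypothetical illustration, not the authors' implementation; the function name `pooling_vq`, the window size, and the use of mean pooling are assumptions.

```python
import numpy as np

def pooling_vq(features, codebook, pool_size=4):
    """Sketch of pooled vector quantization.

    features: (T, D) audio feature sequence
    codebook: (K, D) learned code vectors
    Returns the quantized, shortened (T // pool_size, D) sequence
    and the chosen codebook indices.
    """
    T, D = features.shape
    T_trim = (T // pool_size) * pool_size
    # Local aggregation: mean over each non-overlapping window of frames,
    # shortening the sequence by a factor of pool_size.
    pooled = features[:T_trim].reshape(-1, pool_size, D).mean(axis=1)
    # Standard VQ step: nearest-neighbour lookup in the codebook.
    dists = ((pooled[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return codebook[indices], indices

rng = np.random.default_rng(0)
feats = rng.normal(size=(64, 8))   # 64 frames of 8-dim audio features
codes = rng.normal(size=(16, 8))   # 16 codebook entries
quantized, idx = pooling_vq(feats, codes)
print(quantized.shape)  # (16, 8): the sequence is 4x shorter
```

The shortened quantized sequence is then what a downstream fusion module (here, the paper's two-stage co-attention) would attend over together with the MIDI representation.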
Subjects: Sound (cs.SD) ; Audio and Speech Processing (eess.AS)
Cite as: arXiv:2509.11976 [cs.SD]
  (or arXiv:2509.11976v3 [cs.SD] for this version)
  https://doi.org/10.48550/arXiv.2509.11976
arXiv-issued DOI via DataCite

Submission history

From: Dinghao Zou
[v1] Mon, 15 Sep 2025 14:24:04 UTC (1,202 KB)
[v2] Mon, 22 Sep 2025 13:57:49 UTC (1,214 KB)
[v3] Tue, 23 Sep 2025 02:20:49 UTC (1,214 KB)
