DMF2Mel: A Dynamic Multiscale Fusion Network for EEG-Driven Mel Spectrogram Reconstruction

Fan, Cunhang; Zhang, Sheng; Zhang, Jingjing; Liu, Enrui; Li, Xinhui; Zhao, Minggang; Lv, Zhao

arXiv:2507.07526 (cs)

[Submitted on 10 Jul 2025]

Abstract: Decoding speech from brain signals is a challenging research problem. Although existing methods have made progress in reconstructing the mel spectrograms of auditory stimuli at the word or letter level, precisely reconstructing minute-level continuous imagined speech remains a core challenge: traditional models struggle to balance efficient temporal dependency modeling against information retention in long-sequence decoding. To address this issue, this paper proposes the Dynamic Multiscale Fusion Network (DMF2Mel), which consists of four core components: the Dynamic Contrastive Feature Aggregation Module (DC-FAM), the Hierarchical Attention-Guided Multi-Scale Network (HAMS-Net), the SplineMap attention mechanism, and the bidirectional state space module (convMamba). Specifically, DC-FAM separates speech-related "foreground features" from noisy "background features" through local convolution and global attention mechanisms, suppressing interference and enhancing the representation of transient signals. HAMS-Net, based on the U-Net framework, achieves cross-scale fusion of high-level semantics and low-level details. The SplineMap attention mechanism integrates the Adaptive Gated Kolmogorov-Arnold Network (AGKAN) to combine global context modeling with spline-based local fitting. convMamba captures long-range temporal dependencies with linear complexity and enhances nonlinear dynamic modeling capabilities. Results on the SparrKULee dataset show that DMF2Mel achieves a Pearson correlation coefficient of 0.074 in mel spectrogram reconstruction for known subjects (a 48% improvement over the baseline) and 0.048 for unknown subjects (a 35% improvement over the baseline). Code is available at: https://github.com/fchest/DMF2Mel.
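The reported scores are Pearson correlation coefficients between reconstructed and reference mel spectrograms. The abstract does not describe the exact evaluation protocol (window length, per-band versus pooled correlation, averaging over subjects), so the following is only a minimal sketch of one plausible way to compute such a score. The function names, the `(time, n_mels)` array layout, and the per-band averaging are assumptions made for illustration and are not taken from the paper or its repository.

```python
import numpy as np


def pearson_per_band(pred, target, eps=1e-8):
    """Pearson correlation along the time axis, one value per mel band.

    Assumed layout (not from the paper): arrays of shape (time, n_mels).
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Center each band over time.
    pred_c = pred - pred.mean(axis=0, keepdims=True)
    target_c = target - target.mean(axis=0, keepdims=True)
    # Covariance and product of standard deviations (unnormalized; the
    # normalization factors cancel in the ratio).
    cov = (pred_c * target_c).sum(axis=0)
    denom = np.sqrt((pred_c ** 2).sum(axis=0) * (target_c ** 2).sum(axis=0))
    # eps guards against division by zero for constant bands.
    return cov / (denom + eps)


def mel_pearson_score(pred, target):
    """Average the per-band correlations into one scalar (an assumed choice)."""
    return float(pearson_per_band(pred, target).mean())


def implied_baseline(score, relative_gain):
    """Baseline score implied by a relative improvement.

    Assumes the quoted percentage is a relative gain:
    score = baseline * (1 + relative_gain).
    """
    return score / (1.0 + relative_gain)
```

If the quoted percentages are relative gains, `implied_baseline(0.074, 0.48)` and `implied_baseline(0.048, 0.35)` suggest baseline correlations of roughly 0.050 and 0.036. These are back-of-the-envelope values derived from rounded figures in the abstract, not numbers reported by the authors.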

Comments:	Accepted by ACM MM 2025
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2507.07526 [cs.SD]
	(or arXiv:2507.07526v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2507.07526

Submission history

From: Sheng Zhang
[v1] Thu, 10 Jul 2025 08:15:03 UTC (858 KB)
