Temporal Information Reconstruction and Non-Aligned Residual in Spiking Neural Networks for Speech Classification

Zhang, Qi; Wang, Huamin; Shen, Hangchi; Duan, Shukai; Wen, Shiping; Huang, Tingwen

arXiv:2501.00348 (cs)

[Submitted on 31 Dec 2024]

Abstract: Most models based on spiking neural networks (SNNs) process speech classification inputs at a single temporal resolution, which prevents them from learning information in the input data at different temporal scales. In addition, because the data before and after the sub-modules of many models differ in time length, effective residual connections cannot be applied to optimize the training of these models. To address these problems, we first propose a novel method named Temporal Reconstruction (TR), which reconstructs the temporal dimension of the audio spectrum, drawing on the hierarchical process by which the human brain understands speech. An SNN model reconstructed with TR can learn information in the input data at different temporal resolutions and can therefore model more comprehensive semantic information from audio data. Second, based on an analysis of audio data, we propose the Non-Aligned Residual (NAR) method, which allows a residual connection to be used between two pieces of audio data with different time lengths. We conducted extensive experiments on the Spiking Speech Commands (SSC), Spiking Heidelberg Digits (SHD), and Google Speech Commands v0.02 (GSC) datasets. On SSC, we achieve a state-of-the-art (SOTA) test classification accuracy of 81.02% among all SNN models; on SHD, we obtain a SOTA classification accuracy of 96.04% among all models.
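
The abstract names two ideas without giving their mechanics: viewing the same audio at a coarser temporal resolution, and adding a residual shortcut between sequences of different time lengths. The sketch below is only an illustration of those two shape manipulations, not the paper's TR or NAR method. The function names (`coarsen_time`, `pool_time`, `residual_across_lengths`), the choice of folding adjacent frames into the feature axis, and the choice of average pooling for the shortcut are all assumptions made for this example.

```python
from typing import List

Frame = List[float]       # feature values at one time step
Sequence = List[Frame]    # T time steps, each with F features


def coarsen_time(x: Sequence, factor: int) -> Sequence:
    """Fold `factor` adjacent time steps into one wider frame.

    Input has T frames of width F; output has T // factor frames of
    width factor * F. Trailing frames that do not fill a group are dropped.
    (Hypothetical helper, not taken from the paper.)
    """
    t_out = len(x) // factor
    return [
        [v for frame in x[i * factor:(i + 1) * factor] for v in frame]
        for i in range(t_out)
    ]


def pool_time(x: Sequence, t_out: int) -> Sequence:
    """Average-pool a sequence along time down to `t_out` frames."""
    t_in = len(x)
    if not 0 < t_out <= t_in:
        raise ValueError("t_out must be between 1 and len(x)")
    pooled = []
    for i in range(t_out):
        # Integer boundaries keep every group non-empty when t_out <= t_in.
        group = x[i * t_in // t_out:(i + 1) * t_in // t_out]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def residual_across_lengths(x_in: Sequence, x_out: Sequence) -> Sequence:
    """Add a time-pooled copy of `x_in` to the shorter `x_out`.

    Assumes both sequences share the same feature width. Average pooling
    is a stand-in for whatever alignment the paper actually uses.
    """
    shortcut = pool_time(x_in, len(x_out))
    return [[a + b for a, b in zip(fo, fs)] for fo, fs in zip(x_out, shortcut)]


# Six frames with two features each.
spectrum = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0],
            [6.0, 7.0], [8.0, 9.0], [10.0, 11.0]]
coarse = coarsen_time(spectrum, 2)          # 3 frames, 4 features each
module_out = [[1.0, 1.0]] * 3               # pretend sub-module output, 3 frames
summed = residual_across_lengths(spectrum, module_out)
```

Here `coarse` has half as many time steps as `spectrum`, and `summed` shows that a shortcut can still be formed once the longer sequence is pooled to the shorter one's length. A real SNN would apply these operations to spike tensors with batch and channel axes.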

Comments:	9 pages, 5 figures
Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2501.00348 [cs.SD]
	(or arXiv:2501.00348v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2501.00348

Submission history

From: Huamin Wang
[v1] Tue, 31 Dec 2024 08:52:40 UTC (2,221 KB)
