PatchDSU: Uncertainty Modeling for Out of Distribution Generalization in Keyword Spotting

Chernyak, Bronya Roni; Segal, Yael; Shrem, Yosi; Keshet, Joseph

arXiv:2508.03190 (eess)

[Submitted on 5 Aug 2025]

Title: PatchDSU: Uncertainty Modeling for Out of Distribution Generalization in Keyword Spotting

Authors: Bronya Roni Chernyak, Yael Segal, Yosi Shrem, Joseph Keshet

Abstract: Deep learning models excel at many tasks but rely on the assumption that training and test data follow the same distribution. This assumption often fails in real-world speech systems, where distribution shifts are common due to varying environments, recording conditions, and speaker diversity. The Domain Shifts with Uncertainty (DSU) method augments the input of each neural network layer based on the input feature statistics. It addresses out-of-domain generalization by assuming that the feature statistics follow a multivariate Gaussian distribution and substituting the input with features sampled from this distribution. While effective for computer vision, applying DSU to speech presents challenges due to the nature of the data. Unlike static visual data, speech is a temporal signal commonly represented by a spectrogram, which shows how frequency content changes over time. This representation cannot be treated as a simple image, and its sparsity can skew the feature statistics when DSU is applied to the entire input. To tackle out-of-distribution issues in keyword spotting, we propose PatchDSU, which extends DSU by splitting the input into patches and independently augmenting each patch. We evaluated PatchDSU and DSU alongside other methods on the Google Speech Commands, Librispeech, and TED-LIUM datasets. Additionally, we evaluated performance under white Gaussian noise and MUSAN music noise conditions. We also explored out-of-domain generalization by analyzing model performance on datasets the models were not trained on. Overall, in most cases, both PatchDSU and DSU outperform other methods. Notably, PatchDSU demonstrates more consistent improvements across the evaluated scenarios than the other approaches.
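
The abstract describes the mechanism only in words, so the following NumPy sketch may help make it concrete. It is an illustration under stated assumptions, not the authors' implementation: the `(batch, channels, freq, time)` layout, the estimation of uncertainty from the batch variance of the statistics, the non-overlapping patch grid, and the names `dsu`, `patch_dsu`, `patch_f`, and `patch_t` are all assumptions introduced here. In practice such a perturbation would typically be applied only during training.

```python
import numpy as np


def dsu(x, rng, eps=1e-6):
    """DSU-style perturbation of feature statistics (illustrative sketch).

    x: array of shape (batch, channels, freq, time) -- an assumed layout.
    """
    x = np.asarray(x, dtype=np.float64)

    # Per-instance, per-channel mean and standard deviation over freq and time.
    mu = x.mean(axis=(2, 3), keepdims=True)
    sigma = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)

    # Uncertainty of each statistic, estimated from its spread across the batch.
    mu_std = np.sqrt(mu.var(axis=0, keepdims=True) + eps)
    sigma_std = np.sqrt(sigma.var(axis=0, keepdims=True) + eps)

    # Sample new statistics from Gaussians centred on the observed ones.
    new_mu = mu + rng.standard_normal(mu.shape) * mu_std
    new_sigma = sigma + rng.standard_normal(sigma.shape) * sigma_std

    # Normalise with the observed statistics, then rescale with the sampled ones.
    return (x - mu) / sigma * new_sigma + new_mu


def patch_dsu(x, patch_f, patch_t, rng):
    """Apply the DSU sketch independently to each non-overlapping patch.

    patch_f, patch_t: hypothetical patch sizes along frequency and time.
    """
    x = np.asarray(x, dtype=np.float64)
    _, _, n_freq, n_time = x.shape
    if n_freq % patch_f or n_time % patch_t:
        raise ValueError("patch sizes must divide the freq and time dimensions")

    out = np.empty_like(x)
    for i in range(0, n_freq, patch_f):
        for j in range(0, n_time, patch_t):
            # Each patch gets its own statistics and its own random draw.
            out[:, :, i:i + patch_f, j:j + patch_t] = dsu(
                x[:, :, i:i + patch_f, j:j + patch_t], rng
            )
    return out


# Example: a batch of 4 single-channel "spectrograms" with 8 bins and 16 frames.
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 1, 8, 16))
augmented = patch_dsu(features, patch_f=4, patch_t=8, rng=rng)
```

Computing the statistics per patch rather than over the whole input means that a mostly silent region of a spectrogram no longer dominates the mean and variance used to perturb the regions that contain speech, which is one way to read the sparsity concern raised in the abstract.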

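Comments:	This work has been submitted to the IEEE for possible publication
Subjects:	Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Cite as:	arXiv:2508.03190 [eess.AS]
	(or arXiv:2508.03190v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2508.03190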

Submission history

From: Bronya Roni Chernyak
[v1] Tue, 5 Aug 2025 07:57:01 UTC (1,042 KB)
