Binaural Sound Event Localization and Detection Neural Network based on HRTF Localization Cues for Humanoid Robots

Lee, Gyeong-Tae

arXiv:2508.04333 (eess)

[Submitted on 6 Aug 2025]

Title: Binaural Sound Event Localization and Detection Neural Network based on HRTF Localization Cues for Humanoid Robots

Authors: Gyeong-Tae Lee

Abstract: Humanoid robots require simultaneous sound event type and direction estimation for situational awareness, but conventional two-channel input struggles with elevation estimation and front-back confusion. This paper proposes a binaural sound event localization and detection (BiSELD) neural network to address these challenges. BiSELDnet learns time-frequency patterns and head-related transfer function (HRTF) localization cues from binaural input features. A novel eight-channel binaural time-frequency feature (BTFF) is introduced, comprising left/right mel-spectrograms, V-maps, an interaural time difference (ITD) map (below 1.5 kHz), an interaural level difference (ILD) map (above 5 kHz with front-back asymmetry), and spectral cue (SC) maps (above 5 kHz for elevation). The effectiveness of BTFF was confirmed across omnidirectional, horizontal, and median planes. BiSELDnets, particularly one based on the efficient Trinity module, were implemented to output time series of direction vectors for each sound event class, enabling simultaneous detection and localization. Vector activation map (VAM) visualization was proposed to analyze network learning, confirming BiSELDnet's focus on the N1 notch frequency for elevation estimation. Comparative evaluations under urban background noise conditions demonstrated that the proposed BiSELD model significantly outperforms state-of-the-art (SOTA) SELD models with binaural input.

Comments:	200 pages
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2508.04333 [eess.AS]
 	(or arXiv:2508.04333v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2508.04333
Journal reference:	Ph.D. Dissertation, KAIST, 2024

Submission history

From: Gyeong-Tae Lee Dr.
[v1] Wed, 6 Aug 2025 11:23:31 UTC (14,969 KB)
