Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2509.12583 (eess)
[Submitted on 16 Sep 2025 (v1), last revised 24 Sep 2025 (this version, v2)]

Title: Robust Audio-Visual Target Speaker Extraction with Emotion-Aware Multiple Enrollment Fusion


Authors: Zhan Jin, Bang Zeng, Peijun Yang, Jiarong Du, Juan Liu, Ming Li
Abstract: Target Speaker Extraction (TSE) is a critical challenge in cocktail party scenarios. While leveraging multiple modalities, such as voice, lip, face, and expression embeddings, can enhance performance, real-world applications often suffer from intermittent modality dropout. This paper presents a comprehensive study on the interactions and robustness of various multimodal fusion strategies under varying degrees of modality dropout. We build upon a state-of-the-art audio-visual speech enhancement system and integrate four distinct speaker identity cues: lip embeddings for synchronized contextual information, a voice speaker embedding extracted via cross-attention for acoustic consistency, a static face embedding for speaker identity, and a novel dynamic expression embedding for frame-wise emotional features. We systematically evaluate different combinations of these modalities under two key training regimes: zero dropout and 80% modality dropout. Extensive experiments demonstrate that while a full multimodal ensemble achieves optimal performance under ideal (zero dropout) conditions, its effectiveness diminishes significantly when test-time dropout occurs without prior exposure during training. Crucially, we show that training with a high (80%) modality dropout rate dramatically enhances model robustness, enabling the system to maintain superior performance even under severe test-time missing modalities. Our findings highlight that voice embeddings exhibit consistent robustness, while the proposed expression embedding provides valuable complementary information. This work underscores the importance of training strategies that account for real-world imperfection, moving beyond pure performance maximization to achieve practical reliability in multimodal speech enhancement systems.
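The abstract describes training under a high (80%) modality-dropout rate so the extractor stays robust when enrollment cues are missing at test time. The sketch below (PyTorch; not the authors' code) illustrates one way such modality dropout could be applied before fusing the four speaker cues. The module name, embedding dimensions, and the simple concatenation-based fusion are illustrative assumptions, not details from the paper.

    # Minimal sketch of modality-dropout training for multi-enrollment fusion.
    # Each cue (lip, voice, face, expression) is independently dropped with
    # probability p_drop during training to simulate intermittent modalities.
    import torch
    import torch.nn as nn

    class ModalityDropoutFusion(nn.Module):
        def __init__(self, dims, fused_dim=256, p_drop=0.8):
            super().__init__()
            self.p_drop = p_drop
            # one projection per modality, then concatenation-based fusion
            self.proj = nn.ModuleDict(
                {name: nn.Linear(d, fused_dim) for name, d in dims.items()}
            )
            self.fuse = nn.Linear(fused_dim * len(dims), fused_dim)

        def forward(self, embeddings):
            parts = []
            for name, x in embeddings.items():
                h = self.proj[name](x)
                if self.training and torch.rand(()) < self.p_drop:
                    h = torch.zeros_like(h)  # simulate a missing modality
                parts.append(h)
            return self.fuse(torch.cat(parts, dim=-1))

    # Usage with hypothetical embedding sizes (batch of 4 enrollments):
    dims = {"lip": 512, "voice": 192, "face": 512, "expr": 128}
    fusion = ModalityDropoutFusion(dims)
    emb = {k: torch.randn(4, d) for k, d in dims.items()}
    speaker_cue = fusion(emb)  # (4, 256) conditioning vector for the extractor

Zeroing a dropped cue rather than removing it keeps tensor shapes fixed, which is a common way to expose the model to missing modalities during training without changing the fusion architecture.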
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as: arXiv:2509.12583 [eess.AS]
  (or arXiv:2509.12583v2 [eess.AS] for this version)
  https://doi.org/10.48550/arXiv.2509.12583
arXiv-issued DOI via DataCite

Submission history

From: Zhan Jin
[v1] Tue, 16 Sep 2025 02:21:38 UTC (141 KB)
[v2] Wed, 24 Sep 2025 09:08:49 UTC (142 KB)