DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis

Lu, Ye-Xin; Gu, Yu; Wei, Kun; Du, Hui-Peng; Ai, Yang; Ling, Zhen-Hua

arXiv:2509.14684v1 (eess)

[Submitted on 18 Sep 2025]

Abstract: This paper presents DAIEN-TTS, a zero-shot text-to-speech (TTS) framework that enables ENvironment-aware synthesis through Disentangled Audio Infilling. By taking separate speaker and environment prompts, DAIEN-TTS allows independent control over the timbre and the background environment of the synthesized speech. Built upon F5-TTS, DAIEN-TTS first uses a pretrained speech-environment separation (SES) module to disentangle the environmental speech into two mel-spectrograms, one of clean speech and one of environment audio. Two random span masks of varying lengths are then applied to the two mel-spectrograms, which, together with the text embedding, serve as conditions for infilling the masked environmental mel-spectrogram. This enables the simultaneous continuation of personalized speech and time-varying environmental audio. To further enhance controllability during inference, we adopt dual classifier-free guidance (DCFG) for the speech and environment components and introduce a signal-to-noise ratio (SNR) adaptation strategy to align the synthesized speech with the environment prompt. Experimental results demonstrate that DAIEN-TTS generates environmental personalized speech with high naturalness, strong speaker similarity, and high environmental fidelity.
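The abstract names two inference-time controls without giving formulas: guidance with separate scales for the speech and environment conditions, and an SNR adjustment. The sketch below is only an illustration of what such controls could look like, not the paper's formulation. The function names, the three-output guidance form, and the gain-based SNR matching are all assumptions introduced here.

```python
import math
from typing import List, Sequence


def dual_guidance(full: Sequence[float],
                  no_speech: Sequence[float],
                  no_env: Sequence[float],
                  w_speech: float,
                  w_env: float) -> List[float]:
    """Combine three model outputs using two independent guidance scales.

    Assumed form (hypothetical, not taken from the paper):
        full + w_speech * (full - no_speech) + w_env * (full - no_env)
    where `full` is the output with both prompts, `no_speech` drops the
    speech prompt, and `no_env` drops the environment prompt.
    """
    return [f + w_speech * (f - s) + w_env * (f - e)
            for f, s, e in zip(full, no_speech, no_env)]


def power(signal: Sequence[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(power(speech) / power(noise))


def gain_for_target_snr(speech: Sequence[float],
                        noise: Sequence[float],
                        target_db: float) -> float:
    """Amplitude gain to apply to `speech` so that its SNR against
    `noise` equals `target_db` (a 20 dB change is a factor of 10)."""
    return 10.0 ** ((target_db - snr_db(speech, noise)) / 20.0)


if __name__ == "__main__":
    speech = [1.0, -1.0, 1.0, -1.0]
    noise = [0.1, -0.1, 0.1, -0.1]
    gain = gain_for_target_snr(speech, noise, target_db=10.0)
    scaled = [gain * v for v in speech]
    print(round(snr_db(scaled, noise), 3))
```

Setting both guidance weights to zero returns the fully conditioned output unchanged, and raising one weight strengthens only the corresponding condition, which is the kind of independent control the abstract describes.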

Comments:	Submitted to ICASSP 2026
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2509.14684 [eess.AS]
	(or arXiv:2509.14684v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2509.14684

Submission history

From: Ye-Xin Lu
[v1] Thu, 18 Sep 2025 07:23:53 UTC (6,643 KB)
