Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2509.14684v1 (eess)
[Submitted on 18 Sep 2025]

Title: DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis

Authors: Ye-Xin Lu, Yu Gu, Kun Wei, Hui-Peng Du, Yang Ai, Zhen-Hua Ling
Abstract: This paper presents DAIEN-TTS, a zero-shot text-to-speech (TTS) framework that enables ENvironment-aware synthesis through Disentangled Audio Infilling. By leveraging separate speaker and environment prompts, DAIEN-TTS allows independent control over the timbre and the background environment of the synthesized speech. Built upon F5-TTS, DAIEN-TTS first incorporates a pretrained speech-environment separation (SES) module to disentangle the environmental speech into mel-spectrograms of clean speech and environment audio. Two random span masks of varying lengths are then applied to the two mel-spectrograms, and the masked mel-spectrograms, together with the text embedding, serve as conditions for infilling the masked environmental mel-spectrogram, enabling the simultaneous continuation of personalized speech and time-varying environmental audio. To further enhance controllability during inference, we adopt dual classifier-free guidance (DCFG) for the speech and environment components and introduce a signal-to-noise ratio (SNR) adaptation strategy to align the synthesized speech with the environment prompt. Experimental results demonstrate that DAIEN-TTS generates environmental personalized speech with high naturalness, strong speaker similarity, and high environmental fidelity.
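
The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `random_span_mask` and `apply_mask`, the span-length range (30% to 70% of the frames), the single contiguous span per mask, and the zero fill value are all assumptions made for illustration. Mel-spectrograms are represented as plain lists of frames so that the example needs only the Python standard library.

```python
import random


def random_span_mask(num_frames, min_frac=0.3, max_frac=0.7, rng=None):
    """Return a list of booleans; True marks a frame inside one contiguous masked span.

    The span length is drawn as a random fraction of the sequence length
    (the 0.3 to 0.7 range is an assumption, not a value from the paper).
    """
    rng = rng or random.Random()
    span_len = max(1, int(num_frames * rng.uniform(min_frac, max_frac)))
    span_len = min(span_len, num_frames)
    start = rng.randint(0, num_frames - span_len)  # inclusive on both ends
    return [start <= t < start + span_len for t in range(num_frames)]


def apply_mask(mel, mask, fill=0.0):
    """Replace masked frames of a (frames x bins) mel-spectrogram with a fill value."""
    return [[fill] * len(frame) if m else list(frame) for frame, m in zip(mel, mask)]


# Toy stand-ins for the two mel-spectrograms produced by the separation module.
rng = random.Random(0)
num_frames, num_bins = 10, 4
speech_mel = [[1.0] * num_bins for _ in range(num_frames)]
env_mel = [[2.0] * num_bins for _ in range(num_frames)]

# Two independently drawn masks, one per stream, so their lengths can differ.
speech_mask = random_span_mask(num_frames, rng=rng)
env_mask = random_span_mask(num_frames, rng=rng)

masked_speech = apply_mask(speech_mel, speech_mask)
masked_env = apply_mask(env_mel, env_mask)
```

Drawing the two masks independently means the speech and environment streams expose different amounts of context during training, which is one plausible reading of "two random span masks of varying lengths"; the paper itself should be consulted for the actual masking scheme.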
Comments: Submitted to ICASSP 2026
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as: arXiv:2509.14684 [eess.AS]
  (or arXiv:2509.14684v1 [eess.AS] for this version)
  https://doi.org/10.48550/arXiv.2509.14684
arXiv-issued DOI via DataCite

Submission history

From: Ye-Xin Lu [view email]
[v1] Thu, 18 Sep 2025 07:23:53 UTC (6,643 KB)