REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers

Jiang, Yuepeng; Ning, Ziqian; Wang, Shuai; Wang, Chengjia; Bi, Mengxiao; Zhu, Pengcheng; Fu, Zhonghua; Xie, Lei

arXiv:2508.04996v2 (eess)

[Submitted on 7 Aug 2025 (v1), last revised 8 Aug 2025 (this version, v2)]

Abstract: In real-world voice conversion applications, environmental noise in the source speech and user demand for expressive output pose critical challenges. Traditional ASR-based methods ensure noise robustness but suppress prosodic richness, while SSL-based models improve expressiveness but suffer from timbre leakage and noise sensitivity. This paper proposes REF-VC, a noise-robust, expressive voice conversion system. Key innovations include: (1) a random erasing strategy that mitigates the information redundancy inherent in SSL features, enhancing noise robustness and expressiveness; (2) implicit alignment, inspired by E2TTS, that suppresses reconstruction of non-essential features; (3) integration of Shortcut Models to accelerate flow matching inference, reducing it to 4 steps. Experimental results demonstrate that REF-VC outperforms baselines such as Seed-VC in zero-shot scenarios on the noisy set, while performing comparably to Seed-VC on the clean set. In addition, REF-VC supports singing voice conversion within the same model.
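
The abstract names two mechanisms without giving details: random erasing of SSL features and few-step flow matching inference. The following is a minimal Python sketch of what each could look like, not the authors' implementation. It assumes that erasing means zeroing random contiguous spans of frame-level feature vectors, and that the sampler is a plain Euler integrator whose velocity function also receives the step size, as shortcut-style models are generally described. The names `random_erase`, `euler_sample`, `erase_prob`, `max_span` and `velocity` are hypothetical and do not come from the paper.

```python
import random


def random_erase(frames, erase_prob=0.3, max_span=5, rng=None):
    """Zero out random contiguous spans of frame-level feature vectors.

    Assumption: "random erasing" is modelled here as span-wise zeroing
    along the time axis; the abstract does not specify the exact scheme.
    """
    rng = rng or random.Random()
    out = [list(f) for f in frames]  # copy so the input is not modified
    t = 0
    while t < len(out):
        if rng.random() < erase_prob:
            span = rng.randint(1, max_span)
            for i in range(t, min(t + span, len(out))):
                out[i] = [0.0] * len(out[i])
            t += span
        else:
            t += 1
    return out


def euler_sample(velocity, x0, num_steps=4):
    """Integrate dx/dt = velocity(x, t, d) from t=0 to t=1 with Euler steps.

    Assumption: `velocity` stands in for a trained network that takes the
    current state, the time t and the step size d.
    """
    d = 1.0 / num_steps
    x = list(x0)
    for k in range(num_steps):
        t = k * d
        v = velocity(x, t, d)
        x = [xi + d * vi for xi, vi in zip(x, v)]
    return x
```

In this sketch, `random_erase` would be applied to the SSL features during training so the model cannot rely on every frame, and `euler_sample(model, noise, num_steps=4)` would replace a longer integration loop at inference time. Both uses are assumptions about how the pieces fit together.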

Subjects:	Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2508.04996 [eess.AS]
	(or arXiv:2508.04996v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2508.04996

Submission history

From: Yuepeng Jiang
[v1] Thu, 7 Aug 2025 03:08:49 UTC (1,715 KB)
[v2] Fri, 8 Aug 2025 01:59:26 UTC (1,715 KB)
