WaveLLDM: Design and Development of a Lightweight Latent Diffusion Model for Speech Enhancement and Restoration

Santoso, Kevin Putra; Sholikah, Rizka Wakhidatus; Ginardi, Raden Venantius Hari

Computer Science > Sound

arXiv:2508.21153 (cs)

[Submitted on 28 Aug 2025]

Title: WaveLLDM: Design and Development of a Lightweight Latent Diffusion Model for Speech Enhancement and Restoration

Authors: Kevin Putra Santoso, Rizka Wakhidatus Sholikah, Raden Venantius Hari Ginardi

Abstract: High-quality audio is essential in a wide range of applications, including online communication, virtual assistants, and the multimedia industry. However, degradation caused by noise, compression, and transmission artifacts remains a major challenge. While diffusion models have proven effective for audio restoration, they typically require significant computational resources and struggle to handle longer missing segments. This study introduces WaveLLDM (Wave Lightweight Latent Diffusion Model), an architecture that integrates an efficient neural audio codec with latent diffusion for audio restoration and denoising. Unlike conventional approaches that operate in the time or spectral domain, WaveLLDM processes audio in a compressed latent space, reducing computational complexity while preserving reconstruction quality. Empirical evaluations on the Voicebank+DEMAND test set demonstrate that WaveLLDM achieves accurate spectral reconstruction with low Log-Spectral Distance (LSD) scores (0.48 to 0.60) and good adaptability to unseen data. However, it still underperforms compared to state-of-the-art methods in terms of perceptual quality and speech clarity, with WB-PESQ scores ranging from 1.62 to 1.71 and STOI scores between 0.76 and 0.78. These limitations are attributed to suboptimal architectural tuning, the absence of fine-tuning, and insufficient training duration. Nevertheless, the flexible architecture that combines a neural audio codec and latent diffusion model provides a strong foundation for future development.
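
For readers unfamiliar with the LSD metric reported above, the following is a minimal sketch of one common definition: the root-mean-square difference of log power spectra per frame, averaged over frames. The STFT settings (`n_fft`, `hop`), the `eps` floor, and the function names are assumptions chosen for illustration; the abstract does not state the paper's exact evaluation configuration, so values from this sketch are not directly comparable to the reported scores.

```python
import numpy as np

def stft_power(x, n_fft=512, hop=128):
    """Power spectrogram from a Hann-windowed STFT, shape (frames, bins).

    Assumes len(x) >= n_fft; n_fft and hop are illustrative defaults.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def log_spectral_distance(reference, estimate, eps=1e-10):
    """Mean over frames of the RMS difference between log10 power spectra."""
    ref = np.log10(stft_power(reference) + eps)  # eps avoids log(0)
    est = np.log10(stft_power(estimate) + eps)
    per_frame = np.sqrt(np.mean((ref - est) ** 2, axis=1))
    return float(np.mean(per_frame))

# Synthetic signals stand in for clean and degraded speech (hypothetical data).
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = clean + 0.1 * rng.standard_normal(16000)

print(log_spectral_distance(clean, clean))  # identical signals give 0.0
print(log_spectral_distance(clean, noisy))  # positive for a degraded signal
```

Lower values indicate closer spectral agreement between the reference and the restored signal, which is why the abstract describes scores of 0.48 to 0.60 as low.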

Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2508.21153 [cs.SD]
	(or arXiv:2508.21153v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2508.21153

Submission history

From: Kevin Putra Santoso
[v1] Thu, 28 Aug 2025 18:38:42 UTC (1,541 KB)
