CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

Wang, Hankun; Guo, Yiwei; Shao, Chongtian; Li, Bohan; Chen, Xie; Yu, Kai


arXiv:2506.21074 (eess)

[Submitted on 26 June 2025]


Title: CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

Authors: Hankun Wang, Yiwei Guo, Chongtian Shao, Bohan Li, Xie Chen, Kai Yu


Abstract: Neural speech codecs are widely used in audio compression and various downstream tasks. Current mainstream codecs are fixed-frame-rate (FFR): they allocate the same number of tokens to every equal-duration slice. However, speech is inherently non-uniform in temporal information density, so many tokens are wasted on steady-state segments such as long vowels and silences. To address this mismatch, we present CodecSlime, a plugin-style method that compresses temporal redundancy by supporting dynamic frame rate (DFR) on neural speech codecs for the first time. Our method is unsupervised and architecture-agnostic, and combines two key innovations, ScheDFR and Melt-and-Cool, which adapt inference and training, respectively. When integrated into a typical VQ-GAN codec backbone and operated at 40 Hz DFR ($\approx$ 600 bps), CodecSlime reduces reconstruction WER by up to 46% relative to conventional FFR baselines with the same model architecture and similar bitrates, while remaining competitive on other metrics. CodecSlime also enables flexible trade-offs between reconstruction quality and bitrate: a single model supports inference at multiple frame rates and consistently outperforms FFR models at the corresponding frame rates. Audio samples are available at https://acadarmeria.github.io/codecslime/.
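To make the idea of a dynamic frame rate concrete, the following is a toy sketch only. It is not ScheDFR or Melt-and-Cool, whose details the abstract does not give. It assumes a hypothetical 80 Hz fixed-rate frame sequence and a naive greedy rule (merge consecutive frames that stay within a threshold of the first frame in the run), then reports the resulting average frame rate. The function names, the threshold, and the 80 Hz base rate are all illustrative assumptions.

```python
from typing import List, Tuple

Frame = List[float]


def _close_run(run: List[Frame]) -> Tuple[Frame, int]:
    """Collapse a run of frames into (mean frame, duration in frames)."""
    dim = len(run[0])
    mean = [sum(f[i] for f in run) / len(run) for i in range(dim)]
    return mean, len(run)


def merge_steady_frames(frames: List[Frame], threshold: float = 0.1) -> List[Tuple[Frame, int]]:
    """Naive greedy merge (an assumption, not the paper's scheduler).

    A frame joins the current run while its largest per-dimension
    difference from the run's first frame is at most `threshold`.
    """
    segments: List[Tuple[Frame, int]] = []
    run: List[Frame] = []
    for frame in frames:
        if run and max(abs(a - b) for a, b in zip(frame, run[0])) > threshold:
            segments.append(_close_run(run))
            run = []
        run.append(frame)
    if run:
        segments.append(_close_run(run))
    return segments


def average_frame_rate(num_segments: int, num_frames: int, base_rate_hz: float) -> float:
    """Average token rate after merging, given the fixed base frame rate."""
    return base_rate_hz * num_segments / num_frames


# Hypothetical input: 3 "silence" frames, 1 transition frame, 4 "long vowel" frames.
frames = [[0.0, 0.0]] * 3 + [[1.0, 0.0]] + [[1.0, 1.0]] * 4
segments = merge_steady_frames(frames)
durations = [d for _, d in segments]
rate = average_frame_rate(len(segments), len(frames), base_rate_hz=80.0)
print(durations, rate)
```

In this toy input the steady-state runs collapse to one token each, so fewer tokens cover the same duration. A decoder would also need the per-token durations to restore the timeline, and how those durations are encoded is not stated in the abstract.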

Comments:	16 pages, 5 figures, 9 tables
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2506.21074 [eess.AS]
	(or arXiv:2506.21074v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2506.21074

Submission history

From: Hankun Wang
[v1] Thu, 26 Jun 2025 07:59:04 UTC (2,955 KB)
