Long-Form Speech Generation with Spoken Language Models

Park, Se Jin; Salazar, Julian; Jansen, Aren; Kinoshita, Keisuke; Ro, Yong Man; Skerry-Ryan, RJ

Computer Science > Computation and Language

arXiv:2412.18603v2 (cs)

[Submitted on 24 Dec 2024 (v1), last revised 10 Jul 2025 (this version, v2)]

Title: Long-Form Speech Generation with Spoken Language Models

Authors: Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan

Abstract: We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, textless spoken language models struggle to generate plausible speech past tens of seconds, due to high temporal resolution of speech tokens causing loss of coherence, architectural issues with long-sequence training or extrapolation, and memory costs at inference time. From these considerations we derive SpeechSSM, the first speech language model family to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency on multi-minute generations while still matching them at the utterance level. As we found current spoken language evaluations uninformative, especially in this new long-form setting, we also introduce: LibriSpeech-Long, a benchmark for long-form speech evaluation; new embedding-based and LLM-judged metrics; and quality measurements over length and time. Speech samples, the LibriSpeech-Long dataset, and any future code or model releases can be found at https://google.github.io/tacotron/publications/speechssm/.

Comments:	Accepted to ICML 2025 (oral)
Subjects:	Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2412.18603 [cs.CL]
	(or arXiv:2412.18603v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.18603

Submission history

From: Julian Salazar
[v1] Tue, 24 Dec 2024 18:56:46 UTC (298 KB)
[v2] Thu, 10 Jul 2025 17:52:43 UTC (366 KB)
