LCS-CTC: Leveraging Soft Alignments to Enhance Phonetic Transcription Robustness

Ye, Zongli; Lian, Jiachen; Gupta, Akshaj; Zhou, Xuanru; Li, Haodong; Patel, Krish; Park, Hwi Joo; Zhou, Dingkun; Guo, Chenxu; Li, Shuhe; Wang, Sam; Zhou, Iris; Cho, Cheol Jun; Ezzes, Zoe; Vonk, Jet M. J.; Morin, Brittany T.; Bogley, Rian; Wauters, Lisa; Miller, Zachary A.; Gorno-Tempini, Maria Luisa; Anumanchipalli, Gopala

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2508.03937 (eess)

[Submitted on 5 Aug 2025 (v1), last revised 13 Aug 2025 (this version, v2)]

Title: LCS-CTC: Leveraging Soft Alignments to Enhance Phonetic Transcription Robustness

Abstract: Phonetic speech transcription is crucial for fine-grained linguistic analysis and downstream speech applications. Connectionist Temporal Classification (CTC) is widely used for such tasks because of its efficiency, but it often falls short in recognition performance, especially on unclear and non-fluent speech. In this work, we propose LCS-CTC, a two-stage framework for phoneme-level speech recognition that combines a similarity-aware local alignment algorithm with a constrained CTC training objective. By predicting fine-grained frame-phoneme cost matrices and applying a modified Longest Common Subsequence (LCS) algorithm, our method identifies high-confidence alignment zones, which are used to constrain the CTC decoding path space. This reduces overfitting and improves generalization, enabling both robust recognition and text-free forced alignment. Experiments on LibriSpeech and PPA show that LCS-CTC consistently outperforms vanilla CTC baselines, suggesting its potential to unify phoneme modeling across fluent and non-fluent speech.
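
The abstract does not spell out the modified LCS algorithm, so the following is only a minimal sketch of the general idea: a textbook LCS dynamic program run over a frame-phoneme similarity matrix, where a frame and a phoneme count as "matching" when their similarity clears a threshold. The function name lcs_alignment, the sim matrix layout, and the threshold parameter are assumptions made for illustration, not the authors' implementation. In particular, this version matches each frame to at most one phoneme, whereas the paper's modification may well differ.

```python
from typing import List, Tuple


def lcs_alignment(sim: List[List[float]], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (frame, phoneme) index pairs lying on one longest common subsequence.

    sim[t][n] is a hypothetical similarity between frame t and phoneme n;
    the pair is treated as a match when sim[t][n] >= threshold.
    """
    num_frames = len(sim)
    num_phonemes = len(sim[0]) if num_frames else 0

    # dp[t][n] = LCS length over the first t frames and the first n phonemes.
    dp = [[0] * (num_phonemes + 1) for _ in range(num_frames + 1)]
    for t in range(1, num_frames + 1):
        for n in range(1, num_phonemes + 1):
            if sim[t - 1][n - 1] >= threshold:
                dp[t][n] = dp[t - 1][n - 1] + 1
            else:
                dp[t][n] = max(dp[t - 1][n], dp[t][n - 1])

    # Backtrack from the bottom-right corner to recover the matched pairs.
    pairs: List[Tuple[int, int]] = []
    t, n = num_frames, num_phonemes
    while t > 0 and n > 0:
        if sim[t - 1][n - 1] >= threshold:
            pairs.append((t - 1, n - 1))
            t -= 1
            n -= 1
        elif dp[t - 1][n] >= dp[t][n - 1]:
            t -= 1
        else:
            n -= 1
    pairs.reverse()
    return pairs


# Toy input: 4 frames, 3 phonemes; frame 1 resembles no phoneme strongly.
toy_sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.1],
    [0.1, 0.8, 0.2],
    [0.0, 0.1, 0.7],
]
toy_pairs = lcs_alignment(toy_sim, threshold=0.5)
```

Under this sketch, the returned pairs play the role of the high-confidence alignment zones: frames that appear in a pair would be pinned to their phoneme, while unmatched frames (frame 1 in the toy input) would stay unconstrained for CTC to resolve.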

Comments:	2025 ASRU. Correct author list
Subjects:	Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2508.03937 [eess.AS]
	(or arXiv:2508.03937v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2508.03937

Submission history

From: Jiachen Lian
[v1] Tue, 5 Aug 2025 21:59:35 UTC (933 KB)
[v2] Wed, 13 Aug 2025 18:58:04 UTC (933 KB)
