Word Error Rate Definitions and Algorithms for Long-Form Multi-talker Speech Recognition

von Neumann, Thilo; Boeddeker, Christoph; Delcroix, Marc; Haeb-Umbach, Reinhold

doi:10.1109/TASLPRO.2025.3589862

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2508.02112 (eess)

[Submitted on 4 Aug 2025]

Title: Word Error Rate Definitions and Algorithms for Long-Form Multi-talker Speech Recognition

Authors: Thilo von Neumann, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

Abstract: The Word Error Rate (WER), the predominant metric for evaluating speech recognizers, has been extended in different ways to handle transcripts produced by long-form multi-talker speech recognizers. These systems process long transcripts containing multiple speakers and complex speaking patterns, so the classical WER cannot be applied. There are speaker-attributed approaches that count speaker confusion errors, such as the concatenated minimum-permutation WER (cpWER) and the time-constrained cpWER (tcpWER), and speaker-agnostic approaches that aim to ignore speaker confusion errors, such as the Optimal Reference Combination WER (ORC-WER) and the MIMO-WER. These WERs evaluate different aspects and error types (e.g., temporal misalignment). A detailed comparison has not yet been made. We therefore present a unified description of the existing WERs and highlight when to use which metric. To further analyze how many errors are caused by speaker confusion, we propose the Diarization-invariant cpWER (DI-cpWER). It ignores speaker attribution errors, and its difference from the cpWER reflects the impact of speaker confusions on the WER. Since error types cannot reliably be classified automatically, we discuss ways to visualize sequence alignments between the reference and hypothesis transcripts so that a human judge can spot errors more easily. Since some WER definitions have high computational complexity, we introduce a greedy algorithm that approximates the ORC-WER and DI-cpWER with high precision ($<0.1\%$ deviation in our experiments) and with polynomial instead of exponential complexity. To improve the plausibility of the metrics, we also incorporate the time constraint from the tcpWER into the ORC-WER and MIMO-WER, which also significantly reduces the computational complexity.
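To make the distinction between speaker-attributed and speaker-agnostic metrics in the abstract concrete, the following is a minimal brute-force sketch of cpWER and ORC-WER as they are commonly defined. It is not the authors' implementation and not the greedy algorithm proposed in the paper: the function names (`edit_distance`, `cp_wer`, `orc_wer`), the input layout, and the toy transcripts are assumptions made for illustration, and both searches are exponential (over speaker permutations and over utterance-to-channel assignments, respectively). The sketch also assumes the reference contains at least one word.

```python
from itertools import permutations, product


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def cp_wer(reference, hypothesis):
    """cpWER sketch: both arguments map a speaker label to a list of utterances.

    Each speaker's utterances are concatenated, and the speaker permutation
    with the fewest total word errors is selected.
    """
    refs = [" ".join(utts).split() for utts in reference.values()]
    hyps = [" ".join(utts).split() for utts in hypothesis.values()]
    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))
    hyps += [[]] * (n - len(hyps))
    errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return errors / sum(len(r) for r in refs)


def orc_wer(reference_utterances, hypothesis_streams):
    """ORC-WER sketch: assign every reference utterance to one output channel.

    reference_utterances: utterance strings sorted by start time.
    hypothesis_streams: one concatenated transcript string per output channel.
    All assignments are enumerated; utterance order is kept within a channel.
    """
    utts = [u.split() for u in reference_utterances]
    hyps = [s.split() for s in hypothesis_streams]
    best = None
    for assignment in product(range(len(hyps)), repeat=len(utts)):
        streams = [[] for _ in hyps]
        for utt, channel in zip(utts, assignment):
            streams[channel].extend(utt)
        errors = sum(edit_distance(r, h) for r, h in zip(streams, hyps))
        best = errors if best is None else min(best, errors)
    return best / sum(len(u) for u in utts)


# Toy example (invented data): every word is recognized correctly, but the
# utterance "fine thanks" of speaker B is attributed to the wrong speaker.
reference = {"A": ["hello world", "how are you"], "B": ["fine thanks"]}
confused = {"spk1": ["hello world", "fine thanks"], "spk2": ["how are you"]}

print(cp_wer(reference, confused))  # 4/7, about 0.571: confusion is counted
print(orc_wer(["hello world", "fine thanks", "how are you"],
              ["hello world fine thanks", "how are you"]))  # 0.0: confusion is ignored
```

In this toy case the gap between the two numbers comes entirely from the speaker confusion, which is the kind of effect the abstract says the difference between cpWER and a diarization-invariant metric is meant to expose.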

Comments:	Accepted for IEEE Transactions on Audio, Speech and Language Processing (TASLP), vol. 33
Subjects:	Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2508.02112 [eess.AS]
	(or arXiv:2508.02112v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2508.02112
Journal reference:	IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 3174-3188, 2025
Related DOI:	https://doi.org/10.1109/TASLPRO.2025.3589862

Submission history

From: Thilo von Neumann
[v1] Mon, 4 Aug 2025 06:42:48 UTC (85 KB)
