Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval

Flemotomos, Nikolaos; Hsiao, Roger; Swietojanski, Pawel; Hori, Takaaki; Can, Dogan; Zhuang, Xiaodan


arXiv:2411.00664v2 (eess)

[Submitted on 1 Nov 2024 (v1), last revised 4 Nov 2024 (this version, v2)]



Abstract: Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved transcription accuracy. However, the biasing mechanism is typically based on a cross-attention module between the audio and a catalogue of biasing entries, which means computational complexity can pose severe practical limitations on the size of the biasing catalogue and consequently on accuracy improvements. This work proposes an approximation to cross-attention scoring based on vector quantization, which enables compute- and memory-efficient use of large biasing catalogues. We propose to use this technique jointly with a retrieval-based contextual biasing approach. First, we use an efficient quantized retrieval module to shortlist biasing entries by grounding them on audio. Then we use the retrieved entries for biasing. Since the proposed approach is agnostic to the biasing method, we investigate using full cross-attention, LLM prompting, and a combination of the two. We show that retrieval-based shortlisting allows the system to efficiently leverage biasing catalogues of several thousand entries, resulting in up to 71% relative error rate reduction in personal entity recognition. At the same time, the proposed approximation algorithm reduces compute time by 20% and memory usage by 85-95%, for lists of up to one million entries, when compared to standard dot-product cross-attention.
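
The shortlisting idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the quantizer, the codebook training, or how frame-level scores are pooled, so the single codebook sampled from the keys, the max-pooling over audio frames, and all function names below (`build_index`, `shortlist`, `nearest_code`) are assumptions made for illustration only. The point it shows is that dot products are computed once per codebook vector rather than once per catalogue entry, and that each entry is stored as a single integer code.

```python
import random

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def nearest_code(vec, codebook):
    """Index of the codebook vector closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, codebook[c])),
    )

def build_index(keys, num_codes, seed=0):
    """Quantize each biasing-entry key to one integer code.

    Assumption: the codebook is a random sample of the keys, a stand-in
    for whatever trained quantizer a real system would use.
    """
    rng = random.Random(seed)
    codebook = rng.sample(keys, num_codes)
    codes = [nearest_code(k, codebook) for k in keys]
    return codebook, codes

def shortlist(queries, codebook, codes, top_k):
    """Return indices of the top_k entries under approximate attention scores."""
    # One dot product per (audio frame, code) instead of per (audio frame, entry).
    table = [[dot(q, c) for c in codebook] for q in queries]
    # An entry's approximate score is a table lookup by its code.
    # Assumption: scores are max-pooled over audio frames.
    scores = [max(row[code] for row in table) for code in codes]
    order = sorted(range(len(codes)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]

# Hypothetical usage with random data: 1000 entries, 20 audio frames, 8 dims.
rng = random.Random(1)
keys = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
queries = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]
codebook, codes = build_index(keys, num_codes=32)
top_entries = shortlist(queries, codebook, codes, top_k=10)
```

In this sketch the scoring cost grows with the number of codes (32) rather than the number of entries (1000), and only the codebook plus one integer per entry has to be kept in memory. The shortlisted entries would then be passed to whichever biasing method is used downstream.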

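Comments:	13 pages, 7 figures, submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Cite as:	arXiv:2411.00664 [eess.AS]
	(or arXiv:2411.00664v2 [eess.AS] for this version)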
	https://doi.org/10.48550/arXiv.2411.00664

Submission history

From: Roger Hsiao
[v1] Fri, 1 Nov 2024 15:28:03 UTC (5,730 KB)
[v2] Mon, 4 Nov 2024 17:05:58 UTC (5,731 KB)
