TICL: Text-Embedding KNN For Speech In-Context Learning Unlocks Speech Recognition Abilities of Large Multimodal Models

Zheng, Haolong; Yegorova, Yekaterina; Hasegawa-Johnson, Mark

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2509.13395 (eess)

[Submitted on 16 Sep 2025]

Abstract: Speech foundation models have recently demonstrated the ability to perform Speech In-Context Learning (SICL). Selecting effective in-context examples is crucial for SICL performance, yet selection methodologies remain underexplored. In this work, we propose Text-Embedding KNN for SICL (TICL), a simple pipeline that uses semantic context to enhance off-the-shelf large multimodal models' speech recognition ability without fine-tuning. Across challenging automatic speech recognition tasks, including accented English, multilingual speech, and children's speech, our method enables models to surpass zero-shot performance with up to 84.7% relative WER reduction. We conduct ablation studies to show the robustness and efficiency of our method.
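The selection step named in the title, KNN over text embeddings, can be illustrated with a minimal sketch. The abstract does not say which embedding model is used, how the query text is obtained, or how the chosen examples are formatted into the prompt, so everything below is an assumption made for illustration: the toy 3-dimensional vectors stand in for real text embeddings, cosine similarity is an assumed distance measure, and the names `cosine`, `knn_select`, and `relative_wer_reduction` are hypothetical helpers, not the authors' code. The last helper shows how a relative WER reduction such as the reported 84.7% is conventionally computed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_select(query_vec, pool, k):
    """Return the ids of the k pool items whose embeddings are most similar to query_vec.

    pool is a list of (example_id, embedding) pairs. In a real setup the
    embeddings would come from a text-embedding model applied to transcripts
    (an assumption; the abstract does not name one).
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [example_id for example_id, _ in ranked[:k]]

def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction by which WER dropped relative to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer

# Toy candidate pool: hypothetical utterance ids with made-up embeddings.
pool = [
    ("utt_a", [1.0, 0.0, 0.0]),
    ("utt_b", [0.9, 0.1, 0.0]),
    ("utt_c", [0.0, 1.0, 0.0]),
    ("utt_d", [0.0, 0.0, 1.0]),
]
query = [1.0, 0.05, 0.0]  # made-up embedding of the query text

selected = knn_select(query, pool, k=2)
print(selected)

# Illustrative numbers only, not results from the paper.
print(relative_wer_reduction(20.0, 5.0))
```

In a pipeline of this kind, the selected examples (audio paired with transcripts) would be placed in the model's context ahead of the test utterance; that prompt-construction step is model-specific and is not sketched here.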

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as: arXiv:2509.13395 [eess.AS]
(or arXiv:2509.13395v1 [eess.AS] for this version)
https://doi.org/10.48550/arXiv.2509.13395

Submission history

From: Haolong Zheng
[v1] Tue, 16 Sep 2025 17:07:23 UTC (233 KB)
