PAC: Pronunciation-Aware Contextualized Large Language Model-based Automatic Speech Recognition

Fu, Li; Xin, Yu; Zeng, Sunlu; Fan, Lu; Wu, Youzheng; He, Xiaodong

arXiv:2509.12647 (cs)

[Submitted on 16 Sep 2025]

Abstract: This paper presents a Pronunciation-Aware Contextualized (PAC) framework that addresses two key challenges in Large Language Model (LLM)-based Automatic Speech Recognition (ASR) systems: effective pronunciation modeling and robust homophone discrimination, both of which are essential for recognizing rare or long-tail words. The approach follows a two-stage learning paradigm. First, we introduce a pronunciation-guided context learning method that employs an interleaved grapheme-phoneme context modeling strategy with grapheme-only distractors, encouraging the model to rely on phonemic cues for accurate recognition. Second, we propose a pronunciation-discriminative reinforcement learning method with perturbed label sampling that further improves the model's ability to distinguish contextualized homophones. Experiments on the public English LibriSpeech and Mandarin AISHELL-1 datasets show that PAC (1) reduces Word Error Rate (WER) by 30.2% and 53.8% relative to pre-trained LLM-based ASR models, and (2) reduces biased WER on long-tail words by 31.8% and 60.5% relative to strong baselines, respectively.
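To make the first stage more concrete, the sketch below assembles a biasing context in which target words appear as interleaved grapheme-phoneme entries while distractors appear as graphemes only. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `build_context`, the `word [PHONEMES]` entry format, the `; ` separator, and the ARPAbet-style lexicon are all hypothetical choices.

```python
import random
from typing import Dict, List, Optional


def build_context(
    targets: List[str],
    distractors: List[str],
    lexicon: Dict[str, str],
    seed: Optional[int] = None,
) -> str:
    """Hypothetical biasing-prompt builder (format is an assumption, not PAC's).

    Target (bias) words are written as interleaved grapheme-phoneme entries,
    e.g. "Siobhan [SH IH0 V AO1 N]"; distractors are grapheme-only, so the
    model has to use the phonemic cue to match the audio to the right entry.
    """
    rng = random.Random(seed)
    # Grapheme followed by its pronunciation from the (hypothetical) lexicon.
    entries = [f"{w} [{lexicon[w]}]" for w in targets]
    # Grapheme-only distractors; skip any that duplicate a target word.
    entries += [d for d in distractors if d not in targets]
    # Shuffle so entry position carries no information about which is correct.
    rng.shuffle(entries)
    return "; ".join(entries)


if __name__ == "__main__":
    lexicon = {"Siobhan": "SH IH0 V AO1 N"}
    print(build_context(["Siobhan"], ["Sheila", "Sabine"], lexicon, seed=0))
```

The seeded `random.Random` instance keeps context construction reproducible across training runs without touching the global random state.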
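For the second stage, one plausible reading of "perturbed label sampling" is that homophone-substituted transcripts serve as hard negative candidates, which a WER-based reward then separates from the correct transcript. The sketch below follows that assumption only: the homophone table, the helper names (`perturb_labels`, `rewards`), and the reward of 1 - WER are hypothetical and not taken from the paper. WER is computed here as word-level Levenshtein distance divided by reference length.

```python
from typing import Dict, List


def word_errors(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            )
        prev = cur
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate of hyp against ref (whitespace tokenization)."""
    r, h = ref.split(), hyp.split()
    return word_errors(r, h) / max(len(r), 1)


def perturb_labels(ref: str, homophones: Dict[str, List[str]]) -> List[str]:
    """Hypothetical perturbation: swap one word at a time for each listed homophone."""
    words = ref.split()
    out = []
    for i, w in enumerate(words):
        for alt in homophones.get(w, []):
            out.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return out


def rewards(ref: str, candidates: List[str]) -> List[float]:
    """Hypothetical reward 1 - WER, so the correctly spelled candidate scores highest."""
    return [1.0 - wer(ref, c) for c in candidates]


if __name__ == "__main__":
    ref = "please call sean tomorrow"
    cands = [ref] + perturb_labels(ref, {"sean": ["shawn", "shaun"]})
    for c, r in zip(cands, rewards(ref, cands)):
        print(f"{r:.2f}  {c}")
```

Here the correct transcript receives a reward of 1.0 and each homophone substitution receives 0.75, since one of four words is wrong. A policy-gradient update driven by such rewards would push probability mass toward the contextually correct spelling.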

Comments:	Submitted to ICASSP 2026
Subjects:	Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2509.12647 [cs.CL]
	(or arXiv:2509.12647v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2509.12647

Submission history

From: Li Fu
[v1] Tue, 16 Sep 2025 04:07:28 UTC (140 KB)
