
Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2507.06249 (eess)
[Submitted on 4 Jul 2025]

Title: Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation


Authors: Saierdaer Yusuyin, Te Ma, Hao Huang, Zhijian Ou
Abstract: Recently, pre-trained models with phonetic supervision have demonstrated their advantages for crosslingual speech recognition in data efficiency and information sharing across languages. However, a limitation is that a pronunciation lexicon is needed for such phoneme-based crosslingual speech recognition. In this study, we aim to eliminate the need for pronunciation lexicons and propose a method based on a latent variable model, with phonemes treated as discrete latent variables. The new method consists of a speech-to-phoneme (S2P) model and a phoneme-to-grapheme (P2G) model, and a grapheme-to-phoneme (G2P) model is introduced as an auxiliary inference model. To jointly train the three models, we utilize the joint stochastic approximation (JSA) algorithm, a stochastic extension of the EM (expectation-maximization) algorithm that has demonstrated superior performance, particularly in estimating discrete latent variable models. Based on the Whistle multilingual pre-trained S2P model, crosslingual experiments are conducted in Polish (130 h) and Indonesian (20 h). With only 10 minutes of phoneme supervision, the new method, JSA-SPG, achieves 5% error rate reductions compared to the best crosslingual fine-tuning approach using subword or full phoneme supervision. Furthermore, in language domain adaptation (i.e., utilizing cross-domain text-only data), JSA-SPG, with the auxiliary support of the G2P model, outperforms the standard practice of language model fusion by 9% error rate reductions. To facilitate reproducibility and encourage further exploration in this field, we open-source the JSA-SPG training code and complete pipeline.
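The abstract describes the joint training only at a high level. As a rough, non-authoritative sketch of the idea, the PyTorch-style code below shows what one JSA update for this model could look like: a phoneme sequence is proposed by the G2P inference model, accepted or rejected by a Metropolis independence sampler targeting the posterior implied by the S2P and P2G models, and all three models are then updated at the accepted sample. Every interface here (s2p.log_prob, g2p.sample, the single shared optimizer, per-utterance caching) is a hypothetical assumption for illustration; the authors' open-sourced JSA-SPG code is the authoritative implementation.

```python
# Illustrative sketch of one JSA training step for the latent-variable model
# sketched in the abstract: S2P + P2G form the generative path, G2P is the
# auxiliary inference model. Interfaces are assumed, not the authors' code.
import torch


def jsa_step(s2p, p2g, g2p, speech, graphemes, z_cached, optimizer):
    """One JSA update on a single (speech, graphemes) utterance pair.

    Latent variable z: a discrete phoneme sequence.
    Generative path:  p_S2P(z | speech) * p_P2G(graphemes | z)
    Inference model:  q_G2P(z | graphemes), used as the MIS proposal.
    z_cached: the phoneme sequence accepted for this utterance previously.
    """
    def log_target(z):
        # Unnormalized log posterior log p(z | speech, graphemes).
        return s2p.log_prob(z, speech) + p2g.log_prob(graphemes, z)

    # 1) Propose phonemes from the G2P model and run one step of a
    #    Metropolis independence sampler targeting the posterior.
    with torch.no_grad():
        z_prop = g2p.sample(graphemes)
        log_alpha = (log_target(z_prop) - g2p.log_prob(z_prop, graphemes)) \
                  - (log_target(z_cached) - g2p.log_prob(z_cached, graphemes))
        if torch.rand(()).log() < log_alpha:
            z_cached = z_prop  # accept the proposal

    # 2) Stochastic-approximation updates (the EM-like part of JSA): ascend
    #    the log-likelihoods of all three models at the accepted sample.
    loss = -(s2p.log_prob(z_cached, speech)
             + p2g.log_prob(graphemes, z_cached)
             + g2p.log_prob(z_cached, graphemes))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return z_cached  # persist for the next sampling step on this utterance
```

In practice one would process utterances in batches, and choices such as caching accepted samples across epochs or using separate optimizers per model are design details the paper may handle differently.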
Comments: Submitted to IEEE TASLP
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2507.06249 [eess.AS]
  (or arXiv:2507.06249v1 [eess.AS] for this version)
  https://doi.org/10.48550/arXiv.2507.06249
arXiv-issued DOI via DataCite

Submission history

From: Saierdaer Yusuyin
[v1] Fri, 4 Jul 2025 12:23:22 UTC (1,326 KB)