
Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2507.06249 (eess)
[Submitted on 4 Jul 2025]

Title: Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation


Authors: Saierdaer Yusuyin, Te Ma, Hao Huang, Zhijian Ou
Abstract: Recently, pre-trained models with phonetic supervision have demonstrated their advantages for crosslingual speech recognition in data efficiency and information sharing across languages. However, a limitation is that a pronunciation lexicon is needed for such phoneme-based crosslingual speech recognition. In this study, we aim to eliminate the need for pronunciation lexicons and propose a method based on a latent variable model, with phonemes treated as discrete latent variables. The new method consists of a speech-to-phoneme (S2P) model and a phoneme-to-grapheme (P2G) model, and a grapheme-to-phoneme (G2P) model is introduced as an auxiliary inference model. To jointly train the three models, we utilize the joint stochastic approximation (JSA) algorithm, a stochastic extension of the EM (expectation-maximization) algorithm that has demonstrated superior performance, particularly in estimating discrete latent variable models. Based on the Whistle multilingual pre-trained S2P model, crosslingual experiments are conducted in Polish (130 h) and Indonesian (20 h). With only 10 minutes of phoneme supervision, the new method, JSA-SPG, achieves 5% error rate reductions compared to the best crosslingual fine-tuning approach using subword or full phoneme supervision. Furthermore, in language domain adaptation (i.e., utilizing cross-domain text-only data), JSA-SPG, with the auxiliary support of the G2P model, outperforms the standard practice of language model fusion by 9% error rate reductions. To facilitate reproducibility and encourage further exploration in this field, we open-source the JSA-SPG training code and complete pipeline.
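The abstract describes the joint training only at a high level. As a rough, non-authoritative sketch of the idea, the PyTorch-style code below shows what one JSA update for this model could look like: a phoneme sequence is proposed by the G2P inference model, accepted or rejected by a Metropolis independence sampler targeting the posterior implied by the S2P and P2G models, and all three models are then updated at the accepted sample. Every interface here (s2p.log_prob, g2p.sample, the single shared optimizer, per-utterance caching) is a hypothetical assumption for illustration; the authors' open-sourced JSA-SPG code is the authoritative implementation.

```python
# Illustrative sketch of one JSA training step for the latent-variable model
# sketched in the abstract: S2P + P2G form the generative path, G2P is the
# auxiliary inference model. Interfaces are assumed, not the authors' code.
import torch


def jsa_step(s2p, p2g, g2p, speech, graphemes, z_cached, optimizer):
    """One JSA update on a single (speech, graphemes) utterance pair.

    Latent variable z: a discrete phoneme sequence.
    Generative path:  p_S2P(z | speech) * p_P2G(graphemes | z)
    Inference model:  q_G2P(z | graphemes), used as the MIS proposal.
    z_cached: the phoneme sequence accepted for this utterance previously.
    """
    def log_target(z):
        # Unnormalized log posterior log p(z | speech, graphemes).
        return s2p.log_prob(z, speech) + p2g.log_prob(graphemes, z)

    # 1) Propose phonemes from the G2P model and run one step of a
    #    Metropolis independence sampler targeting the posterior.
    with torch.no_grad():
        z_prop = g2p.sample(graphemes)
        log_alpha = (log_target(z_prop) - g2p.log_prob(z_prop, graphemes)) \
                  - (log_target(z_cached) - g2p.log_prob(z_cached, graphemes))
        if torch.rand(()).log() < log_alpha:
            z_cached = z_prop  # accept the proposal

    # 2) Stochastic-approximation updates (the EM-like part of JSA): ascend
    #    the log-likelihoods of all three models at the accepted sample.
    loss = -(s2p.log_prob(z_cached, speech)
             + p2g.log_prob(graphemes, z_cached)
             + g2p.log_prob(z_cached, graphemes))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return z_cached  # persist for the next sampling step on this utterance
```

In practice one would process utterances in batches, and choices such as caching accepted samples across epochs or using separate optimizers per model are design details the paper may handle differently.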
Comments: Submitted to IEEE TASLP
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2507.06249 [eess.AS]
  (or arXiv:2507.06249v1 [eess.AS] for this version)
  https://doi.org/10.48550/arXiv.2507.06249
arXiv-issued DOI via DataCite

Submission history

From: Saierdaer Yusuyin
[v1] Fri, 4 Jul 2025 12:23:22 UTC (1,326 KB)