CenXiv.org

Audio and Speech Processing

  • New submissions
  • Cross-lists
  • Replacements

Showing new listings for Friday, 26 September 2025

Total of 27 entries

New submissions (showing 12 of 12 entries)

[1] arXiv:2509.20396 [cn-pdf, pdf, html, other]
Title: Data-Efficient ASR Personalization for Non-Normative Speech Using an Uncertainty-Based Phoneme Difficulty Score for Guided Sampling
Niclas Pokel, Pehuén Moure, Roman Boehringer, Yingqiang Gao
Subjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI) ; Sound (cs.SD)

Automatic speech recognition (ASR) systems struggle with non-normative speech from individuals with impairments caused by conditions like cerebral palsy or structural anomalies. The high acoustic variability and scarcity of training data severely degrade model performance. This work introduces a data-efficient personalization method that quantifies phoneme-level uncertainty to guide fine-tuning. We leverage Monte Carlo Dropout to estimate which phonemes a model finds most difficult and use these estimates for a targeted oversampling strategy. We validate our method on English and German datasets. Crucially, we demonstrate that our model-derived uncertainty strongly correlates with phonemes identified as challenging in an expert clinical logopedic report, marking, to our knowledge, the first work to successfully align model uncertainty with expert assessment of speech difficulty. Our results show that this clinically-validated, uncertainty-guided sampling significantly improves ASR accuracy, delivering a practical framework for personalized and inclusive ASR.
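
The abstract above describes the method at a high level only; the following minimal sketch, assuming a PyTorch acoustic model that returns frame-level phoneme posteriors of shape (T, n_phonemes), illustrates the general Monte Carlo Dropout recipe. The names model and audio and the frame-to-phoneme aggregation are placeholders, not the authors' implementation.

    import torch

    def mc_dropout_phoneme_difficulty(model, audio, n_passes=20):
        """Estimate a per-phoneme difficulty score by keeping dropout active
        at inference time and measuring the spread of stochastic predictions."""
        model.train()  # keep dropout layers stochastic
        with torch.no_grad():
            probs = torch.stack([model(audio).softmax(-1) for _ in range(n_passes)])
        mean_p = probs.mean(dim=0)                                   # (T, n_phonemes)
        entropy = -(mean_p * mean_p.clamp_min(1e-8).log()).sum(-1)   # per-frame uncertainty
        hard = mean_p.argmax(-1)                                     # most likely phoneme per frame
        n_phonemes = mean_p.shape[-1]
        difficulty = torch.zeros(n_phonemes).scatter_add_(0, hard, entropy)
        counts = torch.zeros(n_phonemes).scatter_add_(0, hard, torch.ones_like(entropy))
        return difficulty / counts.clamp_min(1.0)

Utterances containing high-difficulty phonemes would then be oversampled during fine-tuning, which is the targeted sampling strategy the abstract refers to.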

[2] arXiv:2509.20397 [cn-pdf, pdf, html, other]
Title: Variational Low-Rank Adaptation for Personalized Impaired Speech Recognition
Niclas Pokel, Pehuén Moure, Roman Boehringer, Shih-Chii Liu, Yingqiang Gao
Subjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI)

Speech impairments resulting from congenital disorders, such as cerebral palsy, Down syndrome, or Apert syndrome, as well as acquired brain injuries due to stroke, traumatic accidents, or tumors, present major challenges to automatic speech recognition (ASR) systems. Despite recent advancements, state-of-the-art ASR models like Whisper still struggle with non-normative speech due to limited training data availability and high acoustic variability. Moreover, collecting and annotating non-normative speech is burdensome: speaking is effortful for many affected individuals, while laborious annotation often requires caregivers familiar with the speaker. This work introduces a novel ASR personalization method based on Bayesian Low-rank Adaptation for data-efficient fine-tuning. We validate our method on the English UA-Speech dataset and a newly collected German speech dataset, BF-Sprache, from a child with structural speech impairment. The dataset and approach are designed to reflect the challenges of low-resource settings that include individuals with speech impairments. Our method significantly improves ASR accuracy for impaired speech while maintaining data and annotation efficiency, offering a practical path toward inclusive ASR.
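
The abstract does not specify the posterior parameterization; the sketch below shows one common way to make a LoRA adapter Bayesian (a Gaussian mean-field posterior over the low-rank factors, trained with the reparameterization trick and a KL penalty), offered as orientation rather than the authors' exact formulation.

    import torch
    import torch.nn as nn

    class VariationalLoRALinear(nn.Module):
        """Frozen base linear layer plus a low-rank update A @ B whose entries
        carry a Gaussian posterior (mean and log-variance)."""
        def __init__(self, base: nn.Linear, rank: int = 8):
            super().__init__()
            self.base = base.requires_grad_(False)
            d_out, d_in = base.weight.shape
            self.a_mu = nn.Parameter(torch.zeros(d_out, rank))
            self.a_logvar = nn.Parameter(torch.full((d_out, rank), -6.0))
            self.b_mu = nn.Parameter(0.01 * torch.randn(rank, d_in))
            self.b_logvar = nn.Parameter(torch.full((rank, d_in), -6.0))

        def _sample(self, mu, logvar):
            return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

        def forward(self, x):
            delta_w = self._sample(self.a_mu, self.a_logvar) @ self._sample(self.b_mu, self.b_logvar)
            return self.base(x) + x @ delta_w.t()

        def kl(self):
            """KL divergence to a standard-normal prior; a weighted kl() term is
            added to the ASR loss during fine-tuning."""
            kl = 0.0
            for mu, logvar in [(self.a_mu, self.a_logvar), (self.b_mu, self.b_logvar)]:
                kl = kl + 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum()
            return kl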

[3] arXiv:2509.20410 [cn-pdf, pdf, html, other]
Title: Phoenix-VAD: Streaming Semantic Endpoint Detection for Full-Duplex Speech Interaction
Weijie Wu, Wenhao Guan, Kaidi Wang, Peijie Chen, Zhuanling Zha, Junbo Li, Jun Fang, Lin Li, Qingyang Hong
Subjects: Audio and Speech Processing (eess.AS) ; Sound (cs.SD)

Spoken dialogue models have significantly advanced intelligent human-computer interaction, yet they lack a plug-and-play full-duplex prediction module for semantic endpoint detection, hindering seamless audio interactions. In this paper, we introduce Phoenix-VAD, an LLM-based model that enables streaming semantic endpoint detection. Specifically, Phoenix-VAD leverages the semantic comprehension capability of the LLM and a sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference. Experiments on both semantically complete and incomplete speech scenarios indicate that Phoenix-VAD achieves excellent and competitive performance. Furthermore, this design enables the full-duplex prediction module to be optimized independently of the dialogue model, providing more reliable and flexible support for next-generation human-computer interaction.

[4] arXiv:2509.20485 [cn-pdf, pdf, html, other]
Title: Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens
Ismail Rasim Ulgen, Zongyang Du, Junchen Lu, Philipp Koehn, Berrak Sisman
Comments: Under review for IEEE OJSP
Subjects: Audio and Speech Processing (eess.AS) ; Machine Learning (cs.LG) ; Sound (cs.SD)

Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measure of intelligibility, while F0-RMSE and related pitch-based metrics offer a narrow, reference-dependent view of prosody. To address these limitations, we propose TTScore, a targeted and reference-free evaluation framework based on conditional prediction of discrete speech tokens. TTScore employs two sequence-to-sequence predictors conditioned on input text: TTScore-int, which measures intelligibility through content tokens, and TTScore-pro, which evaluates prosody through prosody tokens. For each synthesized utterance, the predictors compute the likelihood of the corresponding token sequences, yielding interpretable scores that capture alignment with intended linguistic content and prosodic structure. Experiments on the SOMOS, VoiceMOS, and TTSArena benchmarks demonstrate that TTScore-int and TTScore-pro provide reliable, aspect-specific evaluation and achieve stronger correlations with human judgments of overall quality than existing intelligibility and prosody-focused metrics.
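
A minimal sketch of the scoring rule described above, assuming a HuggingFace-style encoder-decoder predictor whose logits at step t predict the t-th label token; predictor, text_ids, and speech_tokens are placeholders. The same routine applies to content tokens (TTScore-int) and prosody tokens (TTScore-pro).

    import torch
    import torch.nn.functional as F

    def conditional_token_score(predictor, text_ids, speech_tokens):
        """Reference-free score: average log-likelihood the text-conditioned
        predictor assigns to the discrete tokens of the synthesized utterance."""
        with torch.no_grad():
            logits = predictor(input_ids=text_ids, labels=speech_tokens).logits  # (1, T, V)
        logp = F.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, speech_tokens.unsqueeze(-1)).squeeze(-1)    # (1, T)
        return token_logp.mean().item()  # higher = better aligned with the intended content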

[5] arXiv:2509.20741 [cn-pdf, pdf, html, other]
Title: Real-Time System for Audio-Visual Target Speech Enhancement
T. Aleksandra Ma, Sile Yin, Li-Chia Yang, Shuo Zhang
Comments: Accepted into WASPAA 2025 demo session
Subjects: Audio and Speech Processing (eess.AS) ; Emerging Technologies (cs.ET) ; Machine Learning (cs.LG)

We present a live demonstration for RAVEN, a real-time audio-visual speech enhancement system designed to run entirely on a CPU. In single-channel, audio-only settings, speech enhancement is traditionally approached as the task of extracting clean speech from environmental noise. More recent work has explored the use of visual cues, such as lip movements, to improve robustness, particularly in the presence of interfering speakers. However, to our knowledge, no prior work has demonstrated an interactive system for real-time audio-visual speech enhancement operating on CPU hardware. RAVEN fills this gap by using pretrained visual embeddings from an audio-visual speech recognition model to encode lip movement information. The system generalizes across environmental noise, interfering speakers, transient sounds, and even singing voices. In this demonstration, attendees will be able to experience live audio-visual target speech enhancement using a microphone and webcam setup, with clean speech playback through headphones.

[6] arXiv:2509.20802 [cn-pdf, pdf, html, other]
Title: SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS
Tan Dat Nguyen, Jaehun Kim, Ji-Hoon Kim, Shukjae Choi, Youshin Lim, Joon Son Chung
Comments: Submitted to ICASSP 2026
Subjects: Audio and Speech Processing (eess.AS) ; Sound (cs.SD)

The goal of this paper is to introduce SPADE, a framework for Structured Pruning and Adaptive Distillation for Efficient Large Language Model-based text-to-speech (LLM-TTS). Recent LLM-TTS systems achieve strong controllability and zero-shot generalization, but their large parameter counts and high latency limit real-world deployment. SPADE addresses this by combining (i) a pruning step guided by a word-error-rate-based layer importance index to remove non-essential Transformer layers, with (ii) multi-level knowledge distillation to restore autoregressive coherence. On zero-shot benchmarks, SPADE preserves near-parity perceptual quality while halving Transformer depth, reducing VRAM usage by up to 20%, and achieving up to 1.7x faster real-time factor with less than 5% of the original training data. These results show that compact LLM-TTS models can maintain naturalness and speaker similarity while enabling practical real-time speech generation. Audio samples are available at https://mm.kaist.ac.kr/projects/SPADE/.
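
A rough sketch of a word-error-rate-based layer importance index, assuming each Transformer layer maps a hidden-state tensor to a tensor of the same shape and that eval_wer() synthesizes a small development set and returns its ASR word error rate; this illustrates the idea rather than the paper's exact index.

    import contextlib
    import torch.nn as nn

    @contextlib.contextmanager
    def bypass(layers: nn.ModuleList, idx: int):
        """Temporarily replace one layer with an identity mapping."""
        original = layers[idx]
        layers[idx] = nn.Identity()
        try:
            yield
        finally:
            layers[idx] = original

    def rank_layers_by_wer(layers, eval_wer, base_wer):
        """Importance of a layer = WER increase when the layer is skipped;
        the lowest-importance layers are pruned first."""
        importance = {}
        for i in range(len(layers)):
            with bypass(layers, i):
                importance[i] = eval_wer() - base_wer
        return sorted(importance, key=importance.get)  # least important first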

[7] arXiv:2509.20875 [cn-pdf, pdf, html, other]
Title: PAS-SE: Personalized Auxiliary-Sensor Speech Enhancement for Voice Pickup in Hearables
Mattes Ohlenbusch, Mikolaj Kegler, Marko Stamenovic
Comments: Submitted to ICASSP 2026
Subjects: Audio and Speech Processing (eess.AS)

Speech enhancement for voice pickup in hearables aims to improve the user's voice by suppressing noise and interfering talkers, while maintaining own-voice quality. For single-channel methods, it is particularly challenging to distinguish the target from interfering talkers without additional context. In this paper, we compare two strategies to resolve this ambiguity: personalized speech enhancement (PSE), which uses enrollment utterances to represent the target, and auxiliary-sensor speech enhancement (AS-SE), which uses in-ear microphones as additional input. We evaluate the strategies on two public datasets, employing different auxiliary sensor arrays, to investigate their cross-dataset generalization. We propose training-time augmentations to facilitate cross-dataset generalization of AS-SE systems. We also show that combining PSE and AS-SE (PAS-SE) provides complementary performance benefits, especially when enrollment speech is recorded with the in-ear microphone. We further demonstrate that PAS-SE personalized with noisy in-ear enrollments maintains performance benefits over the AS-SE system.

[8] arXiv:2509.21003 [cn-pdf, pdf, html, other]
Title: TF-Restormer: Complex Spectral Prediction for Speech Restoration
Ui-Hyeop Shin, Jaehyun Ko, Woocheol Jeong, Hyuing-Min Park
Comments: Preprint. Under review
Subjects: Audio and Speech Processing (eess.AS)

Speech restoration in real-world conditions is challenging due to compounded distortions such as clipping, band-pass filtering, digital artifacts, noise, reverberation, and low sampling rates. Existing systems, including vocoder-based approaches, often sacrifice signal fidelity, while diffusion models remain impractical for streaming. Moreover, most assume a fixed target sampling rate, requiring external resampling that leads to redundant computations. We present TF-Restormer, an encoder-decoder architecture that concentrates analysis on the input bandwidth with a time-frequency dual-path encoder and reconstructs missing high-frequency bands through a light decoder with frequency extension queries. It enables efficient and universal restoration across arbitrary input-output rates without redundant resampling. To support adversarial training across diverse rates, we introduce a shared sampling-frequency-independent (SFI) STFT discriminator. TF-Restormer further supports streaming with a causal time module, and improves robustness under extreme degradations by injecting spectral inductive bias into the frequency module. Finally, we propose a scaled log-spectral loss that stabilizes optimization under severe conditions while emphasizing well-predicted spectral details. As a single model across sampling rates, TF-Restormer consistently outperforms prior systems, achieving balanced gains in signal fidelity and perceptual quality, while its streaming mode maintains competitive effectiveness for real-time application. Code and demos are available at https://tf-restormer.github.io/demo.

[9] arXiv:2509.21060 [cn-pdf, pdf, html, other]
Title: Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models
Haolin He, Xingjian Du, Renhe Sun, Zheqi Dai, Yujia Xiao, Mingru Yang, Jiayi Zhou, Xiquan Li, Zhengxi Liu, Zining Liang, Chunyat Wu, Qianhua He, Tan Lee, Xie Chen, Weilong Zheng, Weiqiang Wang, Mark Plumbley, Jian Liu, Qiuqiang Kong
Subjects: Audio and Speech Processing (eess.AS)

Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-training such as reinforcement learning (RL) has demonstrated promising results, multi-stage approaches such as supervised fine-tuning (SFT) followed by RL remain suboptimal. The allocation of data across multiple training stages to maximize LALM capabilities has not been fully explored, and large-scale, high-quality datasets for such research are also lacking. To address these problems, we first present AudioMCQ, a comprehensive audio multiple-choice question dataset comprising 571k samples with two kinds of chain-of-thought annotations. Second, we investigate the prevalent zero audio-contribution phenomenon in LALMs, where models derive correct answers solely from textual information without processing audio content. We propose Audio-Contribution Filtering to partition data into weak and strong audio-contribution subsets. Based on these insights, we develop two effective post-training paradigms: Weak-to-Strong (SFT on weak audio-contribution data followed by RL on strong audio-contribution data) and Mixed-to-Strong (SFT on mixed audio-contribution data followed by RL on strong audio-contribution data). We achieve first place in the DCASE 2025 Audio-Question-Answering challenge by using AudioMCQ. Additionally, leveraging our dataset with different training strategies, we achieve 78.2% on MMAU-test-mini, 75.6% on MMAU, 67.1% on MMAR, and 70.7% on MMSU, establishing new state-of-the-art performance across these benchmarks.
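
One plausible instantiation of the audio-contribution filter (not necessarily the authors' criterion): a sample whose question the model answers correctly from text alone, across several sampled attempts, is assigned to the weak audio-contribution subset. answer_fn is a placeholder interface.

    def audio_contribution_split(samples, answer_fn, n_trials=4):
        """Partition multiple-choice samples by whether a model already answers
        them correctly from the question text alone (weak audio contribution)
        or needs the audio (strong audio contribution)."""
        weak, strong = [], []
        for s in samples:
            text_only_correct = sum(
                answer_fn(s, use_audio=False) == s["answer"] for _ in range(n_trials)
            )
            (weak if text_only_correct == n_trials else strong).append(s)
        return weak, strong  # e.g., SFT on weak or mixed data, RL on strong data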

[10] arXiv:2509.21087 [cn-pdf, pdf, html, other]
Title: Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?
Rostislav Makarov, Lea Schönherr, Timo Gerkmann
Comments: Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Subjects: Audio and Speech Processing (eess.AS) ; Machine Learning (cs.LG) ; Sound (cs.SD)

Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be susceptible to adversarial attacks. Specifically, we show that adversarial noise, carefully crafted and psychoacoustically masked by the original input, can be injected such that the enhanced speech output conveys an entirely different semantic meaning. We experimentally verify that contemporary predictive speech enhancement models can indeed be manipulated in this way. Furthermore, we highlight that diffusion models with stochastic samplers exhibit inherent robustness to such adversarial attacks by design.
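
A bare-bones sketch of the attack family studied here, assuming a differentiable PyTorch enhancer; the psychoacoustic masking constraint described in the abstract is replaced by a plain amplitude bound, so this shows the optimization loop only.

    import torch
    import torch.nn.functional as F

    def adversarial_perturbation(enhancer, noisy, target, eps=1e-3, steps=200, lr=1e-4):
        """Craft a small additive perturbation so that the enhanced output of
        noisy + delta moves toward an attacker-chosen target waveform."""
        delta = torch.zeros_like(noisy, requires_grad=True)
        opt = torch.optim.Adam([delta], lr=lr)
        for _ in range(steps):
            loss = F.mse_loss(enhancer(noisy + delta), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                delta.clamp_(-eps, eps)  # keep the perturbation small
        return (noisy + delta).detach()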

[11] arXiv:2509.21185 [cn-pdf, pdf, html, other]
Title: Hybrid Real- And Complex-Valued Neural Network Concept For Low-Complexity Phase-Aware Speech Enhancement
Luan Vinícius Fiorio, Alex Young, Ronald M. Aarts
Subjects: Audio and Speech Processing (eess.AS)

In this paper, we propose hybrid real- and complex-valued neural networks for speech enhancement. Real- or complex-valued models are either inefficient or present high complexity. We devise a straightforward design method for extending a real-valued network into its hybrid counterpart. Based on speech intelligibility and quality metrics, we compare the real, complex, and hybrid versions of a convolutional and a convolutional-recurrent architecture. The hybrid network consistently outperforms its counterparts with the same number of parameters. Additionally, the hybrid models' complexity in terms of multiply-accumulate operations is substantially lower than that of their counterparts.
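
The abstract does not detail which stages are real- and which complex-valued; the sketch below only shows the complex-valued building block (implemented with real tensors) that such a hybrid would mix with ordinary real layers.

    import torch.nn as nn

    class ComplexLinear(nn.Module):
        """Complex affine map built from two real weight matrices:
        (Wr + jWi)(xr + jxi) = (Wr xr - Wi xi) + j(Wr xi + Wi xr)."""
        def __init__(self, d_in, d_out):
            super().__init__()
            self.re = nn.Linear(d_in, d_out, bias=False)
            self.im = nn.Linear(d_in, d_out, bias=False)

        def forward(self, xr, xi):
            return self.re(xr) - self.im(xi), self.re(xi) + self.im(xr)

Each complex multiply costs roughly four real multiply-accumulates, which is one reason fully complex-valued models carry higher multiply-accumulate counts.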

[12] arXiv:2509.21214 [cn-pdf, pdf, html, other]
Title: MeanSE: Efficient Generative Speech Enhancement with Mean Flows
Jiahe Wang, Hongyu Wang, Wei Wang, Lei Yang, Chenda Li, Wangyou Zhang, Lufen Tan, Yanmin Qian
Comments: Submitted to ICASSP 2026
Subjects: Audio and Speech Processing (eess.AS)

Speech enhancement (SE) improves the quality of degraded speech, with generative models such as flow matching gaining attention for their outstanding perceptual quality. However, flow-based models require multiple function evaluations (NFEs) to achieve stable and satisfactory performance, leading to high computational load and poor 1-NFE performance. In this paper, we propose MeanSE, an efficient generative speech enhancement model using mean flows, which models the average velocity field to achieve high-quality 1-NFE enhancement. Experimental results demonstrate that our proposed MeanSE significantly outperforms the flow matching baseline with a single NFE and exhibits substantially better out-of-domain generalization capabilities.
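
For orientation, in the mean-flow formulation that MeanSE builds on, the network predicts the average velocity over an interval rather than the instantaneous velocity; in generic notation (not the paper's),

    u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau) \, d\tau,
    \qquad
    \hat{z}_r = z_t - (t - r) \, u_\theta(z_t, r, t),

so with r = 0 and t = 1 a single evaluation of u_\theta (one NFE) maps the noisy-speech-conditioned starting point to the clean-speech estimate.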

Cross submissions (showing 4 of 4 entries)

[13] arXiv:2509.20405 (cross-list from cs.CR) [cn-pdf, pdf, html, other]
Title: Why Speech Deepfake Detectors Won't Generalize: The Limits of Detection in an Open World
Visar Berisha, Prad Kadambi, Isabella Lenz
Subjects: Cryptography and Security (cs.CR) ; Sound (cs.SD) ; Audio and Speech Processing (eess.AS)

Speech deepfake detectors are often evaluated on clean, benchmark-style conditions, but deployment occurs in an open world of shifting devices, sampling rates, codecs, environments, and attack families. This creates a "coverage debt" for AI-based detectors: every new condition multiplies with existing ones, producing data blind spots that grow faster than data can be collected. Because attackers can target these uncovered regions, worst-case performance (not average benchmark scores) determines security. To demonstrate the impact of the coverage debt problem, we analyze results from a recent cross-testing framework. When performance is grouped by bona fide domain and spoof release year, two patterns emerge: newer synthesizers erase the legacy artifacts detectors rely on, and conversational speech domains (teleconferencing, interviews, social media) are consistently the hardest to secure. These findings show that detection alone should not be relied upon for high-stakes decisions. Detectors should be treated as auxiliary signals within layered defenses that include provenance, personhood credentials, and policy safeguards.

[14] arXiv:2509.20655 (cross-list from cs.CL) [cn-pdf, pdf, html, other]
Title: Building Tailored Speech Recognizers for Japanese Speaking Assessment
Yotaro Kubo, Richard Sproat, Chihiro Taguchi, Llion Jones
Subjects: Computation and Language (cs.CL) ; Sound (cs.SD) ; Audio and Speech Processing (eess.AS)

This paper presents methods for building speech recognizers tailored for Japanese speaking assessment tasks. Specifically, we build a speech recognizer that outputs phonemic labels with accent markers. Although Japanese is resource-rich, there is only a small amount of data for training models to produce accurate phonemic transcriptions that include accent marks. We propose two methods to mitigate data sparsity. First, a multitask training scheme introduces auxiliary loss functions to estimate orthographic text labels and pitch patterns of the input signal, so that utterances with only orthographic annotations can be leveraged in training. The second fuses two estimators, one over phonetic alphabet strings, and the other over text token sequences. To combine these estimates we develop an algorithm based on the finite-state transducer framework. Our results indicate that the use of multitask learning and fusion is effective for building an accurate phonemic recognizer. We show that this approach is advantageous compared to the use of generic multilingual recognizers. The relative advantages of the proposed methods were also compared. Our proposed methods reduced the average of mora-label error rates from 12.3% to 7.1% over the CSJ core evaluation sets.

[15] arXiv:2509.20706 (cross-list from cs.CL) [cn-pdf, pdf, html, other]
Title: MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model
Hsiao-Ying Huang, Yi-Cheng Lin, Hung-yi Lee
Comments: 5 pages, 2 figures, 2 tables
Subjects: Computation and Language (cs.CL) ; Sound (cs.SD) ; Audio and Speech Processing (eess.AS)

Large audio-language models (LALMs) show strong zero-shot ability on speech tasks, suggesting promise for speech emotion recognition (SER). However, SER in real-world deployments often fails under domain mismatch, where source data are unavailable and powerful LALMs are accessible only through an API. We ask: given only unlabeled target-domain audio and an API-only LALM, can a student model be adapted to outperform the LALM in the target domain? To this end, we propose MI-Fuse, a denoised label fusion framework that supplements the LALM with a source-domain trained SER classifier as an auxiliary teacher. The framework draws multiple stochastic predictions from both teachers, weights their mean distributions by mutual-information-based uncertainty, and stabilizes training with an exponential moving average teacher. Experiments across three public emotion datasets and six cross-domain transfers show consistent gains, with the student surpassing the LALM and outperforming the strongest baseline by 3.9%. This approach strengthens emotion-aware speech systems without sharing source data, enabling realistic adaptation.
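
A small sketch of the mutual-information-weighted fusion step, with the exact weighting function assumed (an inverse-uncertainty softmax) rather than taken from the paper; each argument holds K sampled class distributions from one teacher.

    import numpy as np

    def mi_fuse_labels(lalm_probs, ser_probs):
        """Fuse two teachers' stochastic predictions (each of shape (K, C)) into
        one soft label. MI = H(mean) - mean(H) measures a teacher's epistemic
        uncertainty; the less uncertain teacher receives the larger weight."""
        def entropy(p):
            return -(p * np.log(p + 1e-8)).sum(-1)
        def mutual_information(p):
            return entropy(p.mean(0)) - entropy(p).mean()
        mis = np.array([mutual_information(lalm_probs), mutual_information(ser_probs)])
        w = np.exp(-mis) / np.exp(-mis).sum()
        return w[0] * lalm_probs.mean(0) + w[1] * ser_probs.mean(0)

The fused soft labels then supervise the student, whose training the paper additionally stabilizes with an exponential-moving-average teacher.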

[16] arXiv:2509.20969 (cross-list from cs.SD) [cn-pdf, pdf, html, other]
Title: SingVERSE: A Diverse, Real-World Benchmark for Singing Voice Enhancement
Shaohan Jiang, Junan Zhang, Yunjia Zhang, Jing Yang, Fan Fan, Zhizheng Wu
Comments: Demopage: https://singverse.github.io, Dataset: https://huggingface.co/datasets/amphion/SingVERSE
Subjects: Sound (cs.SD) ; Audio and Speech Processing (eess.AS)

This paper presents a benchmark for singing voice enhancement. The development of singing voice enhancement is limited by the lack of realistic evaluation data. To address this gap, this paper introduces SingVERSE, the first real-world benchmark for singing voice enhancement, covering diverse acoustic scenarios and providing paired, studio-quality clean references. Leveraging SingVERSE, we conduct a comprehensive evaluation of state-of-the-art models and uncover a consistent trade-off between perceptual quality and intelligibility. Finally, we show that training on in-domain singing data substantially improves enhancement performance without degrading speech capabilities, establishing a simple yet effective path forward. This work offers the community a foundational benchmark together with critical insights to guide future advances in this underexplored domain. Demopage: https://singverse.github.io

Replacement submissions (showing 11 of 11 entries)

[17] arXiv:2503.04713 (replaced) [cn-pdf, pdf, html, other]
Title: Scaling Rich Style-Prompted Text-to-Speech Datasets
Anuj Diwan, Zhisheng Zheng, David Harwath, Eunsol Choi
Comments: EMNLP 2025
Subjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL) ; Machine Learning (cs.LG) ; Sound (cs.SD)

We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g. guttural, nasal, pained) have been explored in small-scale human-annotated datasets, existing large-scale datasets only cover basic tags (e.g. low-pitched, slow, loud). We combine off-the-shelf text and speech embedders, classifiers and an audio language model to automatically scale rich tag annotations for the first time. ParaSpeechCaps covers a total of 59 style tags, including both speaker-level intrinsic tags and utterance-level situational tags. It consists of 342 hours of human-labelled data (PSC-Base) and 2427 hours of automatically annotated data (PSC-Scaled). We finetune Parler-TTS, an open-source style-prompted TTS model, on ParaSpeechCaps, and achieve improved style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS) over the best performing baseline that combines existing rich style tag datasets. We ablate several of our dataset design choices to lay the foundation for future work in this space. Our dataset, models and code are released at https://github.com/ajd12342/paraspeechcaps .

[18] arXiv:2508.07282 (replaced) [cn-pdf, pdf, html, other]
Title: Lessons Learnt: Revisit Key Training Strategies for Effective Speech Emotion Recognition in the Wild
Jing-Tong Tzeng, Bo-Hao Su, Ya-Tse Wu, Hsing-Hang Chou, Chi-Chun Lee
Comments: Proceedings of Interspeech 2025
Subjects: Audio and Speech Processing (eess.AS)

In this study, we revisit key training strategies in machine learning often overlooked in favor of deeper architectures. Specifically, we explore balancing strategies, activation functions, and fine-tuning techniques to enhance speech emotion recognition (SER) in naturalistic conditions. Our findings show that simple modifications improve generalization with minimal architectural changes. Our multi-modal fusion model, integrating these optimizations, achieves a valence CCC of 0.6953, the best valence score in Task 2: Emotional Attribute Regression. Notably, fine-tuning RoBERTa and WavLM separately in a single-modality setting, followed by feature fusion without training the backbone extractor, yields the highest valence performance. Additionally, focal loss and activation functions significantly enhance performance without increasing complexity. These results suggest that refining core components, rather than deepening models, leads to more robust SER in-the-wild.
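
For reference, the standard multi-class focal loss that the abstract credits, in a minimal PyTorch form (the paper may use a task-specific variant):

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0):
        """Down-weights well-classified examples so training focuses on hard,
        often minority-class, examples."""
        logp = F.log_softmax(logits, dim=-1)
        logp_t = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log p of true class
        return -(((1.0 - logp_t.exp()) ** gamma) * logp_t).mean()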

[19] arXiv:2509.19881 (replaced) [cn-pdf, pdf, html, other]
Title: MAGE: A Coarse-to-Fine Speech Enhancer with Masked Generative Model
The Hieu Pham, Tan Dat Nguyen, Phuong Thanh Tran, Joon Son Chung, Duc Dung Nguyen
Comments: Submitted to ICASSP 2026
Subjects: Audio and Speech Processing (eess.AS) ; Sound (cs.SD)

Speech enhancement remains challenging due to the trade-off between efficiency and perceptual quality. In this paper, we introduce MAGE, a Masked Audio Generative Enhancer that advances generative speech enhancement through a compact and robust design. Unlike prior masked generative models with random masking, MAGE employs a scarcity-aware coarse-to-fine masking strategy that prioritizes frequent tokens in early steps and rare tokens in later refinements, improving efficiency and generalization. We also propose a lightweight corrector module that further stabilizes inference by detecting low-confidence predictions and re-masking them for refinement. Built on BigCodec and finetuned from Qwen2.5-0.5B, MAGE is reduced to 200M parameters through selective layer retention. Experiments on DNS Challenge and noisy LibriSpeech show that MAGE achieves state-of-the-art perceptual quality and significantly reduces word error rate for downstream recognition, outperforming larger baselines. Audio examples are available at https://hieugiaosu.github.io/MAGE/.

[20] arXiv:2509.19928 (replaced) [cn-pdf, pdf, html, other]
Title: Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration
Yifan Yang, Bing Han, Hui Wang, Long Zhou, Wei Wang, Mingyu Cui, Xu Tan, Xie Chen
Subjects: Audio and Speech Processing (eess.AS)

Prosody diversity is essential for achieving naturalness and expressiveness in zero-shot text-to-speech (TTS). However, frequently used acoustic metrics capture only partial views of prosodic variation and correlate poorly with human perception, leaving the problem of reliably quantifying prosody diversity underexplored. To bridge this gap, we introduce ProsodyEval, a prosody diversity assessment dataset that provides Prosody Mean Opinion Score (PMOS) alongside conventional acoustic metrics. ProsodyEval comprises 1000 speech samples derived from 7 mainstream TTS systems, with 2000 human ratings. Building on this, we propose the Discretized Speech Weighted Edit Distance (DS-WED), a new objective diversity metric that quantifies prosodic variation via weighted edit distance over semantic tokens. Experiments on ProsodyEval show that DS-WED achieves substantially higher correlation with human judgments than existing acoustic metrics, while remaining highly robust in speech tokenization from HuBERT and WavLM. Leveraging DS-WED, we benchmark state-of-the-art open-source TTS systems on LibriSpeech test-clean and Seed-TTS test-en, and further explorations uncover several factors that influence prosody diversity, including generative modeling paradigms, duration control, and reinforcement learning. Moreover, we find that current large audio language models (LALMs) remain limited in capturing prosodic variations. Audio samples are available at https://prosodyeval.github.io.
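
The abstract does not give the exact cost weighting, but DS-WED belongs to the weighted-edit-distance family over discrete token sequences; a generic member of that family looks like this, with sub_cost standing in for whatever token-level weighting is used.

    def weighted_edit_distance(a, b, sub_cost=lambda x, y: 0.0 if x == y else 1.0,
                               ins_cost=1.0, del_cost=1.0):
        """Weighted Levenshtein distance between two token sequences."""
        m, n = len(a), len(b)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = d[i - 1][0] + del_cost
        for j in range(1, n + 1):
            d[0][j] = d[0][j - 1] + ins_cost
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(d[i - 1][j] + del_cost,
                              d[i][j - 1] + ins_cost,
                              d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
        return d[m][n]

One natural way to turn this into a diversity score is to average the pairwise distances between the token sequences of several syntheses of the same text.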

[21] arXiv:2410.21876 (replaced) [cn-pdf, pdf, html, other]
Title: Application of Audio Fingerprinting Techniques for Real-Time Scalable Speech Retrieval and Speech Clusterization
Kemal Altwlkany, Sead Delalić, Adis Alihodžić, Elmedin Selmanović, Damir Hasić
Comments: Proceedings of the International Convention MIPRO
Subjects: Information Retrieval (cs.IR) ; Sound (cs.SD) ; Audio and Speech Processing (eess.AS)

Audio fingerprinting techniques have seen great advances in recent years, enabling accurate and fast audio retrieval even in conditions when the queried audio sample has been highly deteriorated or recorded in noisy conditions. Expectedly, most of the existing work is centered around music, with popular music identification services such as Apple's Shazam or Google's Now Playing designed for individual audio recognition on mobile devices. However, the spectral content of speech differs from that of music, necessitating modifications to current audio fingerprinting approaches. This paper offers fresh insights into adapting existing techniques to address the specialized challenge of speech retrieval in telecommunications and cloud communications platforms. The focus is on achieving rapid and accurate audio retrieval in batch processing instead of facilitating single requests, typically on a centralized server. Moreover, the paper demonstrates how this approach can be utilized to support audio clustering based on speech transcripts without undergoing actual speech-to-text conversion. This optimization enables significantly faster processing without the need for GPU computing, a requirement for real-time operation that is typically associated with state-of-the-art speech-to-text tools.

[22] arXiv:2412.15299 (replaced) [cn-pdf, pdf, html, other]
Title: LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration
Sangmin Lee, Woo-Jin Chung, Hong-Goo Kang
Comments: Accepted to AAAI 2025 (Oral Presentation)
Subjects: Computation and Language (cs.CL) ; Sound (cs.SD) ; Audio and Speech Processing (eess.AS)

Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has long been a challenge due to its inherent difficulties. To address this task, we introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). LAMA-UT operates without any language-specific modules while matching the performance of state-of-the-art models trained on a minimal amount of data. Our pipeline consists of two key steps. First, we utilize a universal transcription generator to unify orthographic features into Romanized form and capture common phonetic characteristics across diverse languages. Second, we utilize a universal converter to transform these universal transcriptions into language-specific ones. In experiments, we demonstrate the effectiveness of our proposed method leveraging universal transcriptions for massively multilingual ASR. Our pipeline achieves a relative error reduction rate of 45% when compared to Whisper and performs comparably to MMS, despite being trained on only 0.1% of Whisper's training data. Furthermore, our pipeline does not rely on any language-specific modules, yet it performs on par with zero-shot ASR approaches that utilize additional language-specific lexicons and language models. We expect this framework to serve as a cornerstone for flexible multilingual ASR systems that are generalizable even to unseen languages.

[23] arXiv:2506.02545 (replaced) [cn-pdf, pdf, html, other]
Title: On the Language and Gender Biases in PSTN, VoIP and Neural Audio Codecs
Kemal Altwlkany, Amar Kuric, Emanuel Lacic
Comments: Proceedings of Interspeech 2025
Subjects: Sound (cs.SD) ; Audio and Speech Processing (eess.AS)

In recent years, there has been a growing focus on fairness and inclusivity within speech technology, particularly in areas such as automatic speech recognition and speech sentiment analysis. When audio is transcoded prior to processing, as is the case in streaming or real-time applications, any inherent bias in the coding mechanism may result in disparities. This not only affects user experience but can also have broader societal implications by perpetuating stereotypes and exclusion. Thus, it is important that audio coding mechanisms are unbiased. In this work, we contribute towards the scarce research with respect to language and gender biases of audio codecs. By analyzing the speech quality of over 2 million multilingual audio files after transcoding through a representative subset of codecs (PSTN, VoIP and neural), our results indicate that PSTN codecs are strongly biased in terms of gender and that neural codecs introduce language biases.

[24] arXiv:2507.16080 (replaced) [cn-pdf, pdf, html, other]
Title: Interpretable Embeddings of Speech Enhance and Explain Brain Encoding Performance of Audio Models
Riki Shimizu, Richard J. Antonello, Chandan Singh, Nima Mesgarani
Comments: 19 pages, 5 figures
Subjects: Neurons and Cognition (q-bio.NC) ; Sound (cs.SD) ; Audio and Speech Processing (eess.AS)

Speech foundation models (SFMs) are increasingly hailed as powerful computational models of human speech perception. However, since their representations are inherently black-box, it remains unclear what drives their alignment with brain responses. To remedy this, we built linear encoding models from six interpretable feature families: mel-spectrogram, Gabor filter bank features, speech presence, phonetic, syntactic, and semantic features, and contextualized embeddings from three state-of-the-art SFMs (Whisper, HuBERT, WavLM), quantifying electrocorticography (ECoG) response variance shared between feature classes. Variance-partitioning analyses revealed several key insights: First, the SFMs' alignment with the brain can be mostly explained by their ability to learn and encode simple interpretable speech features. Second, SFMs exhibit a systematic trade-off between encoding of brain-relevant low-level and high-level features across layers. Finally, our results show that SFMs learn brain-relevant semantics which cannot be explained by lower-level speech features, with this capacity increasing with model size and context length. Together, our findings suggest a principled approach to build more interpretable, accurate, and efficient encoding models of the brain by augmenting SFM embeddings with interpretable features.

[25] arXiv:2509.14128 (replaced) [cn-pdf, pdf, html, other]
Title: Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST
Monica Sekoyan, Nithin Rao Koluguri, Nune Tadevosyan, Piotr Zelasko, Travis Bartley, Nikolay Karpov, Jagadeesh Balam, Boris Ginsburg
Comments: Mini version submitted to ICASSP 2026
Subjects: Computation and Language (cs.CL) ; Audio and Speech Processing (eess.AS)

This report introduces Canary-1B-v2, a fast, robust multilingual model for Automatic Speech Recognition (ASR) and Speech-to-Text Translation (AST). Built with a FastConformer encoder and Transformer decoder, it supports 25 languages, primarily European. The model was trained on a total of 1.7M hours of data, including Granary and NeMo ASR Set 3.0, with non-speech audio added to reduce hallucinations for ASR and AST. We describe its two-stage pre-training and fine-tuning process with dynamic data balancing, as well as experiments with an nGPT encoder. Results show nGPT scales well with massive data, while FastConformer excels after fine-tuning. For timestamps, Canary-1B-v2 uses the NeMo Forced Aligner (NFA) with an auxiliary CTC model, providing reliable segment-level timestamps for ASR and AST. Evaluations show Canary-1B-v2 outperforms Whisper-large-v3 on English ASR while being 10x faster, and delivers competitive multilingual ASR and AST performance against larger models like Seamless-M4T-v2-large and LLM-based systems. We also release Parakeet-TDT-0.6B-v3, a successor to v2, offering multilingual ASR across the same 25 languages with just 600M parameters.

[26] arXiv:2509.15362 (replaced) [cn-pdf, pdf, html, other]
Title: Speech Language Models for Under-Represented Languages: Insights from Wolof
Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina
Subjects: Computation and Language (cs.CL) ; Sound (cs.SD) ; Audio and Speech Processing (eess.AS)

We present our journey in training a speech language model for Wolof, an underrepresented language spoken in West Africa, and share key insights. We first emphasize the importance of collecting large-scale, spontaneous, high-quality unsupervised speech data, and show that continued pretraining HuBERT on this dataset outperforms both the base model and African-centric models on ASR. We then integrate this speech encoder into a Wolof LLM to train the first Speech LLM for this language, extending its capabilities to tasks such as speech translation. Furthermore, we explore training the Speech LLM to perform multi-step Chain-of-Thought before transcribing or translating. Our results show that the Speech LLM not only improves speech recognition but also performs well in speech translation. The models and the code will be openly shared.

[27] arXiv:2509.18196 (replaced) [cn-pdf, pdf, html, other]
Title: MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech
Jialong Mai, Jinxin Ji, Xiaofen Xing, Chen Yang, Weidong Chen, Jingyuan Xing, Xiangmin Xu
Comments: Official dataset available at: https://github.com/yongaifadian1/MNV-17. Submitted to ICASSP 2026
Subjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI) ; Audio and Speech Processing (eess.AS)

Mainstream Automatic Speech Recognition (ASR) systems excel at transcribing lexical content, but largely fail to recognize nonverbal vocalizations (NVs) embedded in speech, such as sighs, laughs, and coughs. This capability is important for a comprehensive understanding of human communication, as NVs convey crucial emotional and intentional cues. Progress in NV-aware ASR has been hindered by the lack of high-quality, well-annotated datasets. To address this gap, we introduce MNV-17, a 7.55-hour performative Mandarin speech dataset. Unlike most existing corpora that rely on model-based detection, MNV-17's performative nature ensures high-fidelity, clearly articulated NV instances. To the best of our knowledge, MNV-17 provides the most extensive set of nonverbal vocalization categories, comprising 17 distinct and well-balanced classes of common NVs. We benchmarked MNV-17 on four mainstream ASR architectures, evaluating their joint performance on semantic transcription and NV classification. The dataset and the pretrained model checkpoints will be made publicly available to facilitate future research in expressive ASR.
