Audio and Speech Processing (eess.AS)

  • New submissions
  • Cross-lists
  • Replacements

Showing new listings for Friday, 5 September 2025

Total of 22 entries

New submissions (showing 4 of 4 entries)

[1] arXiv:2509.03902 [pdf, html, other]
Title: Hierarchical Sparse Sound Field Reconstruction with Spherical and Linear Microphone Arrays
Shunxi Xu, Craig T. Jin
Comments: Accepted at APSIPA ASC 2025
Subjects: Audio and Speech Processing (eess.AS)

Spherical microphone arrays (SMAs) are widely used for sound field analysis, and sparse recovery (SR) techniques can significantly enhance their spatial resolution by modeling the sound field as a sparse superposition of dominant plane waves. However, the spatial resolution of SMAs is fundamentally limited by their spherical harmonic order, and their performance often degrades in reverberant environments. This paper proposes a two-stage SR framework with residue refinement that integrates observations from a central SMA and four surrounding linear microphone arrays (LMAs). The core idea is to exploit complementary spatial characteristics by treating the SMA as a primary estimator and the LMAs as a spatially complementary refiner. Simulation results demonstrate that the proposed SMA-LMA method significantly enhances spatial energy map reconstruction under varying reverberation conditions, compared to both SMA-only and direct one-step joint processing. These results demonstrate the effectiveness of the proposed framework in enhancing spatial fidelity and robustness in complex acoustic environments.
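
The abstract treats the SMA as the primary estimator and the LMAs as a residual refiner. As a rough, hypothetical illustration of that two-stage idea (not the authors' implementation), the sketch below runs a sparse plane-wave fit on toy, real-valued stand-ins for the SMA and LMA steering dictionaries and then fits a correction to the LMA residual; all names and sizes are made up.

```python
# Two-stage sparse recovery sketch: SMA estimate first, LMA residual refinement second.
# Real-valued toy dictionaries stand in for the (complex) steering matrices.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_dirs, n_sma, n_lma = 200, 32, 24              # candidate directions, mic counts (toy)
A_sma = rng.standard_normal((n_sma, n_dirs))    # hypothetical SMA plane-wave dictionary
A_lma = rng.standard_normal((n_lma, n_dirs))    # hypothetical LMA plane-wave dictionary

# Ground truth: a sparse superposition of three dominant plane waves
x_true = np.zeros(n_dirs)
x_true[rng.choice(n_dirs, 3, replace=False)] = rng.uniform(1.0, 2.0, 3)
y_sma = A_sma @ x_true + 0.01 * rng.standard_normal(n_sma)
y_lma = A_lma @ x_true + 0.01 * rng.standard_normal(n_lma)

# Stage 1: SMA as primary sparse estimator
x1 = Lasso(alpha=0.05, max_iter=10_000).fit(A_sma, y_sma).coef_

# Stage 2: LMAs refine whatever the SMA estimate fails to explain
r_lma = y_lma - A_lma @ x1
dx = Lasso(alpha=0.05, max_iter=10_000).fit(A_lma, r_lma).coef_
x2 = x1 + dx

print("stage-1 error:", np.linalg.norm(x1 - x_true))
print("stage-2 error:", np.linalg.norm(x2 - x_true))
```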

[2] arXiv:2509.04072 [pdf, html, other]
Title: LibriQuote: A Speech Dataset of Fictional Character Utterances for Expressive Zero-Shot Speech Synthesis
Gaspard Michel, Elena V. Epure, Christophe Cerisara
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Text-to-speech (TTS) systems have recently achieved more expressive and natural speech synthesis by scaling to large speech datasets. However, the proportion of expressive speech in such large-scale corpora is often unclear. Besides, existing expressive speech corpora are typically smaller in scale and primarily used for benchmarking TTS systems. In this paper, we introduce the LibriQuote dataset, an English corpus derived from read audiobooks, designed for both fine-tuning and benchmarking expressive zero-shot TTS systems. The training dataset includes 12.7K hours of read, non-expressive speech and 5.3K hours of mostly expressive speech drawn from character quotations. Each utterance in the expressive subset is supplemented with the context in which it was written, along with pseudo-labels of the speech verbs and adverbs used to describe the quotation (e.g. "he whispered softly"). Additionally, we provide a challenging 7.5-hour test set intended for benchmarking TTS systems: given a neutral reference speech as input, we evaluate a system's ability to synthesize an expressive utterance while preserving the reference timbre. We qualitatively validate the test set by showing that it covers a wide range of emotions compared to non-expressive speech, along with various accents. Extensive subjective and objective evaluations show that fine-tuning a baseline TTS system on LibriQuote significantly improves its synthesized speech intelligibility, and that recent systems fail to synthesize speech as expressive and natural as the ground-truth utterances. The dataset and evaluation code are freely available. Audio samples can be found at https://libriquote.github.io/.

[3] arXiv:2509.04280 [pdf, other]
Title: Test-Time Adaptation for Speech Enhancement via Domain Invariant Embedding Transformation
Tobias Raichle, Niels Edinger, Bin Yang
Comments: This work has been submitted to the IEEE for possible publication
Subjects: Audio and Speech Processing (eess.AS)

Deep learning-based speech enhancement models achieve remarkable performance when test distributions match training conditions, but often degrade when deployed in unpredictable real-world environments with domain shifts. To address this challenge, we present LaDen (latent denoising), the first test-time adaptation method specifically designed for speech enhancement. Our approach leverages powerful pre-trained speech representations to perform latent denoising, approximating clean speech representations through a linear transformation of noisy embeddings. We show that this transformation generalizes well across domains, enabling effective pseudo-labeling for target domains without labeled target data. The resulting pseudo-labels enable effective test-time adaptation of speech enhancement models across diverse acoustic environments. We propose a comprehensive benchmark spanning multiple datasets with various domain shifts, including changes in noise types, speaker characteristics, and languages. Our extensive experiments demonstrate that LaDen consistently outperforms baseline methods across perceptual metrics, particularly for speaker and language domain shifts.
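
The key mechanism described here, approximating clean speech representations through a linear transformation of noisy embeddings and using the result as pseudo-labels, can be sketched with a ridge-regularized least-squares fit. This is an assumption-laden toy, not the LaDen implementation; the embedding extractor, dimensions, and regularization are placeholders.

```python
# Fit a linear map W from noisy-speech embeddings to clean-speech embeddings on
# source-domain pairs, then apply W to unlabeled target-domain embeddings to
# produce pseudo-targets for test-time adaptation of an enhancement model.
import numpy as np

rng = np.random.default_rng(0)
d, n_src = 256, 500                                        # embedding dim, source pairs (toy)
E_clean = rng.standard_normal((n_src, d))                  # embeddings of clean source speech
E_noisy = E_clean + 0.3 * rng.standard_normal((n_src, d))  # paired noisy embeddings

# Ridge regression: W = argmin ||E_noisy W - E_clean||^2 + lam * ||W||^2
lam = 1e-2
W = np.linalg.solve(E_noisy.T @ E_noisy + lam * np.eye(d), E_noisy.T @ E_clean)

# Test time: pseudo-label embeddings of unlabeled target-domain noisy speech
E_target_noisy = rng.standard_normal((100, d))
E_pseudo_clean = E_target_noisy @ W        # used as adaptation targets for the enhancer
print(E_pseudo_clean.shape)
```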

[4] arXiv:2509.04390 [pdf, html, other]
Title: Accelerated Interactive Auralization of Highly Reverberant Spaces using Graphics Hardware
Hannes Rosseel, Toon van Waterschoot
Comments: 8 pages, 6 figures, submitted to the Journal of the Audio Engineering Society
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

Interactive acoustic auralization allows users to explore virtual acoustic environments in real-time, enabling the acoustic recreation of concert halls or Historical Worship Spaces (HWS) that are either no longer accessible, acoustically altered, or impractical to visit. Interactive acoustic synthesis requires real-time convolution of input signals with a set of synthesis filters that model the space-time acoustic response of the space. The acoustics of concert halls and HWS are both characterized by a long reverberation time, resulting in synthesis filters containing many filter taps. As a result, the convolution process can be computationally demanding, introducing significant latency that limits the real-time interactivity of the auralization system. In this paper, the implementation of a real-time multichannel loudspeaker-based auralization system is presented. This system is capable of synthesizing the acoustics of highly reverberant spaces in real-time using GPU acceleration. A comparison between traditional CPU-based convolution and GPU-accelerated convolution is presented, showing that the latter can achieve real-time performance with significantly lower latency. Additionally, the system integrates acoustic synthesis with acoustic feedback cancellation on the GPU, creating a unified loudspeaker-based auralization framework that minimizes processing latency.
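
The latency argument hinges on convolving input audio with very long synthesis filters. The sketch below shows plain FFT-based convolution of a signal with a multi-second response; it is not the paper's partitioned, low-latency GPU pipeline, but swapping numpy for cupy (whose fft module mirrors the NumPy API) moves the same computation onto the GPU. All sizes are toy values.

```python
# Block FFT convolution of an input signal with a long, decaying "room response".
import numpy as np

fs = 48_000
x = np.random.randn(fs)                                          # 1 s of input audio (toy)
h = np.random.randn(4 * fs) * np.exp(-np.arange(4 * fs) / fs)    # 4 s synthetic decaying response

def fft_convolve(x, h):
    """Linear convolution via zero-padded FFTs: O(N log N) instead of O(N*M)."""
    n = len(x) + len(h) - 1
    nfft = 1 << (n - 1).bit_length()          # next power of two
    X = np.fft.rfft(x, nfft)
    H = np.fft.rfft(h, nfft)
    return np.fft.irfft(X * H, nfft)[:n]

y = fft_convolve(x, h)
print(y.shape)                                # (len(x) + len(h) - 1,)
```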

Cross submissions (showing 4 of 4 entries)

[5] arXiv:2509.03525 (cross-list from cs.CL) [pdf, other]
Title: Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies
Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad, Sepehr Karimi, Sina Rashidi, Ali Zolnour, Maryam Dadkhah, Yasaman Haghbin, Hossein AzadMaleki, Maryam Zolnoori
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Over half of US adults with Alzheimer disease and related dementias remain undiagnosed, and speech-based screening offers a scalable detection approach. We compared large language model adaptation strategies for dementia detection using the DementiaBank speech corpus, evaluating nine text-only models and three multimodal audio-text models on its recordings. Adaptations included in-context learning with different demonstration selection policies, reasoning-augmented prompting, parameter-efficient fine-tuning, and multimodal integration. Results showed that class-centroid demonstrations achieved the highest in-context learning performance, reasoning improved smaller models, and token-level fine-tuning generally produced the best scores. Adding a classification head substantially improved underperforming models. Among multimodal models, fine-tuned audio-text systems performed well but did not surpass the top text-only models. These findings highlight that model adaptation strategies, including demonstration selection, reasoning design, and tuning method, critically influence speech-based dementia detection, and that properly adapted open-weight models can match or exceed commercial systems.
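
The class-centroid demonstration selection reported as the strongest in-context learning policy can be sketched generically: embed the labeled examples, compute per-class centroids, and keep the examples nearest to each centroid as prompt demonstrations. The embeddings and sizes below are hypothetical stand-ins; this is not the paper's pipeline.

```python
# Select in-context demonstrations closest to each class centroid.
import numpy as np

def class_centroid_demos(embeddings, labels, k_per_class=2):
    """Return indices of the k examples nearest to each class centroid."""
    demos = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = embeddings[idx].mean(axis=0)
        dists = np.linalg.norm(embeddings[idx] - centroid, axis=1)
        demos.extend(idx[np.argsort(dists)[:k_per_class]].tolist())
    return demos

rng = np.random.default_rng(0)
emb = rng.standard_normal((50, 384))      # e.g. sentence-embedding vectors of transcripts (toy)
lab = rng.integers(0, 2, 50)              # 0 = control, 1 = dementia
print(class_centroid_demos(emb, lab, k_per_class=2))
```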

[6] arXiv:2509.03526 (cross-list from cs.CL) [pdf, html, other]
Title: Enhancing Speech Large Language Models through Reinforced Behavior Alignment
Yansong Liu, Jiateng Li, Yuan Liu
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

The recent advancements of Large Language Models (LLMs) have spurred considerable research interest in extending their linguistic capabilities beyond text to other modalities, leading to the emergence of speech-based LLMs (SpeechLMs) capable of processing user requests in either speech or text format. However, owing to inter-modal discrepancies, these SpeechLMs still exhibit a significant performance gap in instruction-following compared to their text-based LLM counterparts, particularly when confronted with the dynamic and variable nature of user speech. To address this challenge, this paper introduces a framework termed Reinforced Behavior Alignment (RBA), designed to bolster the language generation proficiency of SpeechLMs. Instead of relying on supervised fine-tuning from human annotations, RBA employs a self-synthesis methodology to generate extensive, high-fidelity alignment data with a powerful teacher LLM. The SpeechLM's behavior is then aligned with that of the teacher using a reinforcement learning-based approach. Experimental results demonstrate that this method effectively enhances the instruction-following capabilities of SpeechLMs, outperforming conventional distillation baselines. Crucially, we demonstrate that RBA extends seamlessly to tasks including spoken question answering and speech-to-text translation, attaining state-of-the-art performance on open benchmarks with only self-generated data.

[7] arXiv:2509.03529 (cross-list from cs.CL) [pdf, html, other]
Title: Multimodal Proposal for an AI-Based Tool to Increase Cross-Assessment of Messages
Alejandro Álvarez Castro, Joaquín Ordieres-Meré
Comments: Published in NLMLT2025 (https://airccse.org/csit/V15N16.html), 15 pages, 5 figures
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Earnings calls represent a uniquely rich and semi-structured source of financial communication, blending scripted managerial commentary with unscripted analyst dialogue. Although recent advances in financial sentiment analysis have integrated multi-modal signals, such as textual content and vocal tone, most systems rely on flat document-level or sentence-level models, failing to capture the layered discourse structure of these interactions. This paper introduces a novel multi-modal framework designed to generate semantically rich and structurally aware embeddings of earnings calls, by encoding them as hierarchical discourse trees. Each node, comprising either a monologue or a question-answer pair, is enriched with emotional signals derived from text, audio, and video, as well as structured metadata including coherence scores, topic labels, and answer coverage assessments. A two-stage transformer architecture is proposed: the first encodes multi-modal content and discourse metadata at the node level using contrastive learning, while the second synthesizes a global embedding for the entire conference. Experimental results reveal that the resulting embeddings form stable, semantically meaningful representations that reflect affective tone, structural logic, and thematic alignment. Beyond financial reporting, the proposed system generalizes to other high-stakes unscripted communicative domains such as tele-medicine, education, and political discourse, offering a robust and explainable approach to multi-modal discourse representation. This approach offers practical utility for downstream tasks such as financial forecasting and discourse evaluation, while also providing a generalizable method applicable to other domains involving high-stakes communication.

[8] arXiv:2509.03913 (cross-list from cs.SD) [pdf, html, other]
Title: SwinSRGAN: Swin Transformer-based Generative Adversarial Network for High-Fidelity Speech Super-Resolution
Jiajun Yuan, Xiaochen Wang, Yuhang Xiao, Yulin Wu, Chenhao Hu, Xueyang Lv
Comments: 5 pages
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Speech super-resolution (SR) reconstructs high-frequency content from low-resolution speech signals. Existing systems often suffer from representation mismatch in two-stage mel-vocoder pipelines and from over-smoothing of hallucinated high-band content by CNN-only generators. Diffusion and flow models are computationally expensive, and their robustness across domains and sampling rates remains limited. We propose SwinSRGAN, an end-to-end framework operating on Modified Discrete Cosine Transform (MDCT) magnitudes. It is a Swin Transformer-based U-Net that captures long-range spectro-temporal dependencies, trained with a hybrid adversarial scheme that combines time-domain MPD/MSD discriminators with a multi-band MDCT discriminator specialized for the high-frequency band. We employ a sparse-aware regularizer on arcsinh-compressed MDCT to better preserve transient components. The system upsamples inputs at various sampling rates to 48 kHz in a single pass and operates in real time. On standard benchmarks, SwinSRGAN reduces objective error and improves ABX preference scores. In zero-shot tests on HiFi-TTS without fine-tuning, it outperforms NVSR and mdctGAN, demonstrating strong generalization across datasets.
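
A minimal sketch of the sparse-aware regularizer on arcsinh-compressed magnitudes, under the assumption that an STFT magnitude stands in for the MDCT (PyTorch has no built-in MDCT); the loss weighting and the full adversarial objectives are not reproduced.

```python
# L1 reconstruction plus L1 sparsity penalty on arcsinh-compressed spectral magnitudes.
import torch

def arcsinh_sparse_loss(pred_wave, target_wave, n_fft=1024, hop=256, weight=0.1):
    window = torch.hann_window(n_fft, device=pred_wave.device)
    P = torch.stft(pred_wave, n_fft, hop, window=window, return_complex=True).abs()
    T = torch.stft(target_wave, n_fft, hop, window=window, return_complex=True).abs()
    P_c, T_c = torch.asinh(P), torch.asinh(T)       # arcsinh compression
    recon = torch.mean(torch.abs(P_c - T_c))        # spectral reconstruction term
    sparse = torch.mean(torch.abs(P_c))             # sparse-aware regularizer (L1)
    return recon + weight * sparse

pred = torch.randn(2, 48_000, requires_grad=True)   # generator output (toy)
target = torch.randn(2, 48_000)                     # 48 kHz target (toy)
loss = arcsinh_sparse_loss(pred, target)
loss.backward()
print(float(loss))
```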

Replacement submissions (showing 14 of 14 entries)

[9] arXiv:2410.07982 (replaced) [pdf, html, other]
Title: Window Function-less DFT with Reduced Noise and Latency for Real-Time Music Analysis
Cai Biesinger, Hiromitsu Awano, Masanori Hashimoto
Comments: 5 pages, 4 figures, final version accepted at EUSIPCO 2025. PDF generated with TeX due to arXiv formatting issues. This version: clearer wording throughout, the final figure replaced with a more representative one, and the bin index k added to several equations
Subjects: Audio and Speech Processing (eess.AS)

Music analysis applications demand algorithms that can provide both high time and frequency resolution while minimizing noise in an already-noisy signal. Real-time analysis additionally demands low latency and low computational requirements. We propose a DFT-based algorithm that accomplishes all these requirements by extending a method that post-processes DFT output without the use of window functions. Our approach yields greatly reduced sidelobes and noise, and improves time resolution without sacrificing frequency resolution. We use exponentially spaced output bins which directly map to notes in music. The resulting improved performance, compared to existing FFT and DFT-based approaches, creates possibilities for improved real-time visualizations, and contributes to improved analysis quality in other applications such as automatic transcription.
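
The exponentially spaced output bins that map directly to musical notes can be illustrated by projecting a signal onto complex exponentials at equal-tempered note frequencies, f_k = 440 * 2**((k - 69) / 12) Hz for MIDI note k. This shows only the bin layout, not the paper's window-less post-processing; all parameters are hypothetical.

```python
# Evaluate DFT-like coefficients directly at equal-tempered note frequencies.
import numpy as np

def note_bins(x, fs, midi_lo=21, midi_hi=108):
    """Project a signal onto complex exponentials at note frequencies (A0..C8)."""
    n = np.arange(len(x))
    midi = np.arange(midi_lo, midi_hi + 1)
    freqs = 440.0 * 2.0 ** ((midi - 69) / 12.0)
    basis = np.exp(-2j * np.pi * np.outer(freqs, n) / fs)
    return freqs, basis @ x / len(x)

fs = 48_000
t = np.arange(fs // 4) / fs                     # 250 ms of signal
x = np.sin(2 * np.pi * 440.0 * t)               # A4 test tone
freqs, coeffs = note_bins(x, fs)
print(freqs[np.argmax(np.abs(coeffs))])         # ~440.0 Hz
```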

[10] arXiv:2411.14013 (replaced) [pdf, html, other]
Title: Exposing Synthetic Speech: Model Attribution and Detection of AI-generated Speech via Audio Fingerprints
Matías Pizarro, Mike Laszkiewicz, Shawkat Hesso, Dorothea Kolossa, Asja Fischer
Subjects: Audio and Speech Processing (eess.AS); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

As speech generation technologies continue to advance in quality and accessibility, the risk of malicious use cases, including impersonation, misinformation, and spoofing, increases rapidly. This work addresses this threat by introducing a simple, training-free, yet effective approach for detecting AI-generated speech and attributing it to its source model. Specifically, we tackle three key tasks: (1) single-model attribution in an open-world setting, where the goal is to determine whether a given audio sample was generated by a specific target neural speech synthesis system (with access only to data from that system); (2) multi-model attribution in a closed-world setting, where the objective is to identify the generating system from a known pool of candidates; and last but not least (3) detection of synthetic versus real speech. Our approach leverages standardized average residuals: the difference between an input audio signal and its version filtered with either a low-pass filter or the EnCodec audio autoencoder. We demonstrate that these residuals consistently capture artifacts introduced by diverse speech synthesis systems, serving as distinctive, model-agnostic fingerprints for attribution. Across extensive experiments, our approach achieves AUROC scores exceeding 99% in most scenarios, evaluated on augmented benchmark datasets that pair real speech with synthetic audio generated by multiple synthesis systems. In addition, our robustness analysis underscores the method's ability to maintain high performance even in the presence of moderate additive noise. Due to its simplicity, efficiency, and strong generalization across speech synthesis systems and languages, this technique offers a practical tool for digital forensics and security applications.
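
A minimal sketch of the low-pass-filter variant of such a fingerprint (the EnCodec variant and the paper's exact standardization are not reproduced): subtract a low-pass-filtered copy from the waveform, average the residual's magnitude spectrum over frames, and standardize. Cutoff, FFT size, and hop are placeholder values.

```python
# Residual-based fingerprint: standardized average spectrum of (wave - lowpass(wave)).
import numpy as np
from scipy.signal import butter, sosfiltfilt

def residual_fingerprint(wave, fs, cutoff_hz=4000, n_fft=1024, hop=256):
    sos = butter(8, cutoff_hz, btype="low", fs=fs, output="sos")
    residual = wave - sosfiltfilt(sos, wave)            # high-frequency artifacts
    frames = np.lib.stride_tricks.sliding_window_view(residual, n_fft)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1)).mean(axis=0)
    return (spec - spec.mean()) / (spec.std() + 1e-8)   # standardized average residual spectrum

rng = np.random.default_rng(0)
fp = residual_fingerprint(rng.standard_normal(48_000), fs=16_000)
print(fp.shape)                                          # (n_fft // 2 + 1,)
```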

[11] arXiv:2506.07634 (replaced) [pdf, html, other]
Title: SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement
Chenyu Yang, Shuai Wang, Hangting Chen, Wei Tan, Jianwei Yu, Haizhou Li
Comments: Submitted to NeurIPS 2025
Subjects: Audio and Speech Processing (eess.AS); Multimedia (cs.MM)

Generating music with coherent structure and harmonious instrumental and vocal elements remains a significant challenge in song generation. Existing language models and diffusion-based methods often struggle to balance global coherence with local fidelity, resulting in outputs that lack musicality or suffer from incoherent progression and mismatched lyrics. This paper introduces SongBloom, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. SongBloom employs an autoregressive diffusion model that combines the high fidelity of diffusion models with the scalability of language models. Specifically, it gradually extends a musical sketch from short to long and refines the details from coarse to fine-grained. The interleaved generation paradigm effectively integrates prior semantic and acoustic context to guide the generation process. Experimental results demonstrate that SongBloom outperforms existing methods across both subjective and objective metrics and achieves performance comparable to state-of-the-art commercial music generation platforms. Audio samples are available on our demo page: https://cypress-yang.github.io/SongBloom_demo. The code and model weights have been released at https://github.com/Cypress-Yang/SongBloom.

[12] arXiv:2507.23266 (replaced) [pdf, html, other]
Title: CUHK-EE Systems for the vTAD Challenge at NCMMSC 2025
Aemon Yat Fei Chiu, Jingyu Li, Yusheng Tian, Guangyan Zhang, Tan Lee
Comments: Accepted at the 20th National Conference on Human-Computer Speech Communication (NCMMSC 2025)
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

This paper presents the Voice Timbre Attribute Detection (vTAD) systems developed by the Digital Signal Processing & Speech Technology Laboratory (DSP&STL) of the Department of Electronic Engineering (EE) at The Chinese University of Hong Kong (CUHK) for the 20th National Conference on Human-Computer Speech Communication (NCMMSC 2025) vTAD Challenge. The proposed systems leverage WavLM-Large embeddings with attentive statistical pooling (ASTP) to extract robust speaker representations, followed by two variants of Diff-Net, i.e., Feed-Forward Neural Network (FFN) and Squeeze-and-Excitation-enhanced Residual FFN (SE-ResFFN), to compare timbre attribute intensities between utterance pairs. Experimental results demonstrate that the WavLM-Large+FFN system generalises better to unseen speakers, achieving 77.96% accuracy and 21.79% equal error rate (EER), while the WavLM-Large+SE-ResFFN model excels in the 'Seen' setting with 94.42% accuracy and 5.49% EER. These findings highlight a trade-off between model complexity and generalisation, and underscore the importance of architectural choices in fine-grained speaker modelling. Our analysis also reveals the impact of speaker identity, annotation subjectivity, and data imbalance on system performance, pointing to future directions for improving robustness and fairness in timbre attribute detection.
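
Attentive statistical pooling can be sketched generically as attention-weighted mean and standard deviation over frame-level features; the layer sizes and the exact attention form used in the submitted systems are assumptions here, not a reproduction of them.

```python
# Generic attentive statistics pooling over frame-level speech features.
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    def __init__(self, feat_dim, attn_dim=128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1)
        )

    def forward(self, x):                       # x: (batch, frames, feat_dim)
        w = torch.softmax(self.attn(x), dim=1)  # per-frame attention weights
        mean = (w * x).sum(dim=1)
        var = (w * (x - mean.unsqueeze(1)) ** 2).sum(dim=1)
        std = torch.sqrt(var.clamp_min(1e-8))
        return torch.cat([mean, std], dim=-1)   # (batch, 2 * feat_dim)

frames = torch.randn(4, 300, 1024)              # e.g. WavLM-Large frame features (toy)
print(AttentiveStatsPooling(1024)(frames).shape)   # torch.Size([4, 2048])
```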

[13] arXiv:2508.08715 (replaced) [pdf, html, other]
Title: MultiGen: Child-Friendly Multilingual Speech Generator with LLMs
Xiaoxue Gao, Huayun Zhang, Nancy F. Chen
Comments: 5 pages
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Signal Processing (eess.SP)

Generative speech models have demonstrated significant potential in improving human-machine interactions, offering valuable real-world applications such as language learning for children. However, achieving high-quality, child-friendly speech generation remains challenging, particularly for low-resource languages across diverse linguistic and cultural contexts. In this paper, we propose MultiGen, a multilingual speech generation model with child-friendly interaction, leveraging an LLM architecture for speech generation tailored to low-resource languages. MultiGen integrates age-appropriate multilingual speech generation, which can be used to facilitate young children's communication with AI systems through culturally relevant context in three low-resource languages: Singaporean-accented Mandarin, Malay, and Tamil. Experimental results from both objective metrics and subjective evaluations demonstrate the superior performance of the proposed MultiGen compared to baseline methods.

[14] arXiv:2509.01889 (replaced) [pdf, html, other]
Title: From Evaluation to Optimization: Neural Speech Assessment for Downstream Applications
Yu Tsao
Comments: 5 pages, 1 figure
Subjects: Audio and Speech Processing (eess.AS)

The evaluation of synthetic and processed speech has long been a cornerstone of audio engineering and speech science. Although subjective listening tests remain the gold standard for assessing perceptual quality and intelligibility, their high cost, time requirements, and limited scalability present significant challenges in the rapid development cycles of modern speech technologies. Traditional objective metrics, while computationally efficient, often rely on a clean reference signal, making them intrusive approaches. This presents a major limitation, as clean signals are often unavailable in real-world applications. In recent years, numerous neural network-based speech assessment models have been developed to predict quality and intelligibility, achieving promising results. Beyond their role in evaluation, these models are increasingly integrated into downstream speech processing tasks. This review focuses on their role in two main areas: (1) serving as differentiable perceptual proxies that not only assess but also guide the optimization of speech enhancement and synthesis models; and (2) enabling the detection of salient speech characteristics to support more precise and efficient downstream processing. Finally, we discuss current limitations and outline future research directions to further advance the integration of speech assessment into speech processing pipelines.
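
The first role described above, a frozen assessment model acting as a differentiable perceptual proxy, can be sketched as follows. The quality predictor here is a random placeholder network rather than an actual assessment model, and the toy enhancer, frame size, and loss weighting are assumptions.

```python
# Use a frozen quality predictor as a differentiable proxy loss for an enhancement model.
import torch
import torch.nn as nn

frame = 1_024
quality_net = nn.Sequential(nn.Linear(frame, 64), nn.ReLU(), nn.Linear(64, 1))
for p in quality_net.parameters():
    p.requires_grad_(False)                              # proxy stays frozen

enhancer = nn.Sequential(nn.Linear(frame, frame))        # toy "enhancement" model
opt = torch.optim.Adam(enhancer.parameters(), lr=1e-4)

noisy = torch.randn(8, frame)
clean = torch.randn(8, frame)

enhanced = enhancer(noisy)
signal_loss = torch.mean((enhanced - clean) ** 2)        # conventional signal-level loss
proxy_loss = -quality_net(enhanced).mean()               # push predicted quality upward
loss = signal_loss + 0.1 * proxy_loss
loss.backward()
opt.step()
print(float(loss))
```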

[15] arXiv:2509.03013 (replaced) [pdf, html, other]
Title: Speech Intelligibility Assessment with Uncertainty-Aware Whisper Embeddings and sLSTM
Ryandhimas E. Zezario, Dyah A.M.G. Wisnu, Hsin-Min Wang, Yu Tsao
Comments: Accepted at APSIPA ASC 2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

Non-intrusive speech intelligibility prediction remains challenging due to variability in speakers, noise conditions, and subjective perception. We propose an uncertainty-aware approach that leverages Whisper embeddings in combination with statistical features, specifically the mean, standard deviation, and entropy computed across the embedding dimensions. The entropy, computed via a softmax over the feature dimension, serves as a proxy for uncertainty, complementing global information captured by the mean and standard deviation. To model the sequential structure of speech, we adopt a scalar long short-term memory (sLSTM) network, which efficiently captures long-range dependencies. Building on this foundation, we propose iMTI-Net, an improved multi-target intelligibility prediction network that integrates convolutional neural network (CNN) and sLSTM components within a multitask learning framework. It jointly predicts human intelligibility scores and machine-based word error rates (WER) from Google ASR and Whisper. Experimental results show that iMTI-Net outperforms the original MTI-Net across multiple evaluation metrics, demonstrating the effectiveness of incorporating uncertainty-aware features and the CNN-sLSTM architecture.
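
The uncertainty-aware features, mean, standard deviation, and softmax entropy computed across the embedding dimensions, are straightforward to sketch per frame; the embeddings below are random stand-ins for Whisper encoder outputs, and the downstream sLSTM predictor is not reproduced.

```python
# Per-frame mean, standard deviation, and softmax entropy over embedding dimensions.
import torch

def uncertainty_features(emb):                 # emb: (frames, feat_dim)
    mean = emb.mean(dim=-1)
    std = emb.std(dim=-1)
    p = torch.softmax(emb, dim=-1)             # softmax over the feature dimension
    entropy = -(p * torch.log(p + 1e-8)).sum(dim=-1)   # uncertainty proxy
    return torch.stack([mean, std, entropy], dim=-1)   # (frames, 3)

emb = torch.randn(1500, 1280)                  # e.g. Whisper-large frame embeddings (toy)
print(uncertainty_features(emb).shape)         # torch.Size([1500, 3])
```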

[16] arXiv:2306.17477 (replaced) [pdf, html, other]
Title: Beyond-Voice: Towards Continuous 3D Hand Pose Tracking on Commercial Home Assistant Devices
Yin Li, Rohan Reddy, Cheng Zhang, Rajalakshmi Nandakumar
Comments: Accepted at IPSN 2024
Subjects: Sound (cs.SD); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)

The surging popularity of home assistants and their voice user interface (VUI) have made them an ideal central control hub for smart home devices. However, current form factors heavily rely on VUI, which poses accessibility and usability issues; some latest ones are equipped with additional cameras and displays, which are costly and raise privacy concerns. These concerns jointly motivate Beyond-Voice, a novel high-fidelity acoustic sensing system that allows commodity home assistant devices to track and reconstruct hand poses continuously. It transforms the home assistant into an active sonar system using its existing onboard microphones and speakers. We feed a high-resolution range profile to the deep learning model that can analyze the motions of multiple body parts and predict the 3D positions of 21 finger joints, bringing the granularity for acoustic hand tracking to the next level. It operates across different environments and users without the need for personalized training data. A user study with 11 participants in 3 different environments shows that Beyond-Voice can track joints with an average mean absolute error of 16.47mm without any training data provided by the testing subject.

[17] arXiv:2403.10796 (replaced) [pdf, html, other]
Title: CoPlay: Audio-agnostic Cognitive Scaling for Acoustic Sensing
Yin Li, Bo Liu, Rajalakshmi Nanadakumar
Comments: ICCCN'25
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Acoustic sensing shows great potential in applications such as health monitoring, gesture interfaces, and imaging by leveraging the speakers and microphones on smart devices. However, in ongoing research and development in acoustic sensing, one problem is often overlooked: the same speaker, when used concurrently for sensing and other traditional applications (like playing music), causes interference in both, making the approach impractical in the real world. The strong ultrasonic sensing signals mixed with music would overload the speaker's mixer. To confront this issue of overloaded signals, current solutions are clipping or down-scaling, both of which affect the music playback quality as well as the sensing range and accuracy. To address this challenge, we propose CoPlay, a deep learning based optimization algorithm to cognitively adapt the sensing signal. It can 1) maximize the sensing signal magnitude within the available bandwidth left by the concurrent music to optimize sensing range and accuracy and 2) minimize any consequential frequency distortion that can affect music playback. In this work, we design a deep learning model and test it on common types of sensing signals (sine wave or Frequency Modulated Continuous Wave, FMCW) as inputs, with various agnostic concurrent music and speech. First, we evaluated the model performance to show the quality of the generated signals. Then we conducted field studies of downstream acoustic sensing tasks in the real world. A study with 12 users showed that respiration monitoring and gesture recognition using our adapted signal achieve accuracy similar to no-concurrent-music scenarios, while clipping or down-scaling yields worse accuracy. A qualitative study also showed that the music playback quality is not degraded, unlike with traditional clipping or down-scaling methods.

[18] arXiv:2504.09885 (replaced) [pdf, html, other]
Title: Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis
Zihao Liu, Mingwen Ou, Zunnan Xu, Jiaqi Huang, Haonan Han, Ronghui Li, Xiu Li
Comments: 15 pages, 7 figures, accepted at ACM MM 2025
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)

Automating the synthesis of coordinated bimanual piano performances poses significant challenges, particularly in capturing the intricate choreography between the hands while preserving their distinct kinematic signatures. In this paper, we propose a dual-stream neural framework designed to generate synchronized hand gestures for piano playing from audio input, addressing the critical challenge of modeling both hand independence and coordination. Our framework introduces two key innovations: (i) a decoupled diffusion-based generation framework that independently models each hand's motion via dual-noise initialization, sampling distinct latent noise for each hand while leveraging a shared positional condition, and (ii) a Hand-Coordinated Asymmetric Attention (HCAA) mechanism that suppresses symmetric (common-mode) noise to highlight asymmetric, hand-specific features while adaptively enhancing inter-hand coordination during denoising. Comprehensive evaluations demonstrate that our framework outperforms existing state-of-the-art methods across multiple metrics. Our project is available at https://monkek123king.github.io/S2C_page/.

[19] arXiv:2506.08570 (replaced) [pdf, html, other]
Title: Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
Or Tal, Felix Kreuk, Yossi Adi
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments, full compositions, and even respond to fine-grained control signals, e.g. chord progressions. State-of-the-art (SOTA) systems differ significantly in many dimensions, such as training datasets, modeling paradigms, and architectural choices. This diversity complicates efforts to evaluate models fairly and identify which design choices influence performance the most. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm. We conduct a systematic empirical analysis to isolate its effects, offering insights into associated trade-offs and emergent behaviors that can guide future text-to-music generation systems. Specifically, we compare arguably the two most common modeling paradigms: auto-regressive decoding and conditional flow-matching. We conduct a controlled comparison by training all models from scratch using identical datasets, training configurations, and similar backbone architectures. Performance is evaluated across multiple axes, including generation quality, robustness to inference configurations, scalability, adherence to both textual and temporally aligned conditioning, and editing capabilities in the form of audio inpainting. This comparative study sheds light on distinct strengths and limitations of each paradigm, providing actionable insights that can inform future architectural and training decisions in the evolving landscape of text-to-music generation. Sampled audio examples are available at: https://huggingface.co/spaces/ortal1602/ARvsFM
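
The conditional flow-matching side of the comparison can be illustrated with the standard rectified-flow style objective: regress a velocity network toward (x1 - x0) at a random point on the straight path between noise and data. This is a generic sketch with placeholder networks and toy latents, not the models trained in the study.

```python
# Conditional flow-matching training step with a toy velocity network.
import torch
import torch.nn as nn

dim = 64
v_net = nn.Sequential(nn.Linear(dim + dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

def flow_matching_loss(x1, cond):
    x0 = torch.randn_like(x1)                        # noise sample
    t = torch.rand(x1.size(0), 1)                    # time in [0, 1)
    xt = (1 - t) * x0 + t * x1                       # point on the straight path
    target_v = x1 - x0                               # constant velocity along that path
    pred_v = v_net(torch.cat([xt, cond, t], dim=-1))
    return torch.mean((pred_v - target_v) ** 2)

x1 = torch.randn(16, dim)                            # "music latents" (toy)
cond = torch.randn(16, dim)                          # text-conditioning embedding (toy)
loss = flow_matching_loss(x1, cond)
loss.backward()
print(float(loss))
```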

[20] arXiv:2509.00813 (replaced) [pdf, html, other]
Title: AImoclips: A Benchmark for Evaluating Emotion Conveyance in Text-to-Music Generation
Gyehun Go, Satbyul Han, Ahyeon Choi, Eunjin Choi, Juhan Nam, Jeong Mi Park
Comments: To appear at HCMIR25: 3rd Workshop on Human-Centered Music Information Research
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Recent advances in text-to-music (TTM) generation have enabled controllable and expressive music creation using natural language prompts. However, the emotional fidelity of TTM systems remains largely underexplored compared to human preference or text alignment. In this study, we introduce AImoclips, a benchmark for evaluating how well TTM systems convey intended emotions to human listeners, covering both open-source and commercial models. We selected 12 emotion intents spanning four quadrants of the valence-arousal space, and used six state-of-the-art TTM systems to generate over 1,000 music clips. A total of 111 participants rated the perceived valence and arousal of each clip on a 9-point Likert scale. Our results show that commercial systems tend to produce music perceived as more pleasant than intended, while open-source systems tend to perform the opposite. Emotions are more accurately conveyed under high-arousal conditions across all models. Additionally, all systems exhibit a bias toward emotional neutrality, highlighting a key limitation in affective controllability. This benchmark offers valuable insights into model-specific emotion rendering characteristics and supports future development of emotionally aligned TTM systems.

[21] arXiv:2509.01153 (replaced) [pdf, html, other]
Title: EZhouNet: A framework based on graph neural network and anchor interval for the respiratory sound event detection
Yun Chu, Qiuhao Wang, Enze Zhou, Qian Liu, Gang Zheng
Journal-ref: Biomedical Signal Processing and Control, February 2026 (journal article)
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Auscultation is a key method for early diagnosis of respiratory and pulmonary diseases, relying on skilled healthcare professionals. However, the process is often subjective, with variability between experts. As a result, numerous deep learning-based automatic classification methods have emerged, most of which focus on respiratory sound classification. In contrast, research on respiratory sound event detection remains limited. Existing sound event detection methods typically rely on frame-level predictions followed by post-processing to generate event-level outputs, making interval boundaries challenging to learn directly. Furthermore, many approaches can only handle fixed-length audio, limiting their applicability to variable-length respiratory sounds. Additionally, the impact of respiratory sound location information on detection performance has not been extensively explored. To address these issues, we propose a graph neural network-based framework with anchor intervals, capable of handling variable-length audio and providing more precise temporal localization for abnormal respiratory sound events. Our method improves both the flexibility and applicability of respiratory sound detection. Experiments on the SPRSound 2024 and HF Lung V1 datasets demonstrate the effectiveness of the proposed approach, and incorporating respiratory position information enhances the discrimination between abnormal sounds. The reference implementation is available at https://github.com/chumingqian/EzhouNet.

[22] arXiv:2509.02020 (replaced) [pdf, html, other]
Title: FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot
Kun Xie, Feiyu Shen, Junjie Li, Fenglong Xie, Xu Tang, Yao Hu
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Current dialogue generation approaches typically require the complete dialogue text before synthesis and produce a single, inseparable speech containing all voices, making them unsuitable for interactive chat; moreover, they suffer from unstable synthesis, inaccurate speaker transitions, and incoherent prosody. In this work, we present FireRedTTS-2, a long-form streaming TTS system for multi-speaker dialogue generation, delivering stable, natural speech with reliable speaker switching and context-aware prosody. A new 12.5Hz streaming speech tokenizer accelerates training and inference, extends maximum dialogue length, encodes richer semantics to stabilize text-to-token modeling and supports high-fidelity streaming generation for real-time applications. We adopt a text-speech interleaved format, concatenating speaker-labeled text with aligned speech tokens in chronological order, and model it with a dual-transformer: a large decoder-only transformer predicts tokens at the first layer, and a smaller one completes subsequent layers. Experimental results show that FireRedTTS-2 integrates seamlessly with chat frameworks and, with minimal fine-tuning, produces emotionally expressive speech guided by implicit contextual cues. In podcast generation, it surpasses existing systems including MoonCast, Zipvoice-Dialogue, and MOSS-TTSD in objective intelligibility, speaker-turn reliability, and perceived naturalness with context-consistent prosody. Our demos are available at https://fireredteam.github.io/demos/firered_tts_2.
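
A rough sketch of what a text-speech interleaved training sequence could look like, with hypothetical speaker-tag and token IDs (the paper's tokenizer and vocabulary are not reproduced): speaker-labeled text tokens and their aligned speech tokens are concatenated turn by turn in chronological order.

```python
# Build a flat, chronologically ordered token sequence from dialogue turns.
def interleave_dialogue(turns, spk_tag_ids):
    """turns: list of (speaker, text_token_ids, speech_token_ids)."""
    seq = []
    for speaker, text_ids, speech_ids in turns:
        seq.append(spk_tag_ids[speaker])   # speaker tag, e.g. [S1] / [S2]
        seq.extend(text_ids)               # the turn's text tokens
        seq.extend(speech_ids)             # the aligned speech tokens
    return seq

spk_tag_ids = {"S1": 9001, "S2": 9002}     # hypothetical special-token IDs
turns = [
    ("S1", [12, 47, 88], [501, 502, 503, 504]),
    ("S2", [23, 61], [601, 602, 603]),
]
print(interleave_dialogue(turns, spk_tag_ids))
```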
