Sound (cs.SD)

  • New submissions
  • Cross-lists
  • Replacements


Showing new listings for Friday, 11 July 2025

Total of 23 entries

New submissions (showing 11 of 11 entries)

[1] arXiv:2507.07270 [pdf, html, other]
Title: Audio-Visual Speech Separation via Bottleneck Iterative Network
Sidong Zhang, Shiv Shankar, Trang Nguyen, Andrea Fanelli, Madalina Fiterau
Comments: Accepted to the Machine Learning for Audio Workshop at the 42nd International Conference on Machine Learning
Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Integration of information from non-auditory cues can significantly improve the performance of speech-separation models. Often such models use deep modality-specific networks to obtain unimodal features, and risk being too costly or lightweight but lacking capacity. In this work, we present an iterative representation refinement approach called Bottleneck Iterative Network (BIN), a technique that repeatedly progresses through a lightweight fusion block, while bottlenecking fusion representations by fusion tokens. This helps improve the capacity of the model, while avoiding major increase in model size and balancing between the model performance and training cost. We test BIN on challenging noisy audio-visual speech separation tasks, and show that our approach consistently outperforms state-of-the-art benchmark models with respect to SI-SDRi on NTCD-TIMIT and LRS3+WHAM! datasets, while simultaneously achieving a reduction of more than 50% in training and GPU inference time across nearly all settings.
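
For a concrete picture of the fusion-token bottleneck described above, here is a minimal PyTorch sketch (not the authors' implementation): the token count, dimensions, and use of cross-attention in both directions are assumptions about the general idea, not BIN's exact design.

```python
# Minimal sketch (not the authors' code): audio/visual streams are repeatedly
# fused through a lightweight block whose only interface is a few fusion tokens.
import torch
import torch.nn as nn

class BottleneckFusionBlock(nn.Module):
    def __init__(self, dim: int, num_tokens: int = 4, heads: int = 4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))  # learned fusion tokens
        self.collect = nn.MultiheadAttention(dim, heads, batch_first=True)    # tokens <- modalities
        self.broadcast = nn.MultiheadAttention(dim, heads, batch_first=True)  # modalities <- tokens

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (B, Ta, D), visual: (B, Tv, D)
        B = audio.shape[0]
        tok = self.tokens.unsqueeze(0).expand(B, -1, -1)
        both = torch.cat([audio, visual], dim=1)
        tok, _ = self.collect(tok, both, both)   # bottleneck: cross-modal info passes through tokens
        audio = audio + self.broadcast(audio, tok, tok)[0]
        visual = visual + self.broadcast(visual, tok, tok)[0]
        return audio, visual

# Iterative refinement: reuse the same lightweight block several times.
block = BottleneckFusionBlock(dim=64)
a, v = torch.randn(2, 100, 64), torch.randn(2, 25, 64)
for _ in range(3):                 # the number of refinement iterations is an assumption
    a, v = block(a, v)
print(a.shape, v.shape)
```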

[2] arXiv:2507.07318 [pdf, html, other]
Title: SonicMotion: Dynamic Spatial Audio Soundscapes with Latent Diffusion Models
Christian Templin, Yanda Zhu, Hao Wang
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Spatial audio is an integral part of immersive entertainment, such as VR/AR, and has seen increasing popularity in cinema and music as well. The most common format of spatial audio is described as first-order Ambisonics (FOA). We seek to extend recent advancements in FOA generative AI models to enable the generation of 3D scenes with dynamic sound sources. Our proposed end-to-end model, SonicMotion, comes in two variations which vary in their user input and level of precision in sound source localization. In addition to our model, we also present a new dataset of simulated spatial audio-caption pairs. Evaluation of our models demonstrates that they are capable of matching the semantic alignment and audio quality of state-of-the-art models while capturing the desired spatial attributes.
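
The first-order Ambisonics format mentioned above can be illustrated with a small numpy sketch that encodes a mono signal at a (possibly time-varying) direction into the four FOA channels; the ACN/SN3D channel convention used here is an assumption, not necessarily what the paper or its dataset uses.

```python
# Minimal sketch: encode a mono source into first-order Ambisonics (FOA).
# Channel ordering/normalization (ACN/SN3D: W, Y, Z, X) is an assumption.
import numpy as np

def encode_foa(mono: np.ndarray, azimuth: np.ndarray, elevation: np.ndarray) -> np.ndarray:
    """mono: (T,); azimuth/elevation in radians, broadcastable to (T,). Returns (4, T)."""
    w = mono                                            # omnidirectional component
    y = mono * np.sin(azimuth) * np.cos(elevation)      # left-right
    z = mono * np.sin(elevation)                        # up-down
    x = mono * np.cos(azimuth) * np.cos(elevation)      # front-back
    return np.stack([w, y, z, x])

# A 440 Hz source sweeping from the front (0 rad) to the left (pi/2) over one second.
t = np.linspace(0.0, 1.0, 16000)
signal = np.sin(2 * np.pi * 440.0 * t)
foa = encode_foa(signal, azimuth=np.pi / 2 * t, elevation=np.zeros_like(t))
print(foa.shape)  # (4, 16000)
```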

[3] arXiv:2507.07384 [pdf, html, other]
Title: VP-SelDoA: Visual-prompted Selective DoA Estimation of Target Sound via Semantic-Spatial Matching
Yu Chen, Xinyuan Qian, Hongxu Zhu, Jiadong Wang, Kainan Chen, Haizhou Li
Comments: Under review
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Audio-visual sound source localization (AV-SSL) identifies the position of a sound source by exploiting the complementary strengths of auditory and visual signals. However, existing AV-SSL methods encounter three major challenges: 1) inability to selectively isolate the target sound source in multi-source scenarios, 2) misalignment between semantic visual features and spatial acoustic features, and 3) overreliance on paired audio-visual data. To overcome these limitations, we introduce Cross-Instance Audio-Visual Localization (CI-AVL), a novel task that leverages images from different instances of the same sound event category to localize target sound sources, thereby reducing dependence on paired data while enhancing generalization capabilities. Our proposed VP-SelDoA tackles this challenging task through a semantic-level modality fusion and employs a Frequency-Temporal ConMamba architecture to generate target-selective masks for sound isolation. We further develop a Semantic-Spatial Matching mechanism that aligns the heterogeneous semantic and spatial features via integrated cross- and self-attention mechanisms. To facilitate the CI-AVL research, we construct a large-scale dataset named VGG-SSL, comprising 13,981 spatial audio clips across 296 sound event categories. Extensive experiments show that our proposed method outperforms state-of-the-art audio-visual localization methods, achieving a mean absolute error (MAE) of 12.04 and an accuracy (ACC) of 78.23%.
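
The reported MAE and ACC for direction-of-arrival estimation can be illustrated with a short numpy sketch; the angular wrap-around handling and the accuracy threshold below are assumptions, since the paper's exact evaluation protocol is not given here.

```python
# Minimal sketch: mean absolute angular error and accuracy-within-threshold for DoA
# estimates in degrees. The wrap-around handling and 5-degree threshold are assumptions.
import numpy as np

def doa_metrics(pred_deg: np.ndarray, true_deg: np.ndarray, threshold_deg: float = 5.0):
    diff = np.abs(pred_deg - true_deg) % 360.0
    err = np.minimum(diff, 360.0 - diff)      # shortest angular distance
    mae = float(err.mean())
    acc = float((err <= threshold_deg).mean())
    return mae, acc

pred = np.array([10.0, 355.0, 90.0, 182.0])
true = np.array([12.0,   2.0, 70.0, 180.0])
print(doa_metrics(pred, true))  # MAE over the shortest arc, fraction within 5 degrees
```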

[4] arXiv:2507.07526 [pdf, html, other]
Title: DMF2Mel: A Dynamic Multiscale Fusion Network for EEG-Driven Mel Spectrogram Reconstruction
Cunhang Fan, Sheng Zhang, Jingjing Zhang, Enrui Liu, Xinhui Li, Minggang Zhao, Zhao Lv
Comments: Accepted by ACM MM 2025
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Decoding speech from brain signals is a challenging research problem. Although existing technologies have made progress in reconstructing the mel spectrograms of auditory stimuli at the word or letter level, there remain core challenges in the precise reconstruction of minute-level continuous imagined speech: traditional models struggle to balance the efficiency of temporal dependency modeling and information retention in long-sequence decoding. To address this issue, this paper proposes the Dynamic Multiscale Fusion Network (DMF2Mel), which consists of four core components: the Dynamic Contrastive Feature Aggregation Module (DC-FAM), the Hierarchical Attention-Guided Multi-Scale Network (HAMS-Net), the SplineMap attention mechanism, and the bidirectional state space module (convMamba). Specifically, the DC-FAM separates speech-related "foreground features" from noisy "background features" through local convolution and global attention mechanisms, effectively suppressing interference and enhancing the representation of transient signals. HAMS-Net, based on the U-Net framework, achieves cross-scale fusion of high-level semantics and low-level details. The SplineMap attention mechanism integrates the Adaptive Gated Kolmogorov-Arnold Network (AGKAN) to combine global context modeling with spline-based local fitting. The convMamba captures long-range temporal dependencies with linear complexity and enhances nonlinear dynamic modeling capabilities. Results on the SparrKULee dataset show that DMF2Mel achieves a Pearson correlation coefficient of 0.074 in mel spectrogram reconstruction for known subjects (a 48% improvement over the baseline) and 0.048 for unknown subjects (a 35% improvement over the baseline). Code is available at: https://github.com/fchest/DMF2Mel.
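
The Pearson correlation reported above can be computed per sample between a reconstructed and a reference mel spectrogram; averaging over flattened spectrograms, as in the sketch below, is an assumption about the exact protocol.

```python
# Minimal sketch: Pearson correlation between reconstructed and reference mel
# spectrograms, averaged over a batch. Flattening each spectrogram is an assumption.
import numpy as np

def mel_pearson(pred: np.ndarray, ref: np.ndarray) -> float:
    """pred, ref: (batch, n_mels, frames)."""
    scores = []
    for p, r in zip(pred, ref):
        scores.append(np.corrcoef(p.ravel(), r.ravel())[0, 1])
    return float(np.mean(scores))

rng = np.random.default_rng(0)
ref = rng.standard_normal((8, 80, 200))
pred = ref + 0.5 * rng.standard_normal(ref.shape)   # noisy stand-in for a reconstruction
print(round(mel_pearson(pred, ref), 3))
```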

[5] arXiv:2507.07764 [pdf, html, other]
Title: Assessing the Alignment of Audio Representations with Timbre Similarity Ratings
Haokun Tian, Stefan Lattner, Charalampos Saitis
Comments: Accepted to ISMIR 2025
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Psychoacoustical so-called "timbre spaces" map perceptual similarity ratings of instrument sounds onto low-dimensional embeddings via multidimensional scaling, but suffer from scalability issues and are incapable of generalization. Recent results from audio (music and speech) quality assessment as well as image similarity have shown that deep learning is able to produce embeddings that align well with human perception while being largely free from these constraints. Although the existing human-rated timbre similarity data is not large enough to train deep neural networks (2,614 pairwise ratings on 334 audio samples), it can serve as test-only data for audio models. In this paper, we introduce metrics to assess the alignment of diverse audio representations with human judgments of timbre similarity by comparing both the absolute values and the rankings of embedding distances to human similarity ratings. Our evaluation involves three signal-processing-based representations, twelve representations extracted from pre-trained models, and three representations extracted from a novel sound matching model. Among them, the style embeddings inspired by image style transfer, extracted from the CLAP model and the sound matching model, remarkably outperform the others, showing their potential in modeling timbre similarity.
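
Alignment metrics of the kind described above (comparing both absolute values and rankings of embedding distances to human ratings) can be sketched as follows; the cosine distance, and negating similarity ratings to obtain dissimilarities, are assumptions rather than the paper's exact choices.

```python
# Minimal sketch: compare embedding distances to human timbre similarity ratings by
# absolute value (Pearson) and by ranking (Spearman). Cosine distance and the
# negation of similarity ratings are assumptions.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from scipy.spatial.distance import cosine

def alignment_scores(embeddings: np.ndarray, pairs, similarity_ratings: np.ndarray):
    """embeddings: (n_sounds, dim); pairs: list of (i, j); ratings: higher = more similar."""
    dists = np.array([cosine(embeddings[i], embeddings[j]) for i, j in pairs])
    dissim = -similarity_ratings               # larger distance should match larger dissimilarity
    return pearsonr(dists, dissim)[0], spearmanr(dists, dissim)[0]

rng = np.random.default_rng(0)
emb = rng.standard_normal((334, 128))          # e.g. one embedding per audio sample
pairs = [(i, (i + 7) % 334) for i in range(300)]
ratings = rng.uniform(0, 1, size=len(pairs))   # stand-in for human ratings
print(alignment_scores(emb, pairs, ratings))
```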

[6] arXiv:2507.07799 [pdf, html, other]
Title: SecureSpeech: Prompt-based Speaker and Content Protection
Belinda Soh Hui Hui, Xiaoxiao Miao, Xin Wang
Comments: Accepted by the IEEE International Joint Conference on Biometrics (IJCB) 2025
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Given the increasing privacy concerns from identity theft and the re-identification of speakers through content in the speech field, this paper proposes a prompt-based speech generation pipeline that ensures dual anonymization of both speaker identity and spoken content. This is addressed through 1) generating a speaker identity unlinkable to the source speaker, controlled by descriptors, and 2) replacing sensitive content within the original text using a named entity recognition model and a large language model. The pipeline utilizes the anonymized speaker identity and text to generate high-fidelity, privacy-friendly speech via a text-to-speech synthesis model. Experimental results demonstrate an achievement of significant privacy protection while maintaining a decent level of content retention and audio quality. This paper also investigates the impact of varying speaker descriptions on the utility and privacy of generated speech to determine potential biases.

[7] arXiv:2507.07806 [pdf, html, other]
Title: End-to-end Acoustic-linguistic Emotion and Intent Recognition Enhanced by Semi-supervised Learning
Zhao Ren, Rathi Adarshi Rammohan, Kevin Scheck, Sheng Li, Tanja Schultz
Comments: Accepted by EMBC 2025
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Emotion and intent recognition from speech is essential and has been widely investigated in human-computer interaction. The rapid development of social media platforms, chatbots, and other technologies has led to a large volume of speech data streaming from users. Nevertheless, annotating such data manually is expensive, making it challenging to train machine learning models for recognition purposes. To this end, we propose applying semi-supervised learning to incorporate a large scale of unlabelled data alongside a relatively smaller set of labelled data. We train end-to-end acoustic and linguistic models, each employing multi-task learning for emotion and intent recognition. Two semi-supervised learning approaches, including fix-match learning and full-match learning, are compared. The experimental results demonstrate that the semi-supervised learning approaches improve model performance in speech emotion and intent recognition from both acoustic and text data. The late fusion of the best models outperforms the acoustic and text baselines by joint recognition balance metrics of 12.3% and 10.4%, respectively.
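
Fix-match-style semi-supervised learning, as referenced above, typically keeps only confident pseudo-labels from a weakly augmented view and trains the strongly augmented view on them; the 0.95 threshold and cross-entropy formulation below are generic FixMatch defaults, not necessarily the paper's recipe.

```python
# Minimal sketch of a FixMatch-style unlabelled loss: pseudo-label confident
# predictions on a weakly augmented view, train the strongly augmented view on them.
# The 0.95 confidence threshold is a common default, assumed here.
import torch
import torch.nn.functional as F

def fixmatch_unlabelled_loss(logits_weak: torch.Tensor,
                             logits_strong: torch.Tensor,
                             threshold: float = 0.95) -> torch.Tensor:
    probs = torch.softmax(logits_weak.detach(), dim=-1)
    conf, pseudo = probs.max(dim=-1)                   # confidence and pseudo-label per sample
    mask = (conf >= threshold).float()
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * mask).mean()                        # low-confidence samples contribute nothing

# Toy usage: 16 unlabelled utterances, 7 emotion classes.
lw, ls = torch.randn(16, 7), torch.randn(16, 7)
print(fixmatch_unlabelled_loss(lw, ls))
```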

[8] arXiv:2507.07867 [pdf, html, other]
Title: Re-Bottleneck: Latent Re-Structuring for Neural Audio Autoencoders
Dimitrios Bralios, Jonah Casebeer, Paris Smaragdis
Comments: Accepted by IEEE MLSP 2025
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Neural audio codecs and autoencoders have emerged as versatile models for audio compression, transmission, feature-extraction, and latent-space generation. However, a key limitation is that most are trained to maximize reconstruction fidelity, often neglecting the specific latent structure necessary for optimal performance in diverse downstream applications. We propose a simple, post-hoc framework to address this by modifying the bottleneck of a pre-trained autoencoder. Our method introduces a "Re-Bottleneck", an inner bottleneck trained exclusively through latent space losses to instill user-defined structure. We demonstrate the framework's effectiveness in three experiments. First, we enforce an ordering on latent channels without sacrificing reconstruction quality. Second, we align latents with semantic embeddings, analyzing the impact on downstream diffusion modeling. Third, we introduce equivariance, ensuring that a filtering operation on the input waveform directly corresponds to a specific transformation in the latent space. Ultimately, our Re-Bottleneck framework offers a flexible and efficient way to tailor representations of neural audio models, enabling them to seamlessly meet the varied demands of different applications with minimal additional training.
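
A hedged sketch of the "inner bottleneck trained only with latent-space losses" idea follows: a small encoder/decoder inserted in a frozen autoencoder's latent space, trained to reconstruct the latents while matching an external (e.g., semantic) embedding. The layer sizes, loss weighting, and the particular alignment target are assumptions, not the paper's configuration.

```python
# Minimal sketch (assumed shapes/losses, not the authors' code): an inner
# "re-bottleneck" around a frozen autoencoder's latents, trained purely in latent space.
import torch
import torch.nn as nn

class ReBottleneck(nn.Module):
    def __init__(self, latent_dim: int = 64, inner_dim: int = 16):
        super().__init__()
        self.inner_enc = nn.Linear(latent_dim, inner_dim)
        self.inner_dec = nn.Linear(inner_dim, latent_dim)

    def forward(self, z: torch.Tensor):
        h = self.inner_enc(z)          # structured inner latent
        z_hat = self.inner_dec(h)      # projected back so the frozen decoder still works
        return h, z_hat

rb = ReBottleneck()
z = torch.randn(8, 64)                 # latents from a frozen, pre-trained audio autoencoder
sem = torch.randn(8, 16)               # e.g. semantic embeddings to align with (assumption)
h, z_hat = rb(z)
loss = nn.functional.mse_loss(z_hat, z) + 0.1 * nn.functional.mse_loss(h, sem)
loss.backward()                        # only the inner bottleneck receives gradients
print(float(loss))
```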

[9] arXiv:2507.07877 [pdf, html, other]
Title: Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models
Chen Feng, Yicheng Lin, Shaojie Zhuo, Chenzheng Su, Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Xiaopeng Zhang
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Recent advances in Automatic Speech Recognition (ASR) have demonstrated remarkable accuracy and robustness in diverse audio applications, such as live transcription and voice command processing. However, deploying these models on resource constrained edge devices (e.g., IoT device, wearables) still presents substantial challenges due to strict limits on memory, compute and power. Quantization, particularly Post-Training Quantization (PTQ), offers an effective way to reduce model size and inference cost without retraining. Despite its importance, the performance implications of various advanced quantization methods and bit-width configurations on ASR models remain unclear. In this work, we present a comprehensive benchmark of eight state-of-the-art (SOTA) PTQ methods applied to two leading edge-ASR model families, Whisper and Moonshine. We systematically evaluate model performances (i.e., accuracy, memory I/O and bit operations) across seven diverse datasets from the open ASR leaderboard, analyzing the impact of quantization and various configurations on both weights and activations. Built on an extension of the LLM compression toolkit, our framework integrates edge-ASR models, diverse advanced quantization algorithms, a unified calibration and evaluation data pipeline, and detailed analysis tools. Our results characterize the trade-offs between efficiency and accuracy, demonstrating that even 3-bit quantization can succeed on high capacity models when using advanced PTQ techniques. These findings provide valuable insights for optimizing ASR models on low-power, always-on edge devices.
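
Post-training quantization of weights to a low bit-width can be illustrated with a simple symmetric uniform quantizer; the advanced PTQ methods in the benchmark go well beyond this (calibration data, per-channel scales, error compensation), so the sketch below is only a toy reference point.

```python
# Minimal sketch: symmetric, per-tensor uniform quantization of a weight matrix to
# k bits. The SOTA PTQ methods benchmarked in the paper are far more sophisticated.
import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int = 3):
    qmax = 2 ** (bits - 1) - 1                   # e.g. 3 for signed 3-bit codes
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, q.astype(np.float32) * scale       # integer codes and dequantized weights

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, w_hat = quantize_dequantize(w, bits=3)
print("mean abs error:", float(np.mean(np.abs(w - w_hat))))
```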

[10] arXiv:2507.07879 [pdf, other]
Title: LISTEN: Lightweight Industrial Sound-representable Transformer for Edge Notification
Changheon Han, Yun Seok Kang, Yuseop Sim, Martin Byung-Guk Jun, Hyung Wook Park
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Deep learning-based machine listening is broadening the scope of industrial acoustic analysis for applications like anomaly detection and predictive maintenance, thereby improving manufacturing efficiency and reliability. Nevertheless, its reliance on large, task-specific annotated datasets for every new task limits widespread implementation on shop floors. While emerging sound foundation models aim to alleviate data dependency, they are too large and computationally expensive, requiring cloud infrastructure or high-end hardware that is impractical for on-site, real-time deployment. We address this gap with LISTEN (Lightweight Industrial Sound-representable Transformer for Edge Notification), a kilobyte-sized industrial sound foundation model. Using knowledge distillation, LISTEN runs in real-time on low-cost edge devices. On benchmark downstream tasks, it performs nearly identically to its much larger parent model, even when fine-tuned with minimal datasets and training resources. Beyond the model itself, we demonstrate its real-world utility by integrating LISTEN into a complete machine monitoring framework on an edge device with an Industrial Internet of Things (IIoT) sensor and system, validating its performance and generalization capabilities on a live manufacturing shop floor.
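
Knowledge distillation of the kind used to shrink the parent model is commonly implemented as a temperature-scaled KL loss between teacher and student outputs; the temperature and loss weighting below are generic defaults, not LISTEN's actual settings.

```python
# Minimal sketch: temperature-scaled distillation loss between a large teacher and a
# small student. T=4.0 and equal weighting are generic assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T: float = 4.0, alpha: float = 0.5):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard

s, t = torch.randn(8, 10), torch.randn(8, 10)    # toy logits for 10 sound classes
y = torch.randint(0, 10, (8,))
print(distillation_loss(s, t, y))
```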

[11] arXiv:2507.07954 [pdf, html, other]
Title: Input Conditioned Layer Dropping in Speech Foundation Models
Abdul Hannan, Daniele Falavigna, Alessio Brutti
Comments: Accepted by IEEE MLSP 2025
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)

Curating foundation speech models for edge and IoT settings, where computational resources vary over time, requires dynamic architectures featuring adaptable reduction strategies. One emerging approach is layer dropping ($\mathcal{LD}$) which skips a fraction of the layers of a backbone network during inference to reduce the computational load. This allows transforming static models into dynamic ones. However, existing approaches exhibit limitations either in the mode of selecting layers or by significantly modifying the neural architecture. To this end, we propose input-driven $\mathcal{LD}$ that employs the network's input features and a lightweight layer selecting network to determine the optimum combination of processing layers. Extensive experimentation on 4 speech and audio public benchmarks, using two different pre-trained foundation models, demonstrates the effectiveness of our approach, thoroughly outperforming random dropping and producing on-par (or better) results to early exit.
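
A hedged sketch of input-driven layer dropping: a lightweight selector maps pooled input features to per-layer keep scores, and low-scoring layers are skipped at inference. The selector architecture, pooling, and thresholding rule are assumptions, not the paper's method.

```python
# Minimal sketch (assumptions throughout): a tiny selector network decides, per
# forward pass, which backbone layers to execute; skipped layers act as identity.
import torch
import torch.nn as nn

class InputDrivenLD(nn.Module):
    def __init__(self, dim: int = 64, n_layers: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in range(n_layers)
        )
        self.selector = nn.Linear(dim, n_layers)   # lightweight layer-selection network

    def forward(self, x: torch.Tensor, keep_threshold: float = 0.5):
        # x: (B, T, D). Score each layer from mean-pooled input features.
        keep_prob = torch.sigmoid(self.selector(x.mean(dim=1)))      # (B, n_layers)
        keep = keep_prob.mean(dim=0) > keep_threshold                # one decision per batch (simplification)
        for layer, k in zip(self.layers, keep):
            if k:                                                    # skipped layers cost nothing
                x = layer(x)
        return x, keep

model = InputDrivenLD()
out, kept = model(torch.randn(2, 50, 64))
print(out.shape, kept.tolist())
```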

Cross submissions (showing 2 of 2 entries)

[12] arXiv:2507.07396 (cross-list from cs.MM) [pdf, html, other]
Title: IML-Spikeformer: Input-aware Multi-Level Spiking Transformer for Speech Processing
Zeyang Song, Shimin Zhang, Yuhong Chou, Jibin Wu, Haizhou Li
Comments: Under review
Subjects: Multimedia (cs.MM); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Spiking Neural Networks (SNNs), inspired by biological neural mechanisms, represent a promising neuromorphic computing paradigm that offers energy-efficient alternatives to traditional Artificial Neural Networks (ANNs). Despite proven effectiveness, SNN architectures have struggled to achieve competitive performance on large-scale speech processing tasks. Two key challenges hinder progress: (1) the high computational overhead during training caused by multi-timestep spike firing, and (2) the absence of large-scale SNN architectures tailored to speech processing tasks. To overcome these issues, we introduce the Input-aware Multi-Level Spikeformer, i.e. IML-Spikeformer, a spiking Transformer architecture specifically designed for large-scale speech processing. Central to our design is the Input-aware Multi-Level Spike (IMLS) mechanism, which simulates multi-timestep spike firing within a single timestep using an adaptive, input-aware thresholding scheme. IML-Spikeformer further integrates a Reparameterized Spiking Self-Attention (RepSSA) module with a Hierarchical Decay Mask (HDM), forming the HD-RepSSA module. This module enhances the precision of attention maps and enables modeling of multi-scale temporal dependencies in speech signals. Experiments demonstrate that IML-Spikeformer achieves word error rates of 6.0\% on AiShell-1 and 3.4\% on Librispeech-960, comparable to conventional ANN transformers while reducing theoretical inference energy consumption by 4.64$\times$ and 4.32$\times$ respectively. IML-Spikeformer marks an advance in scalable SNN architectures for large-scale speech processing in both task performance and energy efficiency.

[13] arXiv:2507.07631 (cross-list from eess.AS) [pdf, html, other]
Title: Generic Speech Enhancement with Self-Supervised Representation Space Loss
Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Ryo Masumura
Comments: 22 pages, 3 figures. Accepted by Frontiers in Signal Processing
Journal reference: Frontiers in Signal Processing 5:1587969, 2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)

Single-channel speech enhancement is utilized in various tasks to mitigate the effect of interfering signals. Conventionally, to ensure the speech enhancement performs optimally, the speech enhancement has needed to be tuned for each task. Thus, generalizing speech enhancement models to unknown downstream tasks has been challenging. This study aims to construct a generic speech enhancement front-end that can improve the performance of back-ends to solve multiple downstream tasks. To this end, we propose a novel training criterion that minimizes the distance between the enhanced and the ground truth clean signal in the feature representation domain of self-supervised learning models. Since self-supervised learning feature representations effectively express high-level speech information useful for solving various downstream tasks, the proposal is expected to make speech enhancement models preserve such information. Experimental validation demonstrates that the proposal improves the performance of multiple speech tasks while maintaining the perceptual quality of the enhanced signal.
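
The proposed training criterion amounts to a distance in a frozen self-supervised model's feature space. Below is a hedged PyTorch sketch with a stand-in frozen encoder; the real work uses actual SSL speech models, and the L1 distance is an assumption about the exact metric.

```python
# Minimal sketch: train an enhancement model by minimizing the distance between the
# SSL features of its output and of the clean reference. The frozen "ssl_encoder"
# below is a stand-in for a real self-supervised speech model.
import torch
import torch.nn as nn

ssl_encoder = nn.Sequential(                       # stand-in for a frozen SSL feature extractor
    nn.Conv1d(1, 32, kernel_size=10, stride=5), nn.GELU(),
    nn.Conv1d(32, 64, kernel_size=8, stride=4),
)
for p in ssl_encoder.parameters():
    p.requires_grad_(False)

enhancer = nn.Conv1d(1, 1, kernel_size=9, padding=4)   # toy enhancement front-end

noisy = torch.randn(4, 1, 16000)
clean = torch.randn(4, 1, 16000)
enhanced = enhancer(noisy)
loss = nn.functional.l1_loss(ssl_encoder(enhanced), ssl_encoder(clean))
loss.backward()                                    # gradients flow into the enhancer only
print(float(loss))
```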

Replacement submissions (showing 10 of 10 entries)

[14] arXiv:2411.13766 (replaced) [pdf, html, other]
Title: Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge
Ruiyang Qin, Dancheng Liu, Gelei Xu, Zheyu Yan, Chenhui Xu, Yuting Hu, X. Sharon Hu, Jinjun Xiong, Yiyu Shi
Comments: Accepted by ICCAD'25
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

The combination of Large Language Models (LLM) and Automatic Speech Recognition (ASR), when deployed on edge devices (called edge ASR-LLM), can serve as a powerful personalized assistant to enable audio-based interaction for users. Compared to text-based interaction, edge ASR-LLM allows accessible and natural audio interactions. Unfortunately, existing ASR-LLM models are mainly trained in high-performance computing environments and produce substantial model weights, making them difficult to deploy on edge devices. More importantly, to better serve users' personalized needs, the ASR-LLM must be able to learn from each distinct user, given that audio input often contains highly personalized characteristics that necessitate personalized on-device training. Since individually fine-tuning the ASR or LLM often leads to suboptimal results due to modality-specific limitations, end-to-end training ensures seamless integration of audio features and language understanding (cross-modal alignment), ultimately enabling a more personalized and efficient adaptation on edge devices. However, due to the complex training requirements and substantial computational demands of existing approaches, cross-modal alignment between ASR audio and LLM can be challenging on edge devices. In this work, we propose a resource-efficient cross-modal alignment framework that bridges ASR and LLMs on edge devices to handle personalized audio input. Our framework enables efficient ASR-LLM alignment on resource-constrained devices like NVIDIA Jetson Orin (8GB RAM), achieving 50x training time speedup while improving the alignment quality by more than 50\%. To the best of our knowledge, this is the first work to study efficient ASR-LLM alignment on resource-constrained edge devices.

[15] arXiv:2411.19204 (replaced) [pdf, other]
Title: A Voice-based Triage for Type 2 Diabetes using a Conversational Virtual Assistant in the Home Environment
Kelvin Summoogum, Debayan Das, Sathish Kumaran, Sumit Bhagra (MD)
Comments: 8 pages
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Incorporating cloud technology with Internet of Medical Things for ubiquitous healthcare has seen many successful applications in the last decade with the advent of machine learning and deep learning techniques. One of these applications, namely voice-based pathology, has yet to receive notable attention from academia and industry. Applying voice analysis to early detection of fatal diseases holds much promise to improve health outcomes and quality of life of patients. In this paper, we propose a novel application of acoustic machine learning based triaging into commoditised conversational virtual assistant systems to pre-screen for onset of diabetes. Specifically, we developed a triaging system which extracts acoustic features from the voices of n=24 older adults when they converse with a virtual assistant and predict the incidence of Diabetes Mellitus (Type 2) or not. Our triaging system achieved hit-rates of 70% and 60% for male and female older adult subjects, respectively. Our proposed triaging uses 7 non-identifiable voice-based features and can operate within resource-constrained embedded systems running voice-based virtual assistants. This application demonstrates the feasibility of applying voice-based pathology analysis to improve health outcomes of older adults within the home environment by early detection of life-changing chronic conditions like diabetes.

[16] arXiv:2506.04391 (replaced) [pdf, html, other]
Title: Benchmarking Time-localized Explanations for Audio Classification Models
Cecilia Bolaños, Leonardo Pepino, Martin Meza, Luciana Ferrer
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Most modern approaches for audio processing are opaque, in the sense that they do not provide an explanation for their decisions. For this reason, various methods have been proposed to explain the outputs generated by these models. Good explanations can result in interesting insights about the data or the model, as well as increase trust in the system. Unfortunately, evaluating the quality of explanations is far from trivial since, for most tasks, there is no clear ground truth explanation to use as reference. In this work, we propose a benchmark for time-localized explanations for audio classification models that uses time annotations of target events as a proxy for ground truth explanations. We use this benchmark to systematically optimize and compare various approaches for model-agnostic post-hoc explanation, obtaining, in some cases, close to perfect explanations. Finally, we illustrate the utility of the explanations for uncovering spurious correlations.
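
Using event time annotations as proxy ground truth suggests a simple scoring scheme, for example measuring how well per-frame attribution scores rank the annotated event frames above the rest (an ROC-AUC-style score). The exact metric below is an assumption, not necessarily the benchmark's.

```python
# Minimal sketch (assumed metric): score a time-localized explanation by how well
# frame importances rank annotated event frames above the rest, i.e. ROC-AUC with
# the annotation mask as labels.
import numpy as np
from sklearn.metrics import roc_auc_score

def explanation_auc(importance: np.ndarray, event_mask: np.ndarray) -> float:
    """importance: per-frame attribution scores; event_mask: 1 inside the annotated event."""
    return float(roc_auc_score(event_mask, importance))

frames = 100
event_mask = np.zeros(frames, dtype=int)
event_mask[40:60] = 1                              # annotated target event
importance = np.random.default_rng(0).uniform(size=frames)
importance[40:60] += 0.8                           # a (mostly) faithful explanation
print(round(explanation_auc(importance, event_mask), 3))
```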

[17] arXiv:2507.03251 (replaced) [pdf, html, other]
Title: Toward Efficient Speech Emotion Recognition via Spectral Learning and Attention
HyeYoung Lee, Muhammad Nadeem
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Speech Emotion Recognition (SER) traditionally relies on auditory data analysis for emotion classification. Several studies have adopted different methods for SER. However, existing SER methods often struggle to capture subtle emotional variations and generalize across diverse datasets. In this article, we use Mel-Frequency Cepstral Coefficients (MFCCs) as spectral features to bridge the gap between computational emotion processing and human auditory perception. To further improve robustness and feature diversity, we propose a novel 1D-CNN-based SER framework that integrates data augmentation techniques. MFCC features extracted from the augmented data are processed using a 1D Convolutional Neural Network (CNN) architecture enhanced with channel and spatial attention mechanisms. These attention modules allow the model to highlight key emotional patterns, enhancing its ability to capture subtle variations in speech signals. The proposed method delivers cutting-edge performance, achieving the accuracy of 97.49% for SAVEE, 99.23% for RAVDESS, 89.31% for CREMA-D, 99.82% for TESS, 99.53% for EMO-DB, and 96.39% for EMOVO. Experimental results show new benchmarks in SER, demonstrating the effectiveness of our approach in recognizing emotional expressions with high precision. Our evaluation demonstrates that the integration of advanced Deep Learning (DL) methods substantially enhances generalization across diverse datasets, underscoring their potential to advance SER for real-world deployment in assistive technologies and human-computer interaction.
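
A 1D-CNN over MFCC frames with channel and spatial attention can be sketched as below (not the authors' architecture); the layer sizes, the squeeze-and-excitation-style channel attention, and the random stand-in for MFCC features are all assumptions.

```python
# Minimal sketch (not the authors' model): a 1D-CNN block over MFCC frames with
# channel attention (squeeze-and-excitation style) and spatial (temporal) attention.
import torch
import torch.nn as nn

class AttentiveConv1d(nn.Module):
    def __init__(self, in_ch: int = 40, out_ch: int = 64):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(in_ch, out_ch, kernel_size=5, padding=2),
                                  nn.BatchNorm1d(out_ch), nn.ReLU())
        self.channel_att = nn.Sequential(nn.Linear(out_ch, out_ch // 4), nn.ReLU(),
                                         nn.Linear(out_ch // 4, out_ch), nn.Sigmoid())
        self.spatial_att = nn.Sequential(nn.Conv1d(out_ch, 1, kernel_size=7, padding=3),
                                         nn.Sigmoid())

    def forward(self, x):                                     # x: (B, n_mfcc, frames)
        h = self.conv(x)
        c = self.channel_att(h.mean(dim=-1)).unsqueeze(-1)    # which channels matter
        h = h * c
        s = self.spatial_att(h)                               # which frames matter
        return h * s

mfcc = torch.randn(8, 40, 200)      # stand-in for 40 MFCCs over 200 frames
feats = AttentiveConv1d()(mfcc)
print(feats.shape)                  # (8, 64, 200)
```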

[18] arXiv:2411.10927 (replaced) [pdf, other]
Title: Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation
Jisang Park, Minu Kim, DaYoung Hong, Jongha Lee
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Learners of a second language (L2) often unconsciously substitute unfamiliar L2 phonemes with similar phonemes from their native language (L1), even though native speakers of the L2 perceive these sounds as distinct and non-interchangeable. This phonemic substitution leads to deviations from the standard phonological patterns of the L2, creating challenges for learners in acquiring accurate L2 pronunciation. To address this, we propose Inter-linguistic Phonetic Composition (IPC), a novel computational method designed to minimize incorrect phonological transfer by reconstructing L2 phonemes as composite sounds derived from multiple L1 phonemes. Tests with two automatic speech recognition models demonstrated that when L2 speakers produced IPC-generated composite sounds, the recognition rate of target L2 phonemes improved by 20% compared to when their pronunciation was influenced by original phonological transfer patterns. The improvement was observed within a relatively shorter time frame, demonstrating rapid acquisition of the composite sound.

[19] arXiv:2412.18603 (replaced) [pdf, html, other]
Title: Long-Form Speech Generation with Spoken Language Models
Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan
Comments: Accepted to ICML 2025 (oral)
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, textless spoken language models struggle to generate plausible speech past tens of seconds, due to high temporal resolution of speech tokens causing loss of coherence, architectural issues with long-sequence training or extrapolation, and memory costs at inference time. From these considerations we derive SpeechSSM, the first speech language model family to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency on multi-minute generations while still matching them at the utterance level. As we found current spoken language evaluations uninformative, especially in this new long-form setting, we also introduce: LibriSpeech-Long, a benchmark for long-form speech evaluation; new embedding-based and LLM-judged metrics; and quality measurements over length and time. Speech samples, the LibriSpeech-Long dataset, and any future code or model releases can be found at https://google.github.io/tacotron/publications/speechssm/.

[20] arXiv:2502.00718 (replaced) [pdf, html, other]
Title: "I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models
Isha Gupta, David Khachaturov, Robert Mullins
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

The rise of multimodal large language models has introduced innovative human-machine interaction paradigms but also significant challenges in machine learning safety. Audio-Language Models (ALMs) are especially relevant due to the intuitive nature of spoken communication, yet little is known about their failure modes. This paper explores audio jailbreaks targeting ALMs, focusing on their ability to bypass alignment mechanisms. We construct adversarial perturbations that generalize across prompts, tasks, and even base audio samples, demonstrating the first universal jailbreaks in the audio modality, and show that these remain effective in simulated real-world conditions. Beyond demonstrating attack feasibility, we analyze how ALMs interpret these audio adversarial examples and reveal them to encode imperceptible first-person toxic speech - suggesting that the most effective perturbations for eliciting toxic outputs specifically embed linguistic features within the audio signal. These results have important implications for understanding the interactions between different modalities in multimodal models, and offer actionable insights for enhancing defenses against adversarial audio attacks.

[21] arXiv:2505.04382 (replaced) [pdf, html, other]
Title: Discrete Optimal Transport and Voice Conversion
Anton Selitskiy, Maitreya Kocharekar
Comments: 4 pages, 6 figures, 1 table
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)

In this work, we address the voice conversion (VC) task using a vector-based interface. To align audio embeddings between speakers, we employ discrete optimal transport mapping. Our evaluation results demonstrate the high quality and effectiveness of this method. Additionally, we show that applying discrete optimal transport as a post-processing step in audio generation can lead to the incorrect classification of synthetic audio as real.
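
With uniform weights and equal set sizes, discrete optimal transport between two sets of embeddings reduces to an assignment problem; the sketch below uses the Hungarian solver from scipy, and the cosine cost and equal set sizes are assumptions, not necessarily the paper's setup.

```python
# Minimal sketch: map source-speaker embeddings onto target-speaker embeddings via
# discrete optimal transport. With uniform weights and equal set sizes this becomes
# an assignment problem; the cosine cost and Hungarian solver are assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_map(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """source, target: (n, dim). Returns target embeddings reordered to match source."""
    s = source / np.linalg.norm(source, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    cost = 1.0 - s @ t.T                            # cosine distance matrix
    row, col = linear_sum_assignment(cost)          # optimal one-to-one transport plan
    return target[col[np.argsort(row)]]

rng = np.random.default_rng(0)
src = rng.standard_normal((128, 256))               # e.g. frame-level audio embeddings
tgt = rng.standard_normal((128, 256))
converted = ot_map(src, tgt)
print(converted.shape)
```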

[22] arXiv:2506.00981 (replaced) [pdf, html, other]
Title: What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training
Marianne de Heer Kloots, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema, Martijn Bentum
Comments: Accepted to Interspeech 2025. For models, code, and materials, see https://github.com/mdhk/SSL-NL-eval
Journal reference: Proc. INTERSPEECH 2025
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it's less clear to what extent pre-training on specific languages improves language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features as compared to pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is well-detected by trained clustering or classification probes, and partially observable using zero-shot metrics. Furthermore, the language-specific benefit on linguistic feature encoding aligns with downstream performance on Automatic Speech Recognition.

[23] arXiv:2506.15220 (replaced) [pdf, html, other]
Title: video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models
Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Sound (cs.SD)

Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2's captioning accuracy, reducing the captioning error rates by 28\%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance to the state-of-the-art on widely used video question-answering benchmarks among models of similar size. Codes are available at \href{https://github.com/bytedance/video-SALMONN-2}{https://github.com/bytedance/video-SALMONN-2}.
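
The DPO objective that MrDPO builds on is the standard preference loss over preferred/rejected caption log-probabilities relative to a reference model; the sketch below shows that base loss only, with beta as an assumed hyperparameter, and does not include MrDPO's reference-model refresh or LoRA merging.

```python
# Minimal sketch: the standard DPO preference loss that MrDPO builds on. Inputs are
# summed log-probabilities of preferred (w) and rejected (l) captions under the
# policy and the (periodically refreshed) reference model. Beta is an assumption.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage: a batch of 4 caption pairs.
lw, ll = torch.tensor([-40., -55., -38., -60.]), torch.tensor([-45., -50., -44., -61.])
rw, rl = torch.tensor([-42., -54., -40., -59.]), torch.tensor([-44., -52., -43., -60.])
print(dpo_loss(lw, ll, rw, rl))
```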
