VoxRole: A Comprehensive Benchmark for Evaluating Speech-Based Role-Playing Agents

Wu, Weihao; Cao, Liang; Wu, Xinyu; Lin, Zhiwei; Niu, Rui; Li, Jingbei; Wu, Zhiyong

arXiv:2509.03940 (cs)

[Submitted on 4 Sep 2025]

Title: VoxRole: A Comprehensive Benchmark for Evaluating Speech-Based Role-Playing Agents

Authors: Weihao Wu, Liang Cao, Xinyu Wu, Zhiwei Lin, Rui Niu, Jingbei Li, Zhiyong Wu

Abstract: Recent significant advancements in Large Language Models (LLMs) have greatly propelled the development of Role-Playing Conversational Agents (RPCAs). These systems aim to create immersive user experiences through consistent persona adoption. However, current RPCA research faces dual limitations. First, existing work predominantly focuses on the textual modality, entirely overlooking critical paralinguistic features including intonation, prosody, and rhythm in speech, which are essential for conveying character emotions and shaping vivid identities. Second, the speech-based role-playing domain suffers from a long-standing lack of standardized evaluation benchmarks. Most current spoken dialogue datasets target only fundamental capability assessments, featuring thinly sketched or ill-defined character profiles. Consequently, they fail to effectively quantify model performance on core competencies like long-term persona consistency. To address this critical gap, we introduce VoxRole, the first comprehensive benchmark specifically designed for the evaluation of speech-based RPCAs. The benchmark comprises 13335 multi-turn dialogues, totaling 65.6 hours of speech from 1228 unique characters across 261 movies. To construct this resource, we propose a novel two-stage automated pipeline that first aligns movie audio with scripts and subsequently employs an LLM to systematically build multi-dimensional profiles for each character. Leveraging VoxRole, we conduct a multi-dimensional evaluation of contemporary spoken dialogue models, revealing crucial insights into their respective strengths and limitations in maintaining persona consistency.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Cite as:	arXiv:2509.03940 [cs.CL]
	(or arXiv:2509.03940v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2509.03940

Submission history

From: Weihao Wu
[v1] Thu, 4 Sep 2025 07:03:46 UTC (1,109 KB)
