Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech

Kim, Taesoo; Jo, Yongsik; Song, Hyunmin; Kim, Taehwan

doi:10.21437/Interspeech.2025-1075

arXiv:2509.14627 (cs)

[Submitted on 18 Sep 2025]

Authors: Taesoo Kim, Yongsik Jo, Hyunmin Song, Taehwan Kim

Abstract: Human conversation involves language, speech, and visual cues, with each medium providing complementary information. For instance, speech conveys a vibe or tone that text alone does not fully capture. While multimodal LLMs focus on generating text responses from diverse inputs, less attention has been paid to generating natural and engaging speech. We propose a human-like agent that generates speech responses based on conversation mood and responsive style information. To achieve this, we build a novel MultiSensory Conversation dataset focused on speech, which enables agents to generate natural speech. We then propose a multimodal LLM-based model that generates text responses and voice descriptions, which are in turn used to generate speech that covers paralinguistic information. Experimental results demonstrate the effectiveness of using both visual and audio modalities in conversation to generate engaging speech. The source code is available at https://github.com/kimtaesu24/MSenC

Comments:	Published in Interspeech 2025
Subjects:	Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2509.14627 [cs.HC]
	(or arXiv:2509.14627v1 [cs.HC] for this version)
	https://doi.org/10.48550/arXiv.2509.14627
Related DOI:	https://doi.org/10.21437/Interspeech.2025-1075

Submission history

From: Taesoo Kim
[v1] Thu, 18 Sep 2025 05:14:10 UTC (783 KB)
