StereoFoley: Object-Aware Stereo Audio Generation from Video

Karchkhadze, Tornike; Chen, Kuan-Lin; Heydari, Mojtaba; Henzel, Robert; Toso, Alessandro; Souden, Mehrez; Atkins, Joshua

arXiv:2509.18272 (cs)

[Submitted on 22 Sep 2025 (v1), last revised 29 Sep 2025 (this version, v2)]

Authors: Tornike Karchkhadze, Kuan-Lin Chen, Mojtaba (Moji) Heydari, Robert Henzel, Alessandro Toso, Mehrez Souden, Joshua Atkins

Abstract: We present StereoFoley, a video-to-audio generation framework that produces semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz. While recent generative video-to-audio models achieve strong semantic and temporal fidelity, they largely remain limited to mono or fail to deliver object-aware stereo imaging, constrained by the lack of professionally mixed, spatially accurate video-to-audio datasets. First, we develop and train a base model that generates stereo audio from video, achieving state-of-the-art results in both semantic accuracy and synchronization. Next, to overcome dataset limitations, we introduce a synthetic data generation pipeline that combines video analysis, object tracking, and audio synthesis with dynamic panning and distance-based loudness controls, enabling spatially accurate object-aware sound. Finally, we fine-tune the base model on this synthetic dataset, yielding clear object-audio correspondence. Since no established metrics exist for stereo object awareness, we introduce stereo object-awareness measures and validate them through a human listening study, showing strong correlation with perception. This work establishes the first end-to-end framework for stereo object-aware video-to-audio generation, addressing a critical gap and setting a new benchmark in the field.
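The abstract mentions "dynamic panning and distance-based loudness controls" without giving details. The sketch below illustrates the general idea only and is not the authors' pipeline: it assumes a constant-power pan law driven by a tracked object's normalized horizontal position and an inverse-distance gain. The names `stereo_gains`, `render_stereo`, `x_norm`, `distance`, and `ref_distance` are hypothetical and do not come from the paper.

```python
import math


def stereo_gains(x_norm, distance, ref_distance=1.0):
    """Return (left_gain, right_gain) for one time step.

    x_norm:       assumed horizontal object position, 0.0 = left edge, 1.0 = right edge
    distance:     assumed object distance in arbitrary units
    ref_distance: distance at (or below) which no attenuation is applied
    """
    x = min(max(x_norm, 0.0), 1.0)
    # Constant-power pan law: left_gain**2 + right_gain**2 == 1 before attenuation.
    theta = x * math.pi / 2.0
    # Inverse-distance attenuation, capped at unity gain inside ref_distance.
    dist_gain = ref_distance / max(distance, ref_distance)
    return math.cos(theta) * dist_gain, math.sin(theta) * dist_gain


def render_stereo(mono, x_track, dist_track):
    """Apply per-sample pan and distance gains to a mono signal."""
    left, right = [], []
    for sample, x, d in zip(mono, x_track, dist_track):
        gl, gr = stereo_gains(x, d)
        left.append(sample * gl)
        right.append(sample * gr)
    return left, right


if __name__ == "__main__":
    sr = 48000  # sample rate stated in the abstract
    n = sr      # one second of audio
    mono = [math.sin(2 * math.pi * 440.0 * i / sr) for i in range(n)]
    # Hypothetical trajectory: the object moves left to right while receding.
    x_track = [i / (n - 1) for i in range(n)]
    dist_track = [1.0 + 3.0 * i / (n - 1) for i in range(n)]
    left, right = render_stereo(mono, x_track, dist_track)
    print(len(left), len(right))
```

In a real pipeline the position and distance tracks would come from object tracking on the video frames and would need to be upsampled and smoothed from the frame rate to the audio rate; that step is omitted here.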

Subjects:	Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2509.18272 [cs.SD]
	(or arXiv:2509.18272v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2509.18272

Submission history

From: Tornike Karchkhadze
[v1] Mon, 22 Sep 2025 18:00:54 UTC (2,241 KB)
[v2] Mon, 29 Sep 2025 22:57:46 UTC (2,241 KB)
