ArtiScene: Language-Driven Artistic 3D Scene Generation Through Image Intermediary

Gu, Zeqi; Cui, Yin; Li, Zhaoshuo; Wei, Fangyin; Ge, Yunhao; Gu, Jinwei; Liu, Ming-Yu; Davis, Abe; Ding, Yifan

Computer Science > Computer Vision and Pattern Recognition

arXiv:2506.00742 (cs)

[Submitted on 31 May 2025]


Title: ArtiScene: Language-Driven Artistic 3D Scene Generation Through Image Intermediary

Authors: Zeqi Gu, Yin Cui, Zhaoshuo Li, Fangyin Wei, Yunhao Ge, Jinwei Gu, Ming-Yu Liu, Abe Davis, Yifan Ding


Abstract: Designing 3D scenes is traditionally a challenging task that demands both artistic expertise and proficiency with complex software. Recent advances in text-to-3D generation have greatly simplified this process by letting users create scenes based on simple text descriptions. However, as these methods generally require extra training or in-context learning, their performance is often hindered by the limited availability of high-quality 3D data. In contrast, modern text-to-image models learned from web-scale images can generate scenes with diverse, reliable spatial layouts and consistent, visually appealing styles. Our key insight is that instead of learning directly from 3D scenes, we can leverage generated 2D images as an intermediary to guide 3D synthesis. In light of this, we introduce ArtiScene, a training-free automated pipeline for scene design that integrates the flexibility of free-form text-to-image generation with the diversity and reliability of 2D intermediary layouts. First, we generate 2D images from a scene description, then extract the shape and appearance of objects to create 3D models. These models are assembled into the final scene using geometry, position, and pose information derived from the same intermediary image. Being generalizable to a wide range of scenes and styles, ArtiScene outperforms state-of-the-art benchmarks by a large margin in layout and aesthetic quality by quantitative metrics. It also averages a 74.89% winning rate in extensive user studies and 95.07% in GPT-4o evaluation. Project page: https://artiscene-cvpr.github.io/
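The data flow described in the abstract (text to 2D image, image to per-object shape and appearance, objects to 3D models, then assembly using geometry, position, and pose taken from the same image) can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the abstract does not name the models or interfaces involved, so every class, function, and parameter name here (`DetectedObject`, `PlacedAsset`, `build_scene`, and the three stage callables) is a hypothetical placeholder, and the stages are replaced with trivial stubs so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class DetectedObject:
    """Hypothetical record for one object found in the intermediary image."""
    label: str
    crop: Any           # image region carrying the object's shape and appearance
    position: Vec3      # position estimated from the intermediary image
    yaw_degrees: float  # pose, simplified here to a single rotation angle
    size: Vec3          # geometry, simplified here to bounding-box dimensions


@dataclass
class PlacedAsset:
    """Hypothetical record for one 3D model placed in the final scene."""
    label: str
    mesh: Any
    position: Vec3
    yaw_degrees: float
    size: Vec3


def build_scene(
    description: str,
    text_to_image: Callable[[str], Any],
    detect_objects: Callable[[Any], List[DetectedObject]],
    image_to_3d: Callable[[Any], Any],
) -> List[PlacedAsset]:
    """Chain the stages: text -> 2D image -> objects -> 3D models -> scene."""
    image = text_to_image(description)   # stage 1: intermediary 2D image
    objects = detect_objects(image)      # stage 2: per-object crops and layout
    scene = []
    for obj in objects:
        mesh = image_to_3d(obj.crop)     # stage 3: 3D model from the crop
        # Stage 4: place the model using layout taken from the same image.
        scene.append(
            PlacedAsset(obj.label, mesh, obj.position, obj.yaw_degrees, obj.size)
        )
    return scene


# Stub stages (assumptions for illustration only; real models would go here).
def fake_text_to_image(description: str) -> dict:
    return {"prompt": description}


def fake_detect_objects(image: Any) -> List[DetectedObject]:
    return [
        DetectedObject("chair", "chair-crop", (1.0, 0.0, 2.0), 90.0, (0.5, 1.0, 0.5))
    ]


def fake_image_to_3d(crop: Any) -> str:
    return f"mesh({crop})"


scene = build_scene(
    "a cozy reading room", fake_text_to_image, fake_detect_objects, fake_image_to_3d
)
```

Passing the stages in as callables keeps the sketch independent of any particular text-to-image, detection, or image-to-3D model, none of which the abstract specifies.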

Comments:	Accepted by CVPR
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2506.00742 [cs.CV]
	(or arXiv:2506.00742v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2506.00742

Submission history

From: Zeqi Gu
[v1] Sat, 31 May 2025 23:03:54 UTC (41,218 KB)
