Computer Science > Graphics

arXiv:2509.20858 (cs)
[Submitted on 25 Sep 2025]

Title: ArchGPT: Understanding the World's Architectures with Large Multimodal Models


Authors: Yuze Wang, Luo Yang, Junyi Wang, Yue Qi
Abstract: Architecture embodies aesthetic, cultural, and historical values, standing as a tangible testament to human civilization. Researchers have long leveraged virtual reality (VR), mixed reality (MR), and augmented reality (AR) to enable immersive exploration and interpretation of architecture, enhancing accessibility, public understanding, and creative workflows around architecture in education, heritage preservation, and professional design practice. However, existing VR/MR/AR systems are often developed case-by-case, relying on hard-coded annotations and task-specific interactions that do not scale across diverse built environments. In this work, we present ArchGPT, a multimodal architectural visual question answering (VQA) model, together with a scalable data-construction pipeline for curating high-quality, architecture-specific VQA annotations. This pipeline yields Arch-300K, a domain-specialized dataset of approximately 315,000 image-question-answer triplets. Arch-300K is built via a multi-stage process: first, we curate architectural scenes from Wikimedia Commons and filter unconstrained tourist photo collections using a novel coarse-to-fine strategy that integrates 3D reconstruction and semantic segmentation to select occlusion-free, structurally consistent architectural images. To mitigate noise and inconsistency in raw textual metadata, we propose an LLM-guided text verification and knowledge-distillation pipeline to generate reliable, architecture-specific question-answer pairs. Using these curated images and refined metadata, we further synthesize formal analysis annotations, including detailed descriptions and aspect-guided conversations, to provide richer semantic variety while remaining faithful to the data. We perform supervised fine-tuning of an open-source multimodal backbone, ShareGPT4V-7B, on Arch-300K, yielding ArchGPT.
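To make the dataset format concrete, below is a minimal Python sketch, not taken from the paper, of how an Arch-300K-style image-question-answer triplet might be represented, together with a placeholder for the coarse-to-fine image filtering stage the abstract describes. All class and function names, and the visibility-ratio heuristic, are hypothetical illustrations rather than the authors' code.

# Hypothetical sketch of an Arch-300K-style VQA record and a stub for the
# coarse-to-fine filtering stage described in the abstract. Names and the
# visibility threshold are assumptions for illustration only.
from dataclasses import dataclass
from typing import List


@dataclass
class ArchVQATriplet:
    """One image-question-answer annotation."""
    image_path: str   # curated, occlusion-free architectural photo
    question: str     # architecture-specific question
    answer: str       # answer derived from LLM-verified textual metadata


def coarse_to_fine_filter(candidates: List[str],
                          min_visible_ratio: float = 0.8) -> List[str]:
    """Keep images whose building is sufficiently visible.

    In the paper, this decision integrates 3D reconstruction and semantic
    segmentation; here a stub score stands in for that analysis.
    """
    def visible_ratio(path: str) -> float:
        return 1.0  # stub: replace with a segmentation/reconstruction-based score

    return [p for p in candidates if visible_ratio(p) >= min_visible_ratio]


if __name__ == "__main__":
    kept = coarse_to_fine_filter(["wikimedia/notre_dame_facade.jpg"])
    sample = ArchVQATriplet(
        image_path=kept[0],
        question="What architectural style does the facade exhibit?",
        answer="A Gothic facade with pointed arches and a rose window.",
    )
    print(sample)

In the actual pipeline, the visibility decision would come from the paper's 3D-reconstruction and semantic-segmentation checks, and the answers from the LLM-guided verification and knowledge-distillation stage, rather than from these stubs.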
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as: arXiv:2509.20858 [cs.GR]
  (or arXiv:2509.20858v1 [cs.GR] for this version)
  https://doi.org/10.48550/arXiv.2509.20858
arXiv-issued DOI via DataCite

Submission history

From: Yuze Wang
[v1] Thu, 25 Sep 2025 07:49:43 UTC (6,568 KB)