Computer Science > Graphics

arXiv:2509.20858 (cs)
[Submitted on 25 Sep 2025]

Title: ArchGPT: Understanding the World's Architectures with Large Multimodal Models


Authors: Yuze Wang, Luo Yang, Junyi Wang, Yue Qi
Abstract: Architecture embodies aesthetic, cultural, and historical values, standing as a tangible testament to human civilization. Researchers have long leveraged virtual reality (VR), mixed reality (MR), and augmented reality (AR) to enable immersive exploration and interpretation of architecture, enhancing accessibility, public understanding, and creative workflows around architecture in education, heritage preservation, and professional design practice. However, existing VR/MR/AR systems are often developed case-by-case, relying on hard-coded annotations and task-specific interactions that do not scale across diverse built environments. In this work, we present ArchGPT, a multimodal architectural visual question answering (VQA) model, together with a scalable data-construction pipeline for curating high-quality, architecture-specific VQA annotations. This pipeline yields Arch-300K, a domain-specialized dataset of approximately 315,000 image-question-answer triplets. Arch-300K is built via a multi-stage process: first, we curate architectural scenes from Wikimedia Commons and filter unconstrained tourist photo collections using a novel coarse-to-fine strategy that integrates 3D reconstruction and semantic segmentation to select occlusion-free, structurally consistent architectural images. To mitigate noise and inconsistency in raw textual metadata, we propose an LLM-guided text verification and knowledge-distillation pipeline to generate reliable, architecture-specific question-answer pairs. Using these curated images and refined metadata, we further synthesize formal analysis annotations, including detailed descriptions and aspect-guided conversations, to provide richer semantic variety while remaining faithful to the data. We perform supervised fine-tuning of an open-source multimodal backbone, ShareGPT4V-7B, on Arch-300K, yielding ArchGPT.
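To make the dataset format concrete, below is a minimal Python sketch, not taken from the paper, of how an Arch-300K-style image-question-answer triplet might be represented, together with a placeholder for the coarse-to-fine image filtering stage the abstract describes. All class and function names, and the visibility-ratio heuristic, are hypothetical illustrations rather than the authors' code.

# Hypothetical sketch of an Arch-300K-style VQA record and a stub for the
# coarse-to-fine filtering stage described in the abstract. Names and the
# visibility threshold are assumptions for illustration only.
from dataclasses import dataclass
from typing import List


@dataclass
class ArchVQATriplet:
    """One image-question-answer annotation."""
    image_path: str   # curated, occlusion-free architectural photo
    question: str     # architecture-specific question
    answer: str       # answer derived from LLM-verified textual metadata


def coarse_to_fine_filter(candidates: List[str],
                          min_visible_ratio: float = 0.8) -> List[str]:
    """Keep images whose building is sufficiently visible.

    In the paper, this decision integrates 3D reconstruction and semantic
    segmentation; here a stub score stands in for that analysis.
    """
    def visible_ratio(path: str) -> float:
        return 1.0  # stub: replace with a segmentation/reconstruction-based score

    return [p for p in candidates if visible_ratio(p) >= min_visible_ratio]


if __name__ == "__main__":
    kept = coarse_to_fine_filter(["wikimedia/notre_dame_facade.jpg"])
    sample = ArchVQATriplet(
        image_path=kept[0],
        question="What architectural style does the facade exhibit?",
        answer="A Gothic facade with pointed arches and a rose window.",
    )
    print(sample)

In the actual pipeline, the visibility decision would come from the paper's 3D-reconstruction and semantic-segmentation checks, and the answers from the LLM-guided verification and knowledge-distillation stage, rather than from these stubs.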
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as: arXiv:2509.20858 [cs.GR]
  (or arXiv:2509.20858v1 [cs.GR] for this version)
  https://doi.org/10.48550/arXiv.2509.20858
arXiv-issued DOI via DataCite

Submission history

From: Yuze Wang
[v1] Thu, 25 Sep 2025 07:49:43 UTC (6,568 KB)