CenXiv.org > cs.MM


Multimedia

Showing new listings for Friday, 26 September 2025

Total of 4 entries

Cross submissions (showing 3 of 3 entries)

[1] arXiv:2509.20724 (cross-list from cs.SI) [cn-pdf, pdf, html, other]
Title: Visual Authority and the Rhetoric of Health Misinformation: A Multimodal Analysis of Social Media Videos
Mohammad Reza Zarei, Barbara Stead-Coyle, Michael Christensen, Sarah Everts, Majid Komeili
Subjects: Social and Information Networks (cs.SI) ; Computation and Language (cs.CL) ; Computer Vision and Pattern Recognition (cs.CV) ; Multimedia (cs.MM)

Short-form video platforms are central sites for health advice, where alternative narratives mix useful, misleading, and harmful content. Rather than adjudicating truth, this study examines how credibility is packaged in nutrition and supplement videos by analyzing the intersection of authority signals, narrative techniques, and monetization. We assemble a cross-platform corpus of 152 public videos from TikTok, Instagram, and YouTube and annotate each on 26 features spanning visual authority, presenter attributes, narrative strategies, and engagement cues. A transparent annotation pipeline integrates automatic speech recognition, principled frame selection, and a multimodal model, with human verification on a stratified subsample showing strong agreement. Descriptively, a confident single presenter in studio or home settings dominates, and clinical contexts are rare. Analytically, authority cues such as titles, slides and charts, and certificates frequently co-occur with persuasive elements (jargon, references, fear or urgency, critiques of mainstream medicine, and conspiracies) and with monetization (sales links and calls to subscribe). References and science-like visuals often travel with emotive and oppositional narratives rather than signaling restraint.
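
As a rough illustration of the pipeline described above, here is a minimal Python sketch assuming Whisper for speech recognition, OpenCV with uniform temporal sampling for frame selection, and a toy stand-in for the multimodal annotator; the authors' actual models, selection strategy, and 26-feature schema are not specified in this listing.

```python
# Minimal sketch of an ASR + frame selection + multimodal annotation
# pipeline. Whisper, OpenCV, uniform sampling, and the toy feature dict
# are illustrative assumptions, not the paper's implementation.
import cv2
import whisper

def transcribe(video_path: str) -> str:
    # Automatic speech recognition over the video's audio track.
    model = whisper.load_model("base")
    return model.transcribe(video_path)["text"]

def select_frames(video_path: str, n_frames: int = 8) -> list:
    # Uniform temporal sampling stands in for "principled frame selection".
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, total, max(1, total // n_frames)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def annotate_video(video_path: str) -> dict:
    transcript = transcribe(video_path)
    frames = select_frames(video_path)
    # Toy stand-in for the multimodal model that scores the 26 features
    # (visual authority, presenter attributes, narratives, engagement).
    return {
        "mentions_references": "study" in transcript.lower(),
        "n_frames_analyzed": len(frames),
    }
```

In the full pipeline, the transcript and selected frames would instead be passed to a vision-language model prompted with the feature schema, with a stratified subsample routed to human verification.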

[2] arXiv:2509.20858 (cross-list from cs.GR) [cn-pdf, pdf, html, other]
Title: ArchGPT: Understanding the World's Architectures with Large Multimodal Models
Yuze Wang, Luo Yang, Junyi Wang, Yue Qi
Subjects: Graphics (cs.GR) ; Computer Vision and Pattern Recognition (cs.CV) ; Multimedia (cs.MM)

Architecture embodies aesthetic, cultural, and historical values, standing as a tangible testament to human civilization. Researchers have long leveraged virtual reality (VR), mixed reality (MR), and augmented reality (AR) to enable immersive exploration and interpretation of architecture, enhancing accessibility, public understanding, and creative workflows in education, heritage preservation, and professional design practice. However, existing VR/MR/AR systems are often developed case by case, relying on hard-coded annotations and task-specific interactions that do not scale across diverse built environments. In this work, we present ArchGPT, a multimodal architectural visual question answering (VQA) model, together with a scalable data-construction pipeline for curating high-quality, architecture-specific VQA annotations. This pipeline yields Arch-300K, a domain-specialized dataset of approximately 315,000 image-question-answer triplets. Arch-300K is built via a multi-stage process: first, we curate architectural scenes from Wikimedia Commons and filter unconstrained tourist photo collections using a novel coarse-to-fine strategy that integrates 3D reconstruction and semantic segmentation to select occlusion-free, structurally consistent architectural images. To mitigate noise and inconsistency in raw textual metadata, we propose an LLM-guided text verification and knowledge-distillation pipeline to generate reliable, architecture-specific question-answer pairs. Using these curated images and refined metadata, we further synthesize formal-analysis annotations, including detailed descriptions and aspect-guided conversations, to provide richer semantic variety while remaining faithful to the data. We perform supervised fine-tuning of an open-source multimodal backbone, ShareGPT4V-7B, on Arch-300K, yielding ArchGPT.
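
The LLM-guided verification step lends itself to a short sketch. Everything below (the prompt wording, the call_llm helper, the chosen question aspects) is an illustrative assumption, not the authors' pipeline:

```python
# Hedged sketch of LLM-guided metadata verification and QA distillation.
# The prompt wording and call_llm stub are assumptions for illustration.
import json

def call_llm(prompt: str) -> str:
    # Stub: wire this to whatever LLM backend is available. Returns an
    # empty list so the sketch runs end to end without a model.
    return "[]"

def distill_qa_pairs(building: str, raw_metadata: str) -> list[dict]:
    prompt = (
        "Verify the following architectural metadata and discard any "
        f"unsupported claims.\nBuilding: {building}\nMetadata: {raw_metadata}\n"
        'Return a JSON list of {"question": ..., "answer": ...} pairs '
        "covering style, period, and structural features."
    )
    return json.loads(call_llm(prompt))
```

Distilled pairs of this form, attached to the filtered Wikimedia Commons images, are the kind of image-question-answer triplets that populate Arch-300K.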

[3] arXiv:2509.21153 (cross-list from cs.CV) [cn-pdf, pdf, html, other]
Title: WAVECLIP: Wavelet Tokenization for Adaptive-Resolution CLIP
Moshe Kimhi, Erez Koifman, Ehud Rivlin, Eli Schwartz, Chaim Baskin
Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI) ; Multimedia (cs.MM)

We introduce WAVECLIP, a single unified model for adaptive-resolution inference in CLIP, enabled by wavelet-based tokenization. WAVECLIP replaces standard patch embeddings with a multi-level wavelet decomposition, enabling the model to process images from coarse to fine while naturally supporting multiple resolutions within the same model. At inference time, the model begins with low-resolution tokens and refines only when needed, using key-value caching and causal cross-level attention to reuse computation, so that only new information is introduced at each refinement step. We evaluate WAVECLIP on zero-shot classification, demonstrating that a simple confidence-based gating mechanism enables adaptive early exits. This allows users to dynamically choose a compute-accuracy trade-off with a single deployed model. Our approach requires only lightweight distillation from a frozen CLIP teacher and achieves competitive accuracy with significant computational savings.
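
The coarse-to-fine gating idea can be sketched in a few lines. This toy version uses pywt for the multi-level decomposition and assumes a classifier `head` that accepts variable-length token vectors and a 0.9 confidence threshold; the actual model operates on learned embeddings with key-value caching and causal cross-level attention rather than raw coefficients:

```python
# Toy coarse-to-fine inference with a confidence-gated early exit.
# pywt supplies the decomposition; `head` (a classifier over variable-
# length token vectors) and the threshold are assumptions.
import numpy as np
import pywt

def classify_adaptive(image: np.ndarray, head, threshold: float = 0.9,
                      levels: int = 3):
    # Multi-level decomposition of a 2D (grayscale) image: coeffs[0] is
    # the coarsest approximation; coeffs[1:] hold per-level detail bands.
    coeffs = pywt.wavedec2(image, "haar", level=levels)
    tokens = coeffs[0].ravel()  # begin with the coarsest tokens
    for details in coeffs[1:]:
        probs = head(tokens)
        if probs.max() >= threshold:   # confident enough: exit early
            return int(probs.argmax()), float(probs.max())
        # Refine with the next level's (horizontal, vertical, diagonal)
        # detail coefficients; the real model reuses cached keys/values.
        tokens = np.concatenate([tokens, *(d.ravel() for d in details)])
    probs = head(tokens)  # full-resolution fallback
    return int(probs.argmax()), float(probs.max())
```

Raising or lowering the threshold trades compute for accuracy with the same deployed model, which is the adaptive behavior the abstract describes.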

Replacement submissions (showing 1 of 1 entries)

[4] arXiv:2503.06677 (replaced) [cn-pdf, pdf, html, other]
Title: REArtGS: Reconstructing and Generating Articulated Objects via 3D Gaussian Splatting with Geometric and Motion Constraints
Di Wu, Liu Liu, Zhou Linli, Anran Huang, Liangtu Song, Qiaojun Yu, Qi Wu, Cewu Lu
Comments: 11 pages, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Multimedia (cs.MM)

Articulated objects are prevalent in everyday life, and their 3D representations play crucial roles across various applications. However, achieving both high-fidelity textured surface reconstruction and dynamic generation for articulated objects remains challenging for existing methods. In this paper, we present REArtGS, a novel framework that introduces additional geometric and motion constraints to 3D Gaussian primitives, enabling realistic surface reconstruction and generation for articulated objects. Specifically, given multi-view RGB images of two arbitrary states of an articulated object, we first introduce an unbiased Signed Distance Field (SDF) guidance to regularize Gaussian opacity fields, enhancing geometric constraints and improving surface reconstruction quality. We then establish deformable fields for the 3D Gaussians, constrained by the kinematic structure of the articulated object, achieving unsupervised generation of surface meshes in unseen states. Extensive experiments on both synthetic and real datasets demonstrate that our approach achieves high-quality textured surface reconstruction for given states and enables high-fidelity surface generation for unseen states. Project site: https://sites.google.com/view/reartgs/home.
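
As a hedged sketch of how an SDF signal can regularize Gaussian opacities, the snippet below uses a Gaussian-bump mapping from signed distance to target opacity; the mapping and the beta scale are assumptions for illustration, not necessarily the paper's unbiased formulation:

```python
# Hedged sketch of an SDF-guided opacity regularizer for 3D Gaussians.
# The Gaussian-bump mapping from signed distance to target opacity and
# the beta scale are assumptions, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def sdf_opacity_loss(opacity: torch.Tensor, sdf: torch.Tensor,
                     beta: float = 0.01) -> torch.Tensor:
    # Target opacity peaks on the SDF zero level set (the surface) and
    # decays away from it, pushing off-surface Gaussians to transparency.
    target = torch.exp(-(sdf / beta) ** 2)
    return F.mse_loss(opacity, target)
```

A constraint of this kind pushes off-surface Gaussians toward transparency, which is one way geometric guidance can sharpen reconstructed surfaces.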
