Multimedia
View recent articles
Showing new listings for Friday, 5 September 2025
- [1] arXiv:2509.03565 (cross-list from cs.CL) [Chinese pdf, pdf, html, other]
Title: ResearchPulse: Building Method-Experiment Chains through Multi-Document Scientific Inference
Comments: Accepted by ACM MM 2025
Subjects: Computation and Language (cs.CL); Multimedia (cs.MM)
Understanding how scientific ideas evolve requires more than summarizing individual papers; it demands structured, cross-document reasoning over thematically related research. In this work, we formalize multi-document scientific inference, a new task that extracts and aligns motivation, methodology, and experimental results across related papers to reconstruct research development chains. This task introduces key challenges, including temporally aligning loosely structured methods and standardizing heterogeneous experimental tables. We present ResearchPulse, an agent-based framework that integrates instruction planning, scientific content extraction, and structured visualization. It consists of three coordinated agents: a Plan Agent for task decomposition, a Mmap-Agent that constructs motivation-method mind maps, and a Lchart-Agent that synthesizes experimental line charts. To support this task, we introduce ResearchPulse-Bench, a citation-aware benchmark of annotated paper clusters. Experiments show that our system, despite using 7B-scale agents, consistently outperforms strong baselines like GPT-4o in semantic alignment, structural consistency, and visual fidelity. The dataset is available at https://huggingface.co/datasets/ResearchPulse/ResearchPulse-Bench.
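As a rough illustration of how a three-agent coordination loop like the one described above could be wired, the following Python sketch uses hypothetical stand-in functions; the `Paper` record, `plan_agent`, `mmap_agent`, and `lchart_agent` are illustrative assumptions, not ResearchPulse's actual implementation or prompts.

```python
from dataclasses import dataclass, field

@dataclass
class Paper:
    arxiv_id: str
    motivation: str = ""
    method: str = ""
    results: dict = field(default_factory=dict)

def plan_agent(papers):
    """Hypothetical task decomposition: one extraction sub-task per paper,
    then two synthesis sub-tasks (mind map, line chart)."""
    tasks = [("extract", p.arxiv_id) for p in papers]
    tasks += [("mindmap", "all"), ("linechart", "all")]
    return tasks

def mmap_agent(papers):
    """Toy motivation-method mind map: one node per paper, edges follow
    chronological order of the arXiv ids."""
    ordered = sorted(papers, key=lambda p: p.arxiv_id)
    nodes = [{"id": p.arxiv_id, "motivation": p.motivation, "method": p.method}
             for p in ordered]
    edges = [(a.arxiv_id, b.arxiv_id) for a, b in zip(ordered, ordered[1:])]
    return {"nodes": nodes, "edges": edges}

def lchart_agent(papers, metric):
    """Toy line-chart synthesis: collect one metric per paper over time."""
    ordered = sorted(papers, key=lambda p: p.arxiv_id)
    return [(p.arxiv_id, p.results.get(metric)) for p in ordered]

if __name__ == "__main__":
    # Dummy placeholder papers and numbers, purely to show the data flow.
    cluster = [
        Paper("2301.00001", "reduce hallucination", "retrieval augmentation",
              {"accuracy": 71.2}),
        Paper("2405.00002", "improve faithfulness", "self-consistency decoding",
              {"accuracy": 74.8}),
    ]
    print(plan_agent(cluster))
    print(mmap_agent(cluster))
    print(lchart_agent(cluster, "accuracy"))
```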
- [2] arXiv:2509.03678 (cross-list from cs.HC) [Chinese pdf, pdf, other]
Title: Promisedland: An XR Narrative Attraction Integrating Diorama-to-Virtual Workflow and Elemental Storytelling
Comments: Accepted to the Proceedings of the 11th International Conference on Virtual Reality (ICVR 2025). ISBN: 979-8-3503-9272-2. © 2025 IEEE. This is the author's accepted manuscript; the final version will be available via IEEE Xplore.
Subjects: Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Promisedland is a mixed-reality (MR) narrative attraction that combines cultural storytelling, ecological education, and an innovative hybrid production workflow. Set in a future Earth suffering from elemental imbalance, the experience guides users, through symbolic characters, on an interactive journey to restore harmony by collecting the five classical elements: metal, wood, water, fire, and earth. To prototype this experience, we introduce a low-cost, high-fidelity Diorama-to-Virtual pipeline: handcrafting physical scale models, 3D-scanning them, and integrating them into Unreal Engine. This process enables rapid spatial prototyping while preserving the material expressiveness and narrative consistency of the physical environment. To further enhance immersion, the experience incorporates a Stewart Platform that provides motion feedback synchronized with the virtual ride dynamics, reinforcing spatial presence and embodied engagement. The final prototype runs on Meta Quest, supporting dynamic interactions and real-time visual feedback. Promisedland offers a replicable design blueprint for future XR narrative installations across museums, cultural exhibitions, and themed entertainment. It proposes a new framework for XR Narrative Attractions in which physical and digital elements converge to deepen immersion, agency, and emotional engagement.
- [3] arXiv:2509.03692 (cross-list from cs.IR) [Chinese pdf, pdf, html, other]
Title: lifeXplore at the Lifelog Search Challenge 2021
Subjects: Information Retrieval (cs.IR); Multimedia (cs.MM)
Since its first iteration in 2018, the Lifelog Search Challenge (LSC) has continued to rise in popularity as an interactive lifelog data retrieval competition, co-located with the ACM International Conference on Multimedia Retrieval (ICMR). The goal of this annual live event is to search a large corpus of lifelogging data for specifically announced memories, using a purposefully developed tool, within a limited amount of time. As long-standing participants, we present our improved lifeXplore, a retrieval system combining chronological day-summary browsing with interactive, combinable concept filtering. Compared to previous versions, the tool is improved by incorporating temporal queries, advanced day-summary features, and usability improvements.
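A minimal sketch of the kind of combinable concept filtering plus temporal querying the abstract describes might look like the following; the data model and field names are assumptions for illustration, not lifeXplore's actual index.

```python
from dataclasses import dataclass
from datetime import datetime, time

@dataclass
class LifelogImage:
    timestamp: datetime
    concepts: set[str]   # detected concept labels, e.g. {"car", "coffee"}

def filter_images(images, required_concepts, start=None, end=None):
    """Keep images containing all required concepts, optionally restricted
    to a time-of-day window (a simple stand-in for a temporal query)."""
    hits = []
    for img in images:
        if not required_concepts <= img.concepts:
            continue
        t = img.timestamp.time()
        if start is not None and t < start:
            continue
        if end is not None and t > end:
            continue
        hits.append(img)
    return hits

def day_summary(images, per_day=5):
    """Group hits by day and keep a few representatives per day."""
    days = {}
    for img in sorted(images, key=lambda i: i.timestamp):
        days.setdefault(img.timestamp.date(), []).append(img)
    return {d: imgs[:per_day] for d, imgs in days.items()}

if __name__ == "__main__":
    logs = [
        LifelogImage(datetime(2021, 3, 1, 8, 30), {"coffee", "kitchen"}),
        LifelogImage(datetime(2021, 3, 1, 9, 10), {"car", "street"}),
        LifelogImage(datetime(2021, 3, 2, 8, 45), {"coffee", "office"}),
    ]
    hits = filter_images(logs, {"coffee"}, start=time(7, 0), end=time(10, 0))
    print(day_summary(hits))
```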
- [4] arXiv:2509.03693 (cross-list from cs.HC) [Chinese pdf, pdf, html, other]
Title: Designing Effective AI Explanations for Misinformation Detection: A Comparative Study of Content, Social, and Combined Explanations
Comments: To appear at CSCW 2025
Subjects: Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
In this paper, we study the problem of AI explanation of misinformation, where the goal is to identify explanation designs that help improve users' misinformation detection abilities and their overall user experience. Our work is motivated by the limitations of current Explainable AI (XAI) approaches, which predominantly focus on content explanations that elucidate the linguistic features and sentence structures of the misinformation. To address this limitation, we explore explanations beyond content, such as a "social explanation" that considers the broader social context surrounding the misinformation, as well as a "combined explanation" in which the content and social explanations are presented together in scenarios where they are either aligned or misaligned with each other. To evaluate the comparative effectiveness of these AI explanations, we conduct two online crowdsourcing experiments in the COVID-19 (Study 1 on Prolific) and Politics (Study 2 on MTurk) domains. Our results show that AI explanations are generally effective in helping users detect misinformation, with effectiveness significantly influenced by the alignment between the content and social explanations. We also find that the order in which explanation types are presented, specifically whether a content or a social explanation comes first, can influence detection accuracy, with differences found between the COVID-19 and Politics domains. This work contributes towards more effective design of AI explanations, fostering a deeper understanding of how different explanation types and their combinations influence misinformation detection.
- [5] arXiv:2509.03883 (cross-list from cs.CV) [Chinese pdf, pdf, html, other]
Title: Human Motion Video Generation: A Survey
Authors: Haiwei Xue, Xiangyang Luo, Zhanghao Hu, Xin Zhang, Xunzhi Xiang, Yuqin Dai, Jianzhuang Liu, Zhensong Zhang, Minglei Li, Jian Yang, Fei Ma, Zhiyong Wu, Changpeng Yang, Zonghong Dai, Fei Richard Yu
Comments: Accepted by TPAMI. GitHub repository: https://github.com/Winn1y/Awesome-Human-Motion-Video-Generation IEEE Access: https://ieeexplore.ieee.org/document/11106267
Journal-ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Human motion video generation has garnered significant research interest due to its broad applications, enabling innovations such as photorealistic singing heads or dynamic avatars that seamlessly dance to music. However, existing surveys in this field focus on individual methods, lacking a comprehensive overview of the entire generative process. This paper addresses this gap by providing an in-depth survey of human motion video generation, encompassing over ten sub-tasks, and detailing the five key phases of the generation process: input, motion planning, motion video generation, refinement, and output. Notably, this is the first survey that discusses the potential of large language models in enhancing human motion video generation. Our survey reviews the latest developments and technological trends in human motion video generation across three primary modalities: vision, text, and audio. By covering over two hundred papers, we offer a thorough overview of the field and highlight milestone works that have driven significant technological breakthroughs. Our goal for this survey is to unveil the prospects of human motion video generation and serve as a valuable resource for advancing the comprehensive applications of digital humans. A complete list of the models examined in this survey is available in our repository: https://github.com/Winn1y/Awesome-Human-Motion-Video-Generation.
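To make the five-phase decomposition concrete, here is a purely schematic pipeline skeleton with dummy stage implementations; the stage signatures are assumptions for illustration, not any surveyed system's API.

```python
from typing import Callable

def run_pipeline(inputs: dict,
                 plan_motion: Callable[[dict], dict],
                 generate_video: Callable[[dict, dict], list],
                 refine: Callable[[list], list]) -> list:
    """Chain the survey's five phases: input -> motion planning ->
    motion video generation -> refinement -> output (a list of frames)."""
    motion = plan_motion(inputs)           # e.g. pose sequence from audio/text
    frames = generate_video(inputs, motion)
    frames = refine(frames)                # e.g. face/hand enhancement, interpolation
    return frames                          # output phase: frames ready for encoding

if __name__ == "__main__":
    # Dummy stage implementations just to show the data flow.
    demo = run_pipeline(
        {"reference_image": "person.png", "audio": "song.wav"},
        plan_motion=lambda x: {"poses": [0, 1, 2]},
        generate_video=lambda x, m: [f"frame_{p}" for p in m["poses"]],
        refine=lambda fs: fs,
    )
    print(demo)
```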
- [6] arXiv:2509.04086 (cross-list from cs.CV) [Chinese pdf, pdf, html, other]
Title: TEn-CATS: Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
The Audio-Visual Video Parsing (AVVP) task aims to identify event categories and their occurrence times in a given video using weakly supervised labels. Existing methods typically fall into two categories: (i) designing enhanced attention-based architectures for better temporal modeling, and (ii) generating richer pseudo-labels to compensate for the absence of frame-level annotations. However, methods of the first type treat noisy segment-level pseudo-labels as reliable supervision, while those of the second type let indiscriminate attention spread them across all frames; in both cases, the initial errors are repeatedly amplified during training. To address this issue, we propose a method that combines a Bi-Directional Text Fusion (BiT) module and a Category-Aware Temporal Graph (CATS) module, integrating the strengths and complementarity of the two previous research directions. We first perform semantic injection and dynamic calibration on the audio and visual modality features through the BiT module, to locate and purify cleaner and richer semantic cues. We then leverage the CATS module for semantic propagation and connection, enabling precise dissemination of semantic information across time. Experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance on multiple key metrics across two benchmark datasets, LLP and UnAV-100.
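As a loose illustration of the category-aware temporal graph idea (connect segments that share a predicted category, then smooth their features over that graph), the sketch below is an assumption-laden interpretation rather than the paper's actual CATS formulation.

```python
import torch

def category_aware_propagation(feats: torch.Tensor,
                               probs: torch.Tensor,
                               threshold: float = 0.5,
                               alpha: float = 0.5) -> torch.Tensor:
    """feats: (T, D) segment features; probs: (T, C) per-segment category scores.
    Connect two segments if they share at least one category above `threshold`,
    then mix each segment's feature with the mean of its neighbors."""
    active = probs > threshold                        # (T, C) boolean
    adj = (active.float() @ active.float().t()) > 0   # share >= 1 active category
    adj = adj & ~torch.eye(adj.size(0), dtype=torch.bool)  # drop self-loops
    deg = adj.float().sum(dim=1, keepdim=True).clamp(min=1.0)
    neighbor_mean = adj.float() @ feats / deg
    has_neighbor = adj.any(dim=1, keepdim=True).float()
    return feats + alpha * has_neighbor * (neighbor_mean - feats)

if __name__ == "__main__":
    T, D, C = 6, 16, 4
    feats = torch.randn(T, D)
    probs = torch.sigmoid(torch.randn(T, C))
    print(category_aware_propagation(feats, probs).shape)  # torch.Size([6, 16])
```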
- [7] arXiv:2509.04215 (cross-list from cs.SD) [Chinese pdf, pdf, html, other]
Title: PianoBind: A Multimodal Joint Embedding Model for Pop-piano Music
Comments: Accepted for publication at the 26th International Society for Music Information Retrieval Conference (ISMIR 2025)
Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Multimedia (cs.MM)
Solo piano music, despite being a single-instrument medium, possesses significant expressive capabilities, conveying rich semantic information across genres, moods, and styles. However, current general-purpose music representation models, predominantly trained on large-scale datasets, often struggle to capture subtle semantic distinctions within homogeneous solo piano music. Furthermore, existing piano-specific representation models are typically unimodal, failing to capture the inherently multimodal nature of piano music, which is expressed through audio, symbolic, and textual modalities. To address these limitations, we propose PianoBind, a piano-specific multimodal joint embedding model. We systematically investigate strategies for multi-source training and modality utilization within a joint embedding framework optimized for capturing fine-grained semantic distinctions in (1) small-scale and (2) homogeneous piano datasets. Our experimental results demonstrate that PianoBind learns multimodal representations that effectively capture subtle nuances of piano music, achieving superior text-to-music retrieval performance on in-domain and out-of-domain piano datasets compared to general-purpose music joint embedding models. Moreover, our design choices offer reusable insights for multimodal representation learning with homogeneous datasets beyond piano music.
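A generic, CLIP-style contrastive joint embedding over audio, symbolic, and text features gives a flavor of the setup; this is a minimal sketch under assumed feature dimensions, not PianoBind's actual architecture or training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedder(nn.Module):
    """Project three modality-specific feature vectors into a shared space."""
    def __init__(self, audio_dim=128, symbolic_dim=64, text_dim=256, embed_dim=128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.symbolic_proj = nn.Linear(symbolic_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, audio, symbolic, text):
        return (F.normalize(self.audio_proj(audio), dim=-1),
                F.normalize(self.symbolic_proj(symbolic), dim=-1),
                F.normalize(self.text_proj(text), dim=-1))

def clip_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE between two batches of aligned embeddings."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    model = JointEmbedder()
    audio = torch.randn(8, 128)      # placeholder pre-extracted audio features
    symbolic = torch.randn(8, 64)    # placeholder MIDI/pianoroll features
    text = torch.randn(8, 256)       # placeholder caption features
    za, zs, zt = model(audio, symbolic, text)
    # Pairwise alignment across the three modalities.
    loss = clip_loss(za, zt) + clip_loss(za, zs) + clip_loss(zs, zt)
    loss.backward()
    print(float(loss))
```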
- [8] arXiv:2509.04448 (cross-list from cs.CV) [Chinese pdf, pdf, html, other]
Title: TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection
Comments: EMNLP 2025; project page: https://yanzehong.github.io/trust-vl/
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model's ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.
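A toy sketch of the joint-training idea, mixing balanced batches across distortion-type datasets so that no single task dominates an update; the dataset names and sample fields here are hypothetical, not TRUST-Instruct's schema.

```python
import random

def mixed_batches(datasets: dict, batch_size: int, steps: int, seed: int = 0):
    """Yield batches that draw an equal share of samples from each
    distortion-type dataset, encouraging knowledge sharing across tasks."""
    rng = random.Random(seed)
    per_task = max(1, batch_size // len(datasets))
    for _ in range(steps):
        batch = []
        for name, samples in datasets.items():
            picks = rng.sample(samples, k=min(per_task, len(samples)))
            batch.extend({"task": name, **s} for s in picks)
        rng.shuffle(batch)
        yield batch

if __name__ == "__main__":
    # Dummy examples standing in for instruction-tuning samples.
    data = {
        "textual_distortion": [{"claim": "...", "label": 1}] * 10,
        "visual_distortion": [{"image": "img.png", "label": 0}] * 10,
        "cross_modal_distortion": [{"claim": "...", "image": "img.png", "label": 1}] * 10,
    }
    for batch in mixed_batches(data, batch_size=6, steps=2):
        print([ex["task"] for ex in batch])
```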
Cross submissions (showing 8 of 8 entries)
- [9] arXiv:2406.13923 (replaced) [Chinese pdf, pdf, html, other]
Title: PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents
Authors: Junjie Wang, Yuxiang Zhang, Minghao Liu, Yin Zhang, Yatai Ji, Weihao Xuan, Nie Lin, Kang Zhu, Zhiqiang Lin, Yiming Ren, Chunyang Jiang, Yiyao Yu, Zekun Wang, Tiezhen Wang, Wenhao Huang, Jie Fu, Qunshu Liu, Yujiu Yang, Ge Zhang, Ruibin Yuan, Bei Chen, Wenhu Chen
Comments: Technical report v1.0
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Recent advancements in large multimodal models (LMMs) have leveraged extensive multimodal datasets to enhance capabilities in complex knowledge-driven tasks. However, persistent challenges in perceptual and reasoning errors limit their efficacy, particularly in interpreting intricate visual data and deducing multimodal relationships. To address these issues, we introduce PIN (Paired and INterleaved multimodal documents), a novel data format designed to foster a deeper integration of visual and textual knowledge. The PIN format uniquely combines semantically rich Markdown files, which preserve fine-grained textual structures, with holistic overall images that capture the complete document layout. Following this format, we construct and release two large-scale, open-source datasets: PIN-200M (~200 million documents) and PIN-14M (~14 million), compiled from diverse web and scientific sources in both English and Chinese. To maximize usability, we provide detailed statistical analyses and equip the datasets with quality signals, enabling researchers to easily filter and select data for specific tasks. Our work provides the community with a versatile data format and substantial resources, offering a foundation for new research in pre-training strategies and the development of more powerful knowledge-intensive LMMs.
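A hypothetical record structure and quality-based filter illustrating the PIN pairing of a fine-grained Markdown file with an overall page image; the field names are illustrative assumptions, not the dataset's published schema.

```python
from dataclasses import dataclass

@dataclass
class PINDocument:
    doc_id: str
    markdown: str          # fine-grained textual structure of the document
    overall_image: str     # path to the rendered full-page image
    language: str          # "en" or "zh"
    quality_score: float   # assumed scalar quality signal in [0, 1]

def filter_documents(docs, language=None, min_quality=0.0):
    """Select documents by language and a minimum quality threshold."""
    return [d for d in docs
            if (language is None or d.language == language)
            and d.quality_score >= min_quality]

if __name__ == "__main__":
    docs = [
        PINDocument("doc-0001", "# Title\nBody text...", "doc-0001.png", "en", 0.92),
        PINDocument("doc-0002", "# 标题\n正文……", "doc-0002.png", "zh", 0.40),
    ]
    print([d.doc_id for d in filter_documents(docs, language="en", min_quality=0.5)])
```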
- [10] arXiv:2503.23746 (replaced) [Chinese pdf, pdf, html, other]
Title: Short-video Propagation Influence Rating: A New Real-world Dataset and A New Large Graph Model
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM); Social and Information Networks (cs.SI)
Short-video platforms have gained immense popularity, captivating the interest of millions, if not billions, of users globally. Recently, researchers have highlighted the significance of analyzing the propagation of short videos, which typically involves discovering commercial value, public opinion, user behavior, etc. This paper proposes a new Short-video Propagation Influence Rating (SPIR) task and aims to promote SPIR from both the dataset and method perspectives. First, we propose a new Cross-platform Short-Video (XS-Video) dataset, which provides a large-scale and real-world short-video propagation network across various platforms to facilitate research on short-video propagation. Our XS-Video dataset includes 117,720 videos, 381,926 samples, and 535 topics across the 5 largest Chinese platforms, annotated with propagation influence levels from 0 to 9. To the best of our knowledge, this is the first large-scale short-video dataset that contains cross-platform data or provides all of the views, likes, shares, collects, fans, comments, and comment content. Second, we propose a Large Graph Model (LGM) named NetGPT, based on a novel three-stage training mechanism, to bridge heterogeneous graph-structured data with the powerful reasoning ability and knowledge of Large Language Models (LLMs). NetGPT can comprehend and analyze the short-video propagation graph, enabling it to predict the long-term propagation influence of short videos. Comprehensive experimental results, evaluated with both classification and regression metrics on our XS-Video dataset, indicate the superiority of our method for SPIR.
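Since influence levels are ordinal labels from 0 to 9 and the paper reports both classification and regression metrics, a small evaluation sketch could score predictions under both views; the specific metric choices here are illustrative, not necessarily the paper's exact protocol.

```python
def evaluate_spir(y_true, y_pred):
    """Treat influence levels 0-9 both as classes (accuracy) and as
    ordinal values (mean absolute error / mean squared error)."""
    assert len(y_true) == len(y_pred) and y_true
    n = len(y_true)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    return {"accuracy": acc, "mae": mae, "mse": mse}

if __name__ == "__main__":
    # Toy labels and predictions, purely to demonstrate the metrics.
    gold = [0, 3, 7, 9, 5]
    pred = [0, 4, 7, 8, 5]
    print(evaluate_spir(gold, pred))
```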
- [11] arXiv:2506.07634 (replaced) [Chinese pdf, pdf, html, other]
Title: SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement
Comments: Submitted to NeurIPS 2025
Subjects: Audio and Speech Processing (eess.AS); Multimedia (cs.MM)
Generating music with a coherent structure and harmonious instrumental and vocal elements remains a significant challenge in song generation. Existing language-model and diffusion-based methods often struggle to balance global coherence with local fidelity, resulting in outputs that lack musicality or suffer from incoherent progression and mismatched lyrics. This paper introduces $\textbf{SongBloom}$, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. SongBloom employs an autoregressive diffusion model that combines the high fidelity of diffusion models with the scalability of language models. Specifically, it gradually extends a musical sketch from short to long and refines the details from coarse to fine-grained. The interleaved generation paradigm effectively integrates prior semantic and acoustic context to guide the generation process. Experimental results demonstrate that SongBloom outperforms existing methods across both subjective and objective metrics and achieves performance comparable to state-of-the-art commercial music generation platforms. Audio samples are available on our demo page: https://cypress-yang.github.io/SongBloom_demo. The code and model weights have been released at https://github.com/Cypress-Yang/SongBloom.
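A schematic of the interleaved loop, alternating autoregressive sketch extension with diffusion-style refinement, using dummy stand-in components rather than the actual SongBloom models.

```python
import torch

def interleaved_generation(sketch_step, refine_step, num_segments: int,
                           seg_len: int = 16, dim: int = 8) -> torch.Tensor:
    """Alternate two stages: (1) autoregressively extend a coarse sketch by one
    segment, conditioned on everything generated so far; (2) refine that segment
    into fine-grained acoustics with a diffusion-style denoiser."""
    sketch = torch.zeros(0, dim)    # coarse semantic track, grows short-to-long
    acoustic = torch.zeros(0, dim)  # fine-grained acoustic track
    for _ in range(num_segments):
        new_sketch = sketch_step(sketch, acoustic, seg_len)   # AR extension
        new_acoustic = refine_step(new_sketch, acoustic)      # refinement
        sketch = torch.cat([sketch, new_sketch], dim=0)
        acoustic = torch.cat([acoustic, new_acoustic], dim=0)
    return acoustic

if __name__ == "__main__":
    # Dummy components: random "sketch" tokens and a denoiser that lightly
    # perturbs them, standing in for the real sketching and diffusion modules.
    dummy_sketch = lambda s, a, n: torch.randn(n, 8)
    dummy_refine = lambda ns, a: ns + 0.1 * torch.randn_like(ns)
    song = interleaved_generation(dummy_sketch, dummy_refine, num_segments=4)
    print(song.shape)  # torch.Size([64, 8])
```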