Multimedia

Authors and titles for June 2025

Total of 153 entries (showing 76-100)

[76] arXiv:2506.08200 (cross-list from cs.HC)

Title: AffectMachine-Pop: A controllable expert system for real-time pop music generation

Kat R. Agres, Adyasha Dash, Phoebe Chua, Stefan K. Ehrlich

Subjects: Human-Computer Interaction (cs.HC) ; Multimedia (cs.MM)
[77] arXiv:2506.08493 (cross-list from cs.CV)

Title: Context-aware TFL: A Universal Context-aware Contrastive Learning Framework for Temporal Forgery Localization

Qilin Yin, Wei Lu, Xiangyang Luo, Xiaochun Cao

Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Multimedia (cs.MM)
[78] arXiv:2506.08524 (cross-list from cs.SD)

Title: Teaching Physical Awareness to LLMs through Sounds

Weiguo Wang, Andy Nie, Wenrui Zhou, Yi Kai, Chengchen Hu

Comments: ICML 2025

Subjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI) ; Multimedia (cs.MM) ; Robotics (cs.RO) ; Audio and Speech Processing (eess.AS)
[79] arXiv:2506.08591 (cross-list from cs.CV)

Title: Diversity-Guided MLP Reduction for Efficient Large Vision Transformers

Chengchao Shen, Hourun Zhu, Gongfan Fang, Jianxin Wang, Xinchao Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Machine Learning (cs.LG) ; Multimedia (cs.MM)
[80] arXiv:2506.09650 (cross-list from cs.CV)

Title: HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios

Kunyu Peng, Junchao Huang, Xiangsheng Huang, Di Wen, Junwei Zheng, Yufan Chen, Kailun Yang, Jiamin Wu, Chongqing Hao, Rainer Stiefelhagen

Comments: The code is available at https://github.com/KPeng9510/HopaDIFF.git

Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Machine Learning (cs.LG) ; Multimedia (cs.MM) ; Robotics (cs.RO) ; Image and Video Processing (eess.IV)
[81] arXiv:2506.09792 (cross-list from cs.SD)

Title: Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction

Wenxuan Wu, Shuai Wang, Xixin Wu, Helen Meng, Haizhou Li

Comments: Accepted by Interspeech 2025

Subjects: Sound (cs.SD) ; Machine Learning (cs.LG) ; Multimedia (cs.MM) ; Audio and Speech Processing (eess.AS)
[82] arXiv:2506.09999 (cross-list from cs.LG)

Title: Leveraging Pre-Trained Models for Multimodal Class-Incremental Learning under Adaptive Fusion

Yukun Chen, Zihuan Qiu, Fanman Meng, Hongliang Li, Linfeng Xu, Qingbo Wu

Subjects: Machine Learning (cs.LG) ; Multimedia (cs.MM) ; Sound (cs.SD) ; Audio and Speech Processing (eess.AS)
[83] arXiv:2506.10005 (cross-list from cs.CV)

Title: Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models

Sridhar S, Nithin A, Shakeel Rifath, Vasantha Raj

Comments: 10 pages, seven figures about Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models

Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL) ; Graphics (cs.GR) ; Multimedia (cs.MM)
[84] arXiv:2506.10009 (cross-list from eess.IV)

Title: The Iris File Extension

Ryan Erik Landvater, Michael David Olp, Mustafa Yousif, Ulysses Balis

Comments: 17 pages, 7 figures

Subjects: Image and Video Processing (eess.IV) ; Multimedia (cs.MM)
[85] arXiv:2506.10452 (cross-list from cs.CV)

Title: Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts

Guowei Zhong, Ruohong Huan, Mingzhen Wu, Ronghua Liang, Peng Chen

Comments: Submitted to TAC. The code is available at https://github.com/gw-zhong/CIDer

Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Computation and Language (cs.CL) ; Machine Learning (cs.LG) ; Multimedia (cs.MM)
[86] arXiv:2506.10574 (cross-list from cs.CV)

Title: DanceChat: Large Language Model-Guided Music-to-Dance Generation

Qing Wang, Xiaohang Yang, Yilan Dong, Naveen Raj Govindaraj, Gregory Slabaugh, Shanxin Yuan

Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Multimedia (cs.MM) ; Sound (cs.SD) ; Audio and Speech Processing (eess.AS)
[87] arXiv:2506.10857 (cross-list from cs.CV)

Title: VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang

Comments: ICCV2025

Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI) ; Multimedia (cs.MM)
[88] arXiv:2506.10932 (cross-list from cs.HC)

Title: Video-Mediated Emotion Disclosure: Expressions of Fear, Sadness, and Joy by People with Schizophrenia on YouTube

Jiaying Lizzy Liu, Yan Zhang

Comments: 10 pages

Journal-ref: ASIS&T 2025

Subjects: Human-Computer Interaction (cs.HC) ; Computers and Society (cs.CY) ; Multimedia (cs.MM)
[89] arXiv:2506.10941 (cross-list from cs.CV)

Title: VINCIE: Unlocking In-context Image Editing from Video

Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, Lu Jiang

Comments: Project page: https://vincie2025.github.io/

Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL) ; Machine Learning (cs.LG) ; Multimedia (cs.MM)
[90] arXiv:2506.11036 (cross-list from cs.LG)

Title: Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification

Yang Qin, Chao Chen, Zhihang Fu, Dezhong Peng, Xi Peng, Peng Hu

Subjects: Machine Learning (cs.LG) ; Multimedia (cs.MM)
[91] arXiv:2506.11521 (cross-list from cs.CR)

Title: Investigating Vulnerabilities and Defenses Against Audio-Visual Attacks: A Comprehensive Survey Emphasizing Multimodal Models

Jinming Wen, Xinyi Wu, Shuai Zhao, Yanhao Jia, Yuwen Li

Subjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI) ; Multimedia (cs.MM)
[92] arXiv:2506.11737 (cross-list from cs.CV)

Title: Quizzard@INOVA Challenge 2025 -- Track A: Plug-and-Play Technique in Interleaved Multi-Image Model

Dinh Viet Cuong, Hoang-Bao Le, An Pham Ngoc Nguyen, Liting Zhou, Cathal Gurrin

Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Computation and Language (cs.CL) ; Multimedia (cs.MM)
[93] arXiv:2506.11934 (cross-list from cs.SI)

Title: Temporal Dynamics of Emotions in Italian Online Soccer Fandoms

Salvatore Citraro, Giovanni Mauro, Emanuele Ferragina

Subjects: Social and Information Networks (cs.SI) ; Multimedia (cs.MM)
[94] arXiv:2506.12269 (cross-list from eess.IV)

Title: ICME 2025 Grand Challenge on Video Super-Resolution for Video Conferencing

Babak Naderi, Ross Cutler, Juhee Cho, Nabakumar Khongbantabam, Dejan Ivkovic

Subjects: Image and Video Processing (eess.IV) ; Computer Vision and Pattern Recognition (cs.CV) ; Multimedia (cs.MM)
[95] arXiv:2506.12573 (cross-list from cs.SD)

Title: Video-Guided Text-to-Music Generation Using Public Domain Movie Collections

Haven Kim, Zachary Novack, Weihan Xu, Julian McAuley, Hao-Wen Dong

Comments: ISMIR 2025 regular paper. Dataset, code, and demo available at https://havenpersona.github.io/ossl-v1

Subjects: Sound (cs.SD) ; Multimedia (cs.MM) ; Audio and Speech Processing (eess.AS)
[96] arXiv:2506.12935 (cross-list from cs.CL)

Title: SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models

Xingjian Diao, Chunhui Zhang, Keyi Kong, Weiyi Wu, Chiyu Ma, Zhongyu Ouyang, Peijun Qing, Soroush Vosoughi, Jiang Gui

Comments: Accepted to EMNLP 2025 Main Conference (Oral Presentation)

Subjects: Computation and Language (cs.CL) ; Multimedia (cs.MM) ; Sound (cs.SD) ; Audio and Speech Processing (eess.AS)
[97] arXiv:2506.13001 (cross-list from cs.SD)

Title: Personalizable Long-Context Symbolic Music Infilling with MIDI-RWKV

Christian Zhou-Zheng, Philippe Pasquier

Subjects: Sound (cs.SD) ; Machine Learning (cs.LG) ; Multimedia (cs.MM) ; Audio and Speech Processing (eess.AS)
[98] arXiv:2506.13038 (cross-list from cs.CV)

Title: HKD4VLM: A Progressive Hybrid Knowledge Distillation Framework for Robust Multimodal Hallucination and Factuality Detection in VLMs

Zijian Zhang, Xuecheng Wu, Danlei Huang, Siyu Yan, Chong Peng, Xuezhi Cao

Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Multimedia (cs.MM)
[99] arXiv:2506.13971 (cross-list from eess.AS)

Title: Multimodal Fusion with Semi-Supervised Learning Minimizes Annotation Quantity for Modeling Videoconference Conversation Experience

Andrew Chang, Chenkai Hu, Ji Qi, Zhuojian Wei, Kexin Zhang, Viswadruth Akkaraju, David Poeppel, Dustin Freeman

Comments: Interspeech 2025

Subjects: Audio and Speech Processing (eess.AS) ; Computation and Language (cs.CL) ; Human-Computer Interaction (cs.HC) ; Machine Learning (cs.LG) ; Multimedia (cs.MM)
[100] arXiv:2506.14223 (cross-list from cs.SD)

Title: Fretting-Transformer: Encoder-Decoder Model for MIDI to Tablature Transcription

Anna Hamberger, Sebastian Murgul, Jochen Schmidt, Michael Heizmann

Comments: Accepted to the 50th International Computer Music Conference (ICMC), 2025

Subjects: Sound (cs.SD) ; Computation and Language (cs.CL) ; Multimedia (cs.MM) ; Audio and Speech Processing (eess.AS)
