Performance
Showing new listings for Tuesday, 8 July 2025
- [1] arXiv:2507.03510 [pdf, html, other]
Title: On Combining Two Server Control Policies for Energy Efficiency
Comments: Extended abstract, LOCO '24
Subjects: Performance (cs.PF)
Two popular server control policies are available for reducing energy consumption while maintaining acceptable performance levels: server speed scaling and the ability to turn servers off (and on). In this work, we explore whether there are synergistic effects between these two mechanisms. To do this, we employ a continuous-time Markov chain model in which the server can be turned off (turning it back on takes some time) and the server speed can take one of two values: a nominal operating speed and a reduced operating speed. For a cost function that is linear in the mean response time and the server power consumption, we suggest that the mechanisms are not synergistic: for all system loads, one mechanism is dominant, and also employing the other yields only a small decrease in cost.
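The qualitative claim is easy to probe numerically. Below is a minimal sketch (our construction, not the paper's exact model): a truncated CTMC of a single server that turns off when idle, pays an exponential setup time to turn back on, and serves at one fixed speed per configuration, so comparing a nominal `mu` against a reduced `mu` mimics the speed-scaling mechanism. All rates, power values, the truncation level, and the weight `beta` are illustrative assumptions.

```python
# A minimal sketch (not the paper's exact model): an M/M/1-type CTMC where the
# server turns off when idle, needs an exponential setup time to turn back on,
# and serves at one fixed speed. The off state is assumed to draw zero power.
import numpy as np

def stationary_cost(lam, mu, setup_rate, p_busy_watts, p_setup_watts, beta, cap=200):
    """Cost = E[T] + beta * E[P] for a truncated CTMC.

    States: 0      -> empty, server off
            (k, S) -> k jobs, server in setup   (indices 1..cap)
            (k, B) -> k jobs, server busy       (indices cap+1..2*cap)
    """
    n = 2 * cap + 1
    Q = np.zeros((n, n))
    setup = lambda k: k            # k jobs, server warming up
    busy = lambda k: cap + k       # k jobs, server serving
    Q[0, setup(1)] = lam           # arrival wakes the server
    for k in range(1, cap):
        Q[setup(k), setup(k + 1)] = lam
        Q[busy(k), busy(k + 1)] = lam
    for k in range(1, cap + 1):
        Q[setup(k), busy(k)] = setup_rate             # setup completes
        Q[busy(k), busy(k - 1) if k > 1 else 0] = mu  # departure; last job -> off
    Q[np.diag_indices(n)] = -Q.sum(axis=1)
    A = np.vstack([Q.T, np.ones(n)])                  # solve pi Q = 0, sum(pi) = 1
    pi = np.linalg.lstsq(A, np.r_[np.zeros(n), 1.0], rcond=None)[0]
    jobs = np.r_[0, np.arange(1, cap + 1), np.arange(1, cap + 1)]
    EN = pi @ jobs
    EP = pi[1:cap + 1].sum() * p_setup_watts + pi[cap + 1:].sum() * p_busy_watts
    return EN / lam + beta * EP    # Little's law: E[T] = E[N] / lambda

# Nominal vs. reduced speed under the same off/on policy:
for mu in (1.0, 0.7):
    print(mu, stationary_cost(lam=0.5, mu=mu, setup_rate=0.2,
                              p_busy_watts=1.0, p_setup_watts=1.0, beta=1.0))
```

Sweeping `lam` and comparing configurations with and without the off state (or with a single speed) reproduces the kind of dominance comparison the abstract describes.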
- [2] arXiv:2507.03537 [pdf, html, other]
Title: Affine Frequency Division Multiplexing Over Wideband Doubly-Dispersive Channels With Time-Scaling Effects
Subjects: Performance (cs.PF); Information Theory (cs.IT)
The recently proposed affine frequency division multiplexing (AFDM) modulation has been considered a promising technology for narrowband doubly-dispersive channels. However, the time-scaling effects in extreme wideband doubly-dispersive channels, i.e., pulse widening and pulse shortening, have not been considered in the literature. In this paper, we investigate such wideband transmission and develop an efficient transmission structure with a chirp-periodic prefix (CPP) and a chirp-periodic suffix (CPS) for AFDM systems. We derive the input-output relationship of the AFDM system under time-scaled wideband doubly-dispersive channels and demonstrate the sparsity of the discrete affine Fourier (DAF) domain equivalent channel. We further optimize the AFDM chirp parameters to accommodate the time-scaling characteristics of wideband doubly-dispersive channels and verify the superiority of the derived chirp parameters through pairwise error probability (PEP) analysis. We also develop an efficient cross-domain distributed orthogonal approximate message passing (CD-D-OAMP) algorithm for AFDM symbol detection and analyze its corresponding state evolution. By analyzing the detection complexity of the CD-D-OAMP detector and evaluating the error performance of AFDM systems in simulations, we demonstrate that the AFDM system with our optimized chirp parameters outperforms existing competitive modulation schemes in time-scaled wideband doubly-dispersive channels. Moreover, our proposed CD-D-OAMP detector achieves a desirable trade-off between complexity and performance while supporting parallel computing to significantly reduce computational latency.
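For orientation, the chirp parameters the abstract optimizes enter through the standard narrowband AFDM modulation (the inverse discrete affine Fourier transform), shown below as commonly defined in the AFDM literature; the paper's contribution is re-deriving the input-output relation and re-optimizing these parameters, and replacing the usual prefix with the CPP/CPS structure, once time scaling is present.

```latex
% Standard narrowband AFDM modulation (inverse DAFT), for orientation only;
% c_1 and c_2 are the chirp parameters re-optimized by the paper for
% time-scaled wideband channels.
s[n] = \frac{1}{\sqrt{N}} \sum_{m=0}^{N-1} x[m]\,
       e^{\,j 2\pi \left( c_1 n^2 + \frac{m n}{N} + c_2 m^2 \right)},
\qquad n = 0, 1, \ldots, N-1.
```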
New submissions (showing 2 of 2 entries)
- [3] arXiv:2507.03211 (cross-list from cs.LG) [pdf, html, other]
Title: DistZO2: High-Throughput and Memory-Efficient Zeroth-Order Fine-tuning LLMs with Distributed Parallel Computing
Subjects: Machine Learning (cs.LG); Performance (cs.PF)
Fine-tuning large language models (LLMs) remains resource-intensive due to their sheer scale. While zeroth-order (ZO) optimization provides a memory-efficient alternative by eliminating backward passes, its application to multi-hundred-billion-parameter models is constrained by GPU memory and compute throughput. The ZO2 framework addresses the memory bottleneck by offloading model parameters to CPU memory and overlapping transformer block transfers with dual forward computation on a single GPU. However, ZO2 remains limited by single-device execution and achieves only modest throughput. In this work, we present DistZO2, a high-throughput, memory-efficient framework for distributed zeroth-order fine-tuning of LLMs. DistZO2 introduces three parallel strategies: (1) Perturbation Parallelism (PertP), which parallelizes the two perturbed forward passes across devices; (2) Distributed Data Parallelism (DDP), adapted to the scalar-gradient nature of ZO training; and (3) a unified 2D parallelism design that combines PertP and DDP. To further mitigate the communication bottlenecks introduced by parameter offloading, we propose a hardware-aware communication strategy that slices parameter blocks and redistributes them across GPUs via high-speed interconnects such as NVLink. DistZO2 scales zeroth-order fine-tuning to modern multi-GPU systems, preserving ZO2's memory efficiency while substantially improving training throughput. In our experiments on OPT-175B, DistZO2 achieves a 3x speedup over ZO2 with distributed computing. DistZO2's code has been open-sourced at https://github.com/liangyuwang/zo2.
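The two forward passes that PertP parallelizes come from the classic two-point (SPSA-style) zeroth-order gradient estimate. A minimal sketch with an illustrative toy loss follows; DistZO2's actual offloading, scheduling, and model code live at the repository above.

```python
# A minimal sketch of the zeroth-order gradient estimate underlying ZO
# fine-tuning. The two perturbed forward passes are independent, which is what
# Perturbation Parallelism (PertP) exploits by running them on different
# devices; the loss, learning rate, and epsilon here are illustrative.
import numpy as np

def zo_step(params, loss_fn, lr=1e-2, eps=1e-3, seed=0):
    rng = np.random.default_rng(seed)       # shared seed: regenerate z instead of storing it
    z = rng.standard_normal(params.shape)
    loss_plus = loss_fn(params + eps * z)   # forward pass 1 (device A under PertP)
    loss_minus = loss_fn(params - eps * z)  # forward pass 2 (device B under PertP)
    grad_scalar = (loss_plus - loss_minus) / (2 * eps)  # the "scalar gradient"
    return params - lr * grad_scalar * z    # step along the perturbation direction

# Toy usage: minimize ||p - 3||^2 without any backward pass.
p = np.zeros(4)
for step in range(20000):
    p = zo_step(p, lambda q: float(np.sum((q - 3.0) ** 2)), seed=step)
print(p)  # approaches [3, 3, 3, 3]
```

Because the perturbation can be regenerated from a shared seed and the estimate reduces to one scalar per step, data-parallel ranks only need to all-reduce that scalar, which is consistent with the abstract's "scalar-gradient" adaptation of DDP and its cheap 2D combination with PertP.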
- [4] arXiv:2507.03448 (cross-list from cs.SI) [pdf, html, other]
Title: Dominance or Fair Play in Social Networks? A Model of Influencer Popularity Dynamics
Comments: 18 pages
Subjects: Social and Information Networks (cs.SI); Performance (cs.PF)
This paper presents a data-driven mean-field approach to model the popularity dynamics of users seeking public attention, i.e., influencers. We propose a novel analytical model that integrates individual activity patterns, expertise in producing viral content, exogenous events, and the platform's role in visibility enhancement, ultimately determining each influencer's success. We analytically derive sufficient conditions for system ergodicity, enabling predictions of popularity distributions. A sensitivity analysis explores various system configurations, highlighting conditions favoring either dominance or fair play among influencers. Our findings offer valuable insights into the potential evolution of social networks towards more equitable or biased influence ecosystems.
- [5] arXiv:2507.04432 (cross-list from q-bio.MN) [pdf, other]
Title: Reconstructing Biological Pathways by Applying Selective Incremental Learning to (Very) Small Language Models
Authors: Pranta Saha, Joyce Reimer, Brook Byrns, Connor Burbridge, Neeraj Dhar, Jeffrey Chen, Steven Rayan, Gordon Broderick
Comments: 9 pages, 6 figures, 3 tables + 28 pages of supplementary tables; submission 76 to the 16th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB 2025)
Subjects: Molecular Networks (q-bio.MN); Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG); Performance (cs.PF)
The use of generative artificial intelligence (AI) models is becoming ubiquitous in many fields. Though progress continues to be made, general-purpose large language models (LLMs) show a tendency to deliver creative answers, often called "hallucinations", which has slowed their adoption in the medical and biomedical fields, where accuracy is paramount. We propose that the design and use of much smaller, domain- and even task-specific LMs may be a more rational and appropriate use of this technology in biomedical research. In this work, we apply a language model that is very small by today's standards to the specialized task of predicting regulatory interactions between molecular components, in order to fill gaps in our current understanding of intracellular pathways. Toward this end, we attempt to correctly posit known pathway-informed interactions recovered from manually curated pathway databases by selecting and using only the most informative examples as part of an active learning scheme. With this example, we show that a small (~110 million parameters) LM based on the Bidirectional Encoder Representations from Transformers (BERT) architecture can propose molecular interactions relevant to tuberculosis persistence and transmission with over 80% accuracy, using less than 25% of the ~520 regulatory relationships in question. Using information entropy as a metric for the iterative selection of new tuning examples, we also find that the increase in accuracy is driven by favoring incorrectly assigned statements with the highest certainty (lowest entropy). In contrast, concurrently using the correct but least certain examples contributed little and may even have been detrimental to the learning rate.
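The selection rule in the last two sentences is simple to state in code. Below is a minimal sketch of one acquisition round, assuming a classifier that outputs class probabilities; the interface and toy numbers are ours, not the paper's.

```python
# A minimal sketch of the selection rule described above: among examples the
# current model gets wrong, prefer those predicted with the highest certainty
# (lowest Shannon entropy). The classifier interface is an assumption; the
# paper uses a ~110M-parameter BERT-style model.
import numpy as np

def entropy(probs):
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def pick_tuning_examples(probs, labels, k):
    """probs: (n, n_classes) predicted distributions; labels: (n,) gold classes."""
    preds = probs.argmax(axis=-1)
    wrong = np.flatnonzero(preds != labels)
    # Confidently-wrong first: lowest entropy among the misclassified examples.
    order = wrong[np.argsort(entropy(probs[wrong]))]
    return order[:k]   # indices to add to the fine-tuning set this round

probs = np.array([[0.98, 0.02], [0.55, 0.45], [0.60, 0.40], [0.10, 0.90]])
labels = np.array([1, 1, 0, 1])   # example 0 is confidently wrong, 1 barely wrong
print(pick_tuning_examples(probs, labels, k=2))  # -> [0 1]
```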
- [6] arXiv:2507.05141 (cross-list from cs.LG) [pdf, html, other]
Title: Hardware-efficient tractable probabilistic inference for TinyML Neurosymbolic AI applications
Subjects: Machine Learning (cs.LG); Performance (cs.PF)
Neurosymbolic AI (NSAI) has recently emerged to mitigate limitations of deep learning (DL) models, e.g., quantifying their uncertainty or reasoning with explicit rules. Hence, TinyML hardware will need to support these symbolic models to bring NSAI to embedded scenarios. Yet, although symbolic models are typically compact, their sparsity and computation resolution contrast with the low resolution and density of neural models, which poses a serious challenge on resource-constrained TinyML hardware and severely limits the size of symbolic models that can be computed. In this work, we remove this bottleneck through tight hardware/software integration, presenting a complete framework for computing NSAI with TinyML hardware. We focus on symbolic models realized with tractable probabilistic circuits (PCs), a popular subclass of probabilistic models well suited to hardware integration. This framework: (1) trains a specific class of hardware-efficient \emph{deterministic} PCs, chosen for the symbolic task; (2) \emph{compresses} this PC, using our $n^{th}$-root compression technique, until it can be computed on TinyML hardware with minimal accuracy degradation; and (3) \emph{deploys} the complete NSAI model on TinyML hardware. Compared to a 64b-precision baseline necessary for the PC without compression, our workflow leads to significant hardware reductions on FPGA (up to 82.3\% in FFs, 52.6\% in LUTs, and 18.0\% in Flash usage) and an average inference speedup of 4.67x on an ESP32 microcontroller.
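One way to read the $n^{th}$-root compression on \emph{deterministic} PCs: a product factors through the root, and a deterministic sum has at most one non-zero child, so replacing every weight and leaf value by its $n$-th root computes $p(x)^{1/n}$ exactly while shrinking the dynamic range that low-precision arithmetic must represent. A hedged sketch under that reading follows; the tuple-based circuit encoding is our assumption for illustration, not the paper's implementation.

```python
# A hedged sketch of why n-th root compression is exact for *deterministic*
# PCs: products factor through the root ((ab)^(1/n) = a^(1/n) * b^(1/n)), and
# a deterministic sum has at most one non-zero child, so the root passes
# through it as well.

def evaluate(node, x, root=1):
    kind = node[0]
    if kind == "leaf":                       # ("leaf", var, value_if_1, value_if_0)
        _, var, v1, v0 = node
        return (v1 if x[var] else v0) ** (1 / root)
    if kind == "prod":                       # ("prod", child, child, ...)
        out = 1.0
        for child in node[1:]:
            out *= evaluate(child, x, root)
        return out
    if kind == "sum":                        # ("sum", (weight, child), ...)
        return sum((w ** (1 / root)) * evaluate(c, x, root) for w, c in node[1:])

# p(x) = 0.3*[x0=1][x1=1] + 0.7*[x0=0][x1=1]  (deterministic: branches on x0)
pc = ("sum",
      (0.3, ("prod", ("leaf", 0, 1.0, 0.0), ("leaf", 1, 1.0, 0.0))),
      (0.7, ("prod", ("leaf", 0, 0.0, 1.0), ("leaf", 1, 1.0, 0.0))))
x = {0: 1, 1: 1}
p = evaluate(pc, x)                 # 0.3
p_root = evaluate(pc, x, root=4)    # 0.3 ** (1/4): same value, compressed range
print(p, p_root ** 4)               # both recover 0.3
```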
Cross submissions (showing 4 of 4 entries)
- [7] arXiv:2504.10996 (replaced) [pdf, other]
Title: Denoising Application Performance Models with Noise-Resilient Priors
Authors: Gustavo de Morais, Alexander Geiß, Alexandru Calotoiu, Gregor Corbin, Ahmad Tarraf, Torsten Hoefler, Bernd Mohr, Felix Wolf
Subjects: Performance (cs.PF); Distributed, Parallel, and Cluster Computing (cs.DC)
When scaling parallel codes to larger machines, performance models help identify potential bottlenecks. Since analytically designing these mathematical representations is usually challenging, empirical models based on performance measurements offer a practical alternative. Yet, measurements on HPC systems are typically affected by noise, leading to potentially misleading model predictions. To reduce the influence of noise, we introduce application-specific dynamic priors into the modeling process, which we derive from noise-resilient measurements of computational effort and knowledge of typical algorithms used in communication routines. These priors then narrow the search space for our performance models, excluding complexity classes that reflect noise rather than performance. Our approach keeps the models much closer to theoretical expectations and significantly improves their predictive power. Finally, it cuts experimental costs in half by minimizing the number of repeated measurements.
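A minimal sketch of what "priors narrow the search space" can look like in practice follows; the candidate set, the least-squares fit, and the tree-reduction prior below are our illustrative assumptions, not the authors' tool chain.

```python
# A minimal sketch of prior-restricted empirical performance modeling: the
# prior removes complexity classes that measured noise, rather than the
# algorithm, would otherwise "explain".
import numpy as np

CANDIDATES = {                       # runtime as a function of process count p
    "c":       lambda p: np.ones_like(p, dtype=float),
    "log p":   lambda p: np.log2(p),
    "sqrt p":  lambda p: np.sqrt(p),
    "p":       lambda p: p.astype(float),
    "p log p": lambda p: p * np.log2(p),
}

def best_model(p, t, allowed):
    """Least-squares fit t ~ a + b * f(p) over the classes the prior allows."""
    scored = {}
    for name in allowed:
        X = np.column_stack([np.ones_like(p, dtype=float), CANDIDATES[name](p)])
        coef, res, *_ = np.linalg.lstsq(X, t, rcond=None)
        sse = res[0] if res.size else float(np.sum((X @ coef - t) ** 2))
        scored[name] = (sse, coef)
    return min(scored.items(), key=lambda kv: kv[1][0])

p = np.array([2, 4, 8, 16, 32, 64])
t = np.log2(p) + np.random.default_rng(1).normal(0, 0.3, p.size)  # noisy log-cost
# Prior: the routine is a tree-based reduction, so exclude super-logarithmic terms.
print(best_model(p, t, allowed=["c", "log p"])[0])   # -> "log p"
```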
- [8] arXiv:2502.15534 (replaced) [pdf, html, other]
Title: Hiku: Pull-Based Scheduling for Serverless Computing
Comments: Published in the 2025 IEEE 25th International Symposium on Cluster, Cloud and Internet Computing (CCGrid)
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Serverless computing promises convenient abstractions for developing and deploying functions that execute in response to events. In such Function-as-a-Service (FaaS) platforms, scheduling is an integral task, but current scheduling algorithms often struggle with maintaining balanced loads, minimizing cold starts, and adapting to the bursty workloads that commonly occur. In this work, we propose pull-based scheduling as a novel scheduling algorithm for serverless computing. Our key idea is to decouple worker selection from task assignment, with idle workers proactively requesting new tasks. Experimental evaluation on an open-source FaaS platform shows that pull-based scheduling, compared to other existing scheduling algorithms, significantly improves the performance and load balancing of serverless workloads, especially under high concurrency. The proposed algorithm improves response latency by 14.9% compared to hash-based scheduling, reduces the frequency of cold starts from 43% to 30%, increases throughput by 8.3%, and achieves a 12.9% more even load distribution, as measured by the number of requests assigned per worker.
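The key idea, decoupling worker selection from task assignment, amounts to replacing scheduler-push with worker-pull on a shared queue. A minimal sketch follows, with warm-container state reduced to a set of function ids; everything here is illustrative, not Hiku's implementation.

```python
# A minimal sketch of pull-based scheduling: idle workers request the next
# task from a shared queue instead of a scheduler pushing to a chosen worker.
# Cold starts are modeled by a per-worker "warm" set and a sleep.
import queue
import threading
import time

tasks = queue.Queue()

def worker(wid, results):
    warm = set()                      # function ids this worker has loaded
    while True:
        fn = tasks.get()              # idle worker *pulls* the next task
        if fn is None:
            return                    # shutdown signal
        if fn not in warm:            # cold start: simulate container init
            time.sleep(0.01)
            warm.add(fn)
        results.append((wid, fn))
        tasks.task_done()

results = []
threads = [threading.Thread(target=worker, args=(i, results)) for i in range(4)]
for t in threads:
    t.start()
for i in range(100):                  # bursty arrivals spread themselves out:
    tasks.put(f"fn{i % 5}")           # whichever worker is free takes the task
tasks.join()
for _ in threads:
    tasks.put(None)
for t in threads:
    t.join()
print(len(results), "tasks handled")
```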
- [9] arXiv:2505.00901 (replaced) [pdf, html, other]
Title: Heterogeneous Memory Benchmarking Toolkit
Subjects: Hardware Architecture (cs.AR); Performance (cs.PF)
This paper presents MemScope, an open-source, kernel-level framework for characterizing heterogeneous memory in embedded systems. MemScope enables precise characterization of the temporal behavior of the available memory modules under configurable contention-stress scenarios. It leverages kernel-level control over physical memory allocation, cache maintenance, CPU state, interrupts, and I/O device activity to accurately benchmark heterogeneous memory subsystems. This allows us to directly map contiguous regions of physical memory and instantiate allocators on them, and to finely control the cores to create or eliminate interference. Additionally, we can minimize noise and interruptions, guaranteeing more consistent and precise results than equivalent user-space solutions. Running our framework on a Xilinx Zynq UltraScale+ ZCU102 CPU-FPGA platform demonstrates its ability to precisely benchmark bandwidth and latency across various memory types, including PL-side DRAM and BRAM, in a multi-core system.
- [10] arXiv:2506.16676 (replaced) [pdf, html, other]
Title: Fast solvers for Tokamak fluid models with PETSC -- Part I
Subjects: Plasma Physics (physics.plasm-ph); Performance (cs.PF)
This work begins the development of fast, scalable solvers, based on multigrid methods, for scientific and engineering-relevant magnetohydrodynamics (MHD) models of tokamaks. These models are characterized by a distinguished direction in the toroidal coordinate that is partially aligned with the magnetic guide field, which dominates the plasma dynamics. All tokamak codes exploit this structure: for example, NIMROD (https://nimrodteam.org) uses $2D$, unstructured, high-order finite elements in the poloidal plane with Fourier modes in the toroidal coordinate, and the $3D$, extended-MHD code \textit{M3D-C1} (https://m3dc1.pppl.gov) uses $2D$, unstructured $C^1$ elements in the poloidal plane with cubic Hermite functions in the toroidal direction. This structure suggests addressing the toroidal coordinate first, which \textit{NIMROD} does at the formulation level, but which the \textit{M3D-C1} approach leaves in the algebraic system solved at each step of an implicit time integrator. This work addresses the toroidal coordinate in the \textit{M3D-C1} velocity solve by adding semi-coarsening multigrid to the existing block Jacobi solver in PETSC (https://petsc.org, the Portable, Extensible Toolkit for Scientific Computation). The addition requires little new code and allows for smaller Jacobi subdomains that are better suited to contemporary, highly parallel hardware. Competitive performance of this new solver configuration is demonstrated on a self-consistent runaway-electron model of a SPARC (https://cfs.energy/technology/sparc) disruption, and the next steps in the development of this new approach are outlined.
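A toy illustration of semi-coarsening, the ingredient added to the block Jacobi solver: grid transfers coarsen only along the distinguished toroidal direction, leaving the poloidal-plane resolution intact at every level. This is our sketch of the concept, not M3D-C1 or PETSC code.

```python
# Toy semi-coarsening restriction: the grid hierarchy coarsens only the
# toroidal axis, so each multigrid level still fully resolves the poloidal
# plane in which the unstructured elements live.
import numpy as np

def semi_coarsen(u):
    """Restrict a (poloidal_dofs, toroidal_planes) field along axis 1 only."""
    return 0.5 * (u[:, 0::2] + u[:, 1::2])

u = np.random.default_rng(0).random((6, 8))   # 6 poloidal dofs, 8 toroidal planes
levels = [u]
while levels[-1].shape[1] > 1:                # hierarchy: 8 -> 4 -> 2 -> 1 planes
    levels.append(semi_coarsen(levels[-1]))
print([lvl.shape for lvl in levels])          # poloidal size never changes
```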
- [11] arXiv:2506.19651 (replaced) [pdf, html, other]
Title: PEVLM: Parallel Encoding for Vision-Language Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Performance (cs.PF)
Vision-Language Models (VLMs) have demonstrated strong capabilities in multimodal understanding and generation tasks. However, their application to long video understanding remains hindered by the quadratic complexity of standard attention mechanisms. In this work, we introduce \textbf{PEVLM}, a fine-tuning-free parallel encoding method designed to improve the prefilling efficiency of VLMs in long-video scenarios. PEVLM partitions the input video into context blocks with a shared sink block, while preserving sequential position embeddings to align the attention weight distribution with that of Full-Attention. This design reduces attention complexity from $O((T \times N)^2)$ to $O(T \times N)$, where $T$ is the number of frames and $N$ is the number of tokens per frame, without sacrificing accuracy. Extensive experiments across multiple state-of-the-art models and benchmarks demonstrate that PEVLM consistently outperforms existing parallel encoding approaches, achieving up to a \textbf{7.47x} speedup in attention computation and reducing end-to-end latency by \textbf{40\%}. Remarkably, PEVLM not only maintains high accuracy but in some settings even surpasses Full-Attention performance. Under strict latency constraints, it achieves substantial gains, improving accuracy from \textbf{23.26\%} to \textbf{61.03\%}. These results underscore the effectiveness of PEVLM for low-latency, long-context video understanding, making it a promising solution for real-world applications.
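The attention pattern behind the complexity reduction can be sketched as a block mask: every video block attends to a shared sink block and to itself, while sequential position embeddings are kept so the weight distribution stays close to Full-Attention. Below is a minimal sketch of such a mask; the block sizes and layout are our reading of the abstract, not the released code.

```python
# A hedged sketch of the prefill attention pattern the abstract describes:
# each video context block attends to a shared sink block and to itself,
# rather than to the full sequence, so per-block cost stays constant.
import numpy as np

def pevlm_mask(sink_len, num_blocks, block_len):
    L = sink_len + num_blocks * block_len
    mask = np.zeros((L, L), dtype=bool)
    mask[:sink_len, :sink_len] = True            # sink attends within itself
    for b in range(num_blocks):
        s = sink_len + b * block_len
        e = s + block_len
        mask[s:e, :sink_len] = True              # every block sees the sink
        mask[s:e, s:e] = True                    # ...and itself, not other blocks
    return mask

m = pevlm_mask(sink_len=4, num_blocks=3, block_len=8)
full = m.size                                    # pairs under Full-Attention
sparse = int(m.sum())                            # pairs under parallel encoding
print(f"attended pairs: {sparse}/{full}")        # grows linearly with num_blocks
```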