
Distributed, Parallel, and Cluster Computing

  • New submissions
  • Cross-lists
  • Replacements


Showing new listings for Thursday, 25 September 2025

Total of 18 entries

New submissions (showing 11 of 11 entries)

[1] arXiv:2509.19478 [cn-pdf, pdf, html, other]
Title: Investigating Sharding Advancements, Methodologies, and Adoption Potential in Hedera
Ziwei Wang, Cong Wu, Paolo Tasca
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Sharding has emerged as a critical solution to address the scalability challenges faced by blockchain networks, enabling them to achieve higher transaction throughput, reduced latency, and optimized resource usage. This paper investigates the advancements, methodologies, and adoption potential of sharding in the context of Hedera, a distributed ledger technology known for its unique Gossip about Gossip protocol and asynchronous Byzantine Fault Tolerance (ABFT). We explore various academic and industrial sharding techniques, emphasizing their benefits and trade-offs. Building on these insights, we propose a hybrid sharding solution for Hedera that partitions the network into local and global committees, facilitating efficient cross-shard transactions and ensuring robust security through dynamic reconfiguration. Our analysis highlights significant reductions in storage and communication overhead, improved scalability, and enhanced fault tolerance, demonstrating the feasibility and advantages of integrating sharding into Hedera's architecture.


[2] arXiv:2509.19532 [cn-pdf, pdf, html, other]
Title: To Stream or Not to Stream: Towards A Quantitative Model for Remote HPC Processing Decisions
Flavio Castro, Weijian Zheng, Joaquin Chung, Ian Foster, Rajkumar Kettimuthu
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Networking and Internet Architecture (cs.NI)

Modern scientific instruments generate data at rates that increasingly exceed local compute capabilities and, when paired with the staging and I/O overheads of file-based transfers, also render file-based use of remote HPC resources impractical for time-sensitive analysis and experimental steering. Real-time streaming frameworks promise to reduce latency and improve system efficiency, but lack a principled way to assess their feasibility. In this work, we introduce a quantitative framework and an accompanying Streaming Speed Score to evaluate whether remote high-performance computing (HPC) resources can provide timely data processing compared to local alternatives. Our model incorporates key parameters including data generation rate, transfer efficiency, remote processing power, and file input/output overhead to compute total processing completion time and identify operational regimes where streaming is beneficial. We motivate our methodology with use cases from facilities such as APS, FRIB, LCLS-II, and the LHC, and validate our approach through an illustrative case study based on LCLS-II data. Our measurements show that streaming can achieve up to 97% lower end-to-end completion time than file-based methods under high data rates, while worst-case congestion can increase transfer times by over an order of magnitude, underscoring the importance of tail latency in streaming feasibility decisions.
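To make the decision concrete, here is a minimal back-of-the-envelope sketch in Python of a streaming-versus-file completion-time comparison. The parameter names and the sequential-versus-pipelined structure are illustrative assumptions, not the paper's model or its Streaming Speed Score.

```python
# Hypothetical sketch of the streaming-vs-file decision; the parameter names and
# the sequential-vs-pipelined structure are illustrative assumptions, not the
# paper's exact model or Streaming Speed Score.

def file_based_completion(data_gb, gen_rate, transfer_rate, proc_rate, staging_overhead_s):
    """Stages run back-to-back: acquire, stage to files, transfer, then process (rates in GB/s)."""
    return (data_gb / gen_rate + staging_overhead_s +
            data_gb / transfer_rate + staging_overhead_s +
            data_gb / proc_rate)

def streaming_completion(data_gb, gen_rate, transfer_rate, proc_rate, pipeline_latency_s):
    """Stages are pipelined: completion is governed by the slowest stage plus a small latency."""
    return data_gb / min(gen_rate, transfer_rate, proc_rate) + pipeline_latency_s

if __name__ == "__main__":
    params = dict(data_gb=500.0, gen_rate=1.0, transfer_rate=2.5, proc_rate=4.0)
    t_file = file_based_completion(staging_overhead_s=180.0, **params)
    t_stream = streaming_completion(pipeline_latency_s=5.0, **params)
    print(f"file-based: {t_file:.0f} s, streaming: {t_stream:.0f} s, "
          f"streaming favored: {t_stream < t_file}")
```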


[3] arXiv:2509.19539 [cn-pdf, pdf, html, other]
Title: A Survey of Recent Advancements in Secure Peer-to-Peer Networks
Raj Patel, Umesh Biswas, Surya Kodipaka, Will Carroll, Preston Peranich, Maxwell Young
Comments: 30 pages, 4 figures, 2 tables
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Cryptography and Security (cs.CR)

Peer-to-peer (P2P) networks are a cornerstone of modern computing, and their security is an active area of research. Many defenses with strong security guarantees have been proposed; however, the most-recent survey is over a decade old. This paper delivers an updated review of recent theoretical advances that address classic threats, such as the Sybil and routing attacks, while highlighting how emerging trends -- such as machine learning, social networks, and dynamic systems -- pose new challenges and drive novel solutions. We evaluate the strengths and weaknesses of these solutions and suggest directions for future research.


[4] arXiv:2509.19701 [cn-pdf, pdf, html, other]
Title: Characterizing Adaptive Mesh Refinement on Heterogeneous Platforms with Parthenon-VIBE
Akash Poptani, Alireza Khadem, Scott Mahlke, Jonah Miller, Joshua Dolence, Reetuparna Das
Comments: Accepted to appear at IISWC 2025
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Performance (cs.PF)

Hero-class HPC simulations rely on Adaptive Mesh Refinement (AMR) to reduce compute and memory demands while maintaining accuracy. This work analyzes the performance of Parthenon, a block-structured AMR benchmark, on CPU-GPU systems. We show that smaller mesh blocks and deeper AMR levels degrade GPU performance due to increased communication, serial overheads, and inefficient GPU utilization. Through detailed profiling, we identify inefficiencies, low occupancy, and memory access bottlenecks. We further analyze rank scalability and memory constraints, and propose optimizations to improve GPU throughput and reduce memory footprint. Our insights can inform future AMR deployments on the Department of Energy's upcoming heterogeneous supercomputers.


[5] arXiv:2509.19729 [cn-pdf, pdf, html, other]
Title: Gyges: Dynamic Cross-Instance Parallelism Transformation for Efficient LLM Inference
Haoyu Chen, Xue Li, Kun Qian, Yu Guan, Jin Zhao, Xin Wang
Comments: 12 pages, 15 figures
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Efficiently handling request dynamics, especially variance in context length, is important in Large Language Model (LLM) serving scenarios. However, there is an intrinsic trade-off: while leveraging parallelism strategies, such as Tensor Parallelism (TP), can coordinate multiple GPUs to accommodate larger context lengths, it inevitably results in degraded overall throughput. In this paper, we propose Cross-Instance Parallelism Transformation (Gyges), which adaptively adjusts the parallelism strategies of running instances to align with the dynamics of incoming requests. We design (1) a page-friendly, header-centric layout to accelerate KV cache transformations; (2) dedicated weight padding to accelerate model weight transformations; and (3) a transformation-aware scheduler to cooperatively schedule requests and parallelism transformations, optimizing the overall performance. Evaluations using real-world traces show that Gyges improves throughput by 1.75x-6.57x compared to state-of-the-art solutions.


[6] arXiv:2509.19836 [cn-pdf, pdf, html, other]
Title: BurstEngine: an Efficient Distributed Framework for Training Transformers on Extremely Long Sequences of over 1M Tokens
Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Existing methods for training LLMs on long-sequence data, such as Tensor Parallelism and Context Parallelism, exhibit low Model FLOPs Utilization as sequence lengths and number of GPUs increase, especially when sequence lengths exceed 1M tokens. To address these challenges, we propose BurstEngine, an efficient framework designed to train LLMs on long-sequence data. BurstEngine introduces BurstAttention, an optimized distributed attention with lower communication cost than RingAttention. BurstAttention leverages topology-aware ring communication to fully utilize network bandwidth and incorporates fine-grained communication-computation overlap. Furthermore, BurstEngine introduces sequence-level selective checkpointing and fuses the language modeling head with the loss function to reduce memory cost. Additionally, BurstEngine introduces workload balance optimization for various types of attention masking. By integrating these optimizations, BurstEngine achieves a $1.2\times$ speedup with much lower memory overhead than the state-of-the-art baselines when training LLMs on extremely long sequences of over 1M tokens. We have made our code publicly available on GitHub: https://github.com/thunlp/BurstEngine.
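As a rough illustration of the ring-style communication underlying BurstAttention, the sketch below enumerates a plain ring schedule in which each rank forwards its key/value block to its neighbor every step, so after P-1 forwarding steps every rank has seen every block. It is a generic ring pass under an assumed rank numbering, not BurstEngine's topology-aware implementation or its communication-computation overlap.

```python
# Generic ring-pass schedule (illustrative only; not BurstEngine's topology-aware
# variant): after P-1 forwarding steps every rank has held every KV block once.

def ring_schedule(num_ranks: int):
    """Yield (step, rank, block_held) triples for a simple ring rotation of KV blocks."""
    held = list(range(num_ranks))          # rank r starts with its own block r
    for step in range(num_ranks):
        for rank in range(num_ranks):
            yield step, rank, held[rank]
        # each rank sends its current block to (rank + 1) % P and receives from (rank - 1) % P
        held = [held[(rank - 1) % num_ranks] for rank in range(num_ranks)]

if __name__ == "__main__":
    seen = {}
    for step, rank, block in ring_schedule(4):
        seen.setdefault(rank, set()).add(block)
        print(f"step {step}: rank {rank} computes attention against KV block {block}")
    assert all(len(blocks) == 4 for blocks in seen.values())  # every rank saw every block
```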


[7] arXiv:2509.20160 [cn-pdf, pdf, html, other]
Title: Characterizing the Performance of Accelerated Jetson Edge Devices for Training Deep Learning Models
Prashanthi S. K., Sai Anuroop Kesanapalli, Yogesh Simmhan
Comments: Preprint of article in ACM SIGMETRICS 2023
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Deep Neural Networks (DNNs) have had a significant impact on domains like autonomous vehicles and smart cities through low-latency inferencing on edge computing devices close to the data source. However, DNN training on the edge is poorly explored. Techniques like federated learning and the growing capacity of GPU-accelerated edge devices like NVIDIA Jetson motivate the need for a holistic characterization of DNN training on the edge. Training DNNs is resource-intensive and can stress an edge's GPU, CPU, memory and storage capacities. Edge devices also have different resources compared to workstations and servers, such as slower shared memory and diverse storage media. Here, we perform a principled study of DNN training on individual devices of three contemporary Jetson device types (AGX Xavier, Xavier NX and Nano) for three diverse DNN model--dataset combinations. We vary device and training parameters such as I/O pipelining and parallelism, storage media, mini-batch sizes and power modes, and examine their effect on CPU and GPU utilization, fetch stalls, training time, energy usage, and variability. Our analysis exposes several resource inter-dependencies and counter-intuitive insights, while also helping quantify known wisdom. Our rigorous study can help tune training performance on the edge, trade off time and energy usage on constrained devices, and even select ideal edge hardware for a DNN workload, and, in the future, extend to federated learning too. As an illustration, we use these results to build a simple model to predict the training time and energy per epoch for any given DNN across different power modes, with minimal additional profiling.
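The sketch below illustrates the kind of simple predictive model mentioned in the last sentence, assuming a linear fit from power-mode settings (CPU, GPU and memory frequency) to per-epoch time and energy. The 1/frequency features and the made-up measurements are assumptions for illustration, not the authors' fitted model.

```python
import numpy as np

# Hypothetical sketch: fit a linear model from power-mode settings to per-epoch
# time and energy, using a handful of profiled modes. The features and linear
# form are illustrative assumptions, not the paper's fitted model.

# Each row: (cpu_freq_GHz, gpu_freq_GHz, mem_freq_GHz) for a profiled power mode.
profiled_modes = np.array([
    [1.2, 0.52, 1.6],
    [1.9, 0.90, 1.6],
    [2.2, 1.37, 2.1],
    [1.2, 1.37, 2.1],
])
epoch_time_s = np.array([410.0, 265.0, 180.0, 210.0])       # made-up measurements
epoch_energy_j = np.array([5100.0, 4300.0, 3900.0, 3700.0])  # made-up measurements

X = np.hstack([profiled_modes ** -1, np.ones((len(profiled_modes), 1))])  # 1/f terms + bias
w_time, *_ = np.linalg.lstsq(X, epoch_time_s, rcond=None)
w_energy, *_ = np.linalg.lstsq(X, epoch_energy_j, rcond=None)

def predict(mode):
    feats = np.append(1.0 / np.asarray(mode), 1.0)
    return feats @ w_time, feats @ w_energy

t_hat, e_hat = predict([1.9, 1.37, 2.1])   # an unprofiled power mode
print(f"predicted: {t_hat:.0f} s/epoch, {e_hat:.0f} J/epoch")
```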


[8] arXiv:2509.20189 [cn-pdf, pdf, html, other]
Title: Pagoda: An Energy and Time Roofline Study for DNN Workloads on Edge Accelerators
Prashanthi S. K., Kunal Kumar Sahoo, Amartya Ranjan Saikia, Pranav Gupta, Atharva Vinay Joshi, Priyanshu Pansari, Yogesh Simmhan
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Edge accelerators such as Nvidia Jetsons are becoming an integral part of the computing continuum, and are often used for DNN inferencing and training. Nvidia Jetson edge devices have $2000$+ CUDA cores within a $70$W power envelope and offer $1000$s of power modes to customize CPU, GPU and memory frequencies. Their widely varying power--performance trade-offs can be exploited for energy and power-constrained deployments. While data-driven methods to predict the power and latency of DNN workloads for edge devices exist, there is a lack of principled study to understand why edge accelerators and their power modes perform the way they do. We develop a time roofline and a novel energy roofline model for the Jetson Orin AGX for diverse power modes, and couple it with an analytical model of the compute (FLOP) and memory access (bytes) for DNN inference workloads to analyze them from first principles. These reveal unique, sometimes counter-intuitive, insights into the power and performance behavior of DNN workloads on edge accelerators, e.g., the default power mode MAXN is not the most energy efficient and time efficiency implies energy efficiency for all power modes. We also extend our analytical roofline models to DNN training. Finally, we apply these methods to tune the power mode (and hence the roofline) of the edge device to optimize the latency and energy for DNN inference, with up to $15\%$ lower energy and minimal degradation in inference time.
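The roofline reasoning can be written down compactly. Below is a minimal sketch of a time roofline together with one plausible form of an energy roofline (per-FLOP and per-byte energies plus idle power over the runtime); the energy formulation and all constants are assumptions, not the paper's calibrated model for the Orin AGX.

```python
# Minimal roofline sketch. The energy formulation (per-FLOP and per-byte costs plus
# idle power over the runtime) and all constants are illustrative assumptions, not
# the paper's calibrated model for a specific Jetson power mode.

def time_roofline(flops, bytes_moved, peak_flops, peak_bw):
    """Runtime lower bound: the kernel is limited by compute or by memory traffic."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

def energy_roofline(flops, bytes_moved, e_per_flop, e_per_byte, idle_power, runtime):
    """Energy lower bound: dynamic compute + dynamic memory + idle power over the runtime."""
    return flops * e_per_flop + bytes_moved * e_per_byte + idle_power * runtime

if __name__ == "__main__":
    flops, bytes_moved = 4e12, 2e10            # a hypothetical DNN layer
    t = time_roofline(flops, bytes_moved, peak_flops=5e12, peak_bw=2e11)
    e = energy_roofline(flops, bytes_moved, e_per_flop=1e-11, e_per_byte=2e-10,
                        idle_power=10.0, runtime=t)
    print(f"arithmetic intensity: {flops / bytes_moved:.0f} FLOP/byte, "
          f"time >= {t:.3f} s, energy >= {e:.1f} J")
```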


[9] arXiv:2509.20205 [cn-pdf, pdf, html, other]
Title: Fulcrum: Optimizing Concurrent DNN Training and Inferencing on Edge Accelerators
Prashanthi S. K., Saisamarth Taluri, Pranav Gupta, Amartya Ranjan Saikia, Kunal Kumar Sahoo, Atharva Vinay Joshi, Lakshya Karwa, Kedar Dhule, Yogesh Simmhan
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

The proliferation of GPU accelerated edge devices like Nvidia Jetsons and the rise in privacy concerns are placing an emphasis on concurrent DNN training and inferencing on edge devices. Inference and training have different computing and QoS goals. But edge accelerators like Jetson do not support native GPU sharing and expose 1000s of power modes. This requires careful time-sharing of concurrent workloads to meet power--performance goals, while limiting costly profiling. In this paper, we design an intelligent time-slicing approach for concurrent DNN training and inferencing on Jetsons. We formulate an optimization problem to interleave training and inferencing minibatches, and decide the device power mode and inference minibatch size, while maximizing the training throughput and staying within latency and power budgets, with modest profiling costs. We propose GMD, an efficient multi-dimensional gradient descent search which profiles just $15$ power modes; and ALS, an Active Learning technique which identifies reusable Pareto-optimal power modes, but profiles $50$--$150$ power modes. We evaluate these within our Fulcrum scheduler for $273,000+$ configurations across $15$ DNN workloads. We also evaluate our strategies on dynamic arrival inference and concurrent inferences. ALS and GMD outperform simpler and more complex baselines with larger-scale profiling. Their solutions satisfy the latency and power budget for $>97\%$ of our runs, and on average are within $7\%$ of the optimal throughput.
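To make the search concrete, here is a hypothetical sketch of a GMD-style multi-dimensional descent over discrete power-mode dimensions, probing one neighbor per dimension per round and keeping the best configuration that meets the latency and power budgets. The dimensions, budgets, and the random profiling stub are illustrative assumptions, not Fulcrum's objective or code.

```python
import random

# Hypothetical GMD-style search over discrete power-mode dimensions, probing one
# neighbor per dimension per round. Dimensions, budgets, and the profiling stub
# are illustrative assumptions, not Fulcrum's objective function or code.

DIMS = {
    "cpu_freq": [0.8, 1.2, 1.9, 2.2],     # GHz
    "gpu_freq": [0.52, 0.9, 1.37],        # GHz
    "mem_freq": [1.6, 2.1, 3.2],          # GHz
    "infer_batch": [1, 2, 4, 8],
}
LATENCY_BUDGET_MS, POWER_BUDGET_W = 50.0, 30.0

def profile(cfg):
    """Stand-in for an on-device measurement of (training throughput, latency, power)."""
    random.seed(int(sum(v * 1000 for v in cfg.values())))
    return random.uniform(10, 100), random.uniform(20, 60), random.uniform(15, 35)

def gmd_search(rounds=3):
    cfg = {k: v[0] for k, v in DIMS.items()}          # start from the lowest settings
    best_tput, best_cfg, profiled = None, cfg, 0
    for _ in range(rounds):
        for dim, values in DIMS.items():
            i = values.index(cfg[dim])
            for j in (i - 1, i + 1):                  # probe both neighbors along this dimension
                if 0 <= j < len(values):
                    cand = {**cfg, dim: values[j]}
                    tput, lat, pwr = profile(cand)
                    profiled += 1
                    feasible = lat <= LATENCY_BUDGET_MS and pwr <= POWER_BUDGET_W
                    if feasible and (best_tput is None or tput > best_tput):
                        best_tput, best_cfg, cfg = tput, cand, cand
    return best_tput, best_cfg, profiled

tput, cfg, n = gmd_search()
print(f"profiled {n} candidate power modes; best feasible throughput: {tput} with {cfg}")
```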


[10] arXiv:2509.20223 [cn-pdf, pdf, other]
Title: An Empirical Analysis of Secure Federated Learning for Autonomous Vehicle Applications
Md Jueal Mia, M. Hadi Amini
Comments: i3CE 2024, 2024 ASCE International Conference on Computing in Civil Engineering
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Federated Learning is a promising paradigm for enabling distributed learning in autonomous vehicle applications, ensuring data privacy while enhancing and refining predictive model performance through collaborative training on edge client vehicles. However, it remains vulnerable to various categories of cyber-attacks, necessitating more robust security measures to effectively mitigate potential threats. Poisoning attacks and inference attacks are commonly initiated within the federated learning environment to compromise secure system performance. Secure aggregation can limit the disclosure of sensitive information from outsider and insider attackers of the federated learning environment. In this study, our aim is to conduct an empirical analysis on a transportation image dataset (e.g., LISA traffic light) using various secure aggregation techniques and multiparty computation in the presence of diverse categories of cyber-attacks. Multiparty computation serves as a state-of-the-art security mechanism, offering standard privacy for secure aggregation of edge autonomous vehicles' local model updates through various security protocols. The presence of adversaries can mislead the autonomous vehicle learning model, leading to the misclassification of traffic lights and resulting in detrimental impacts. This empirical study explores the resilience of various secure federated learning aggregation techniques and multiparty computation in safeguarding autonomous vehicle applications against various cyber threats during both training and inference times.
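For readers unfamiliar with the secure aggregation idea being evaluated, here is a generic sketch of pairwise additive masking: each pair of clients shares a random mask that one adds and the other subtracts, so individual updates stay hidden while the server's sum is unchanged. This is a textbook illustration under simplified assumptions (no dropouts, no key agreement), not the specific protocols or multiparty-computation framework used in the study.

```python
import numpy as np

# Generic pairwise-masking secure aggregation sketch (no dropouts, no key agreement),
# illustrating why the server learns only the sum of client updates, not each update.

rng = np.random.default_rng(0)
n_clients, dim = 4, 6
updates = [rng.normal(size=dim) for _ in range(n_clients)]   # local model updates

# Every pair (i, j) with i < j shares a mask; client i adds it, client j subtracts it.
masks = {(i, j): rng.normal(size=dim) for i in range(n_clients) for j in range(i + 1, n_clients)}

masked = []
for i in range(n_clients):
    m = updates[i].copy()
    for (a, b), mask in masks.items():
        if a == i:
            m += mask
        elif b == i:
            m -= mask
    masked.append(m)          # this is all the server ever sees from client i

server_sum = sum(masked)      # pairwise masks cancel in the aggregate
assert np.allclose(server_sum, sum(updates))
print("aggregate recovered exactly; individual masked updates look random")
```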


[11] arXiv:2509.20340 [cn-pdf, pdf, other]
Title: xGFabric: Coupling Sensor Networks and HPC Facilities with Private 5G Wireless Networks for Real-Time Digital Agriculture
Liubov Kurafeeva, Alan Subedi, Ryan Hartung, Michael Fay, Avhishek Biswas, Shantenu Jha, Ozgur O. Kilic, Chandra Krintz, Andre Merzky, Douglas Thain, Mehmet C. Vuran, Rich Wolski
Comments: 8 pages with 7 figures followed by 3 pages of reproducibility appendix. This paper will be published following the SC 2025 conference on November 16-21, 2025 at St Louis, MO, USA. ISBN: 978-8-4007-1871-7/2025/11
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Advanced scientific applications require coupling distributed sensor networks with centralized high-performance computing facilities. Citrus Under Protective Screening (CUPS) exemplifies this need in digital agriculture, where citrus research facilities are instrumented with numerous sensors monitoring environmental conditions and detecting protective screening damage. CUPS demands access to computational fluid dynamics codes for modeling environmental conditions and guiding real-time interventions like water application or robotic repairs. These computing domains have contrasting properties: sensor networks provide low-performance, limited-capacity, unreliable data access, while high-performance facilities offer enormous computing power through high-latency batch processing. Private 5G networks present novel capabilities addressing this challenge by providing low latency, high throughput, and reliability necessary for near-real-time coupling of edge sensor networks with HPC simulations. This work presents xGFabric, an end-to-end system coupling sensor networks with HPC facilities through Private 5G networks. The prototype connects remote sensors via 5G network slicing to HPC systems, enabling real-time digital agriculture simulation.


Cross submissions (showing 3 of 3 entries)

[12] arXiv:2509.19396 (cross-list from cs.LG) [cn-pdf, pdf, html, other]
Title: OmniFed: A Modular Framework for Configurable Federated Learning from Edge to HPC
Sahil Tyagi, Andrei Cozma, Olivera Kotevska, Feiyi Wang
Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI) ; Cryptography and Security (cs.CR) ; Distributed, Parallel, and Cluster Computing (cs.DC)

Federated Learning (FL) is critical for edge and High Performance Computing (HPC) where data is not centralized and privacy is crucial. We present OmniFed, a modular framework designed around decoupling and clear separation of concerns for configuration, orchestration, communication, and training logic. Its architecture supports configuration-driven prototyping and code-level override-what-you-need customization. We also support different topologies, mixed communication protocols within a single deployment, and popular training algorithms. It also offers optional privacy mechanisms including Differential Privacy (DP), Homomorphic Encryption (HE), and Secure Aggregation (SA), as well as compression strategies. These capabilities are exposed through well-defined extension points, allowing users to customize topology and orchestration, learning logic, and privacy/compression plugins, all while preserving the integrity of the core system. We evaluate multiple models and algorithms to measure various performance metrics. By unifying topology configuration, mixed-protocol communication, and pluggable modules in one stack, OmniFed streamlines FL deployment across heterogeneous environments. The GitHub repository is available at https://github.com/at-aaims/OmniFed.


[13] arXiv:2509.19486 (cross-list from cs.RO) [cn-pdf, pdf, html, other]
Title: Supercomputing for High-speed Avoidance and Reactive Planning in Robots
Kieran S. Lachmansingh, José R. González-Estrada, Ryan E. Grant, Matthew K. X. J. Pan
Comments: 8 pages, 3 figures
Subjects: Robotics (cs.RO) ; Distributed, Parallel, and Cluster Computing (cs.DC)

This paper presents SHARP (Supercomputing for High-speed Avoidance and Reactive Planning), a proof-of-concept study demonstrating how high-performance computing (HPC) can enable millisecond-scale responsiveness in robotic control. While modern robots face increasing demands for reactivity in human--robot shared workspaces, onboard processors are constrained by size, power, and cost. Offloading to HPC offers massive parallelism for trajectory planning, but its feasibility for real-time robotics remains uncertain due to network latency and jitter. We evaluate SHARP in a stress-test scenario where a 7-DOF manipulator must dodge high-speed foam projectiles. Using a parallelized multi-goal A* search implemented with MPI on both local and remote HPC clusters, the system achieves mean planning latencies of 22.9 ms (local) and 30.0 ms (remote, ~300 km away), with avoidance success rates of 84% and 88%, respectively. These results show that when round-trip latency remains within the tens-of-milliseconds regime, HPC-side computation is no longer the bottleneck, enabling avoidance well below human reaction times. The SHARP results motivate hybrid control architectures: low-level reflexes remain onboard for safety, while bursty, high-throughput planning tasks are offloaded to HPC for scalability. By reporting per-stage timing and success rates, this study provides a reproducible template for assessing real-time feasibility of HPC-driven robotics. Collectively, SHARP reframes HPC offloading as a viable pathway toward dependable, reactive robots in dynamic environments.
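As a rough sketch of how multi-goal planning can be fanned out across MPI ranks, in the spirit of but not identical to SHARP's implementation, the snippet below scatters candidate goals over ranks with mpi4py, lets each rank run a placeholder planner, and gathers the lowest-cost plan at rank 0. The planner stub, cost model, and goal encoding are assumptions.

```python
from mpi4py import MPI

# Illustrative fan-out of multi-goal planning over MPI ranks; the planner stub,
# cost model, and goal encoding are assumptions, not SHARP's actual A* search.

def plan_to_goal(goal):
    """Stand-in for an A* search toward one candidate avoidance pose; returns (cost, goal)."""
    x, y, z = goal
    return (x * x + y * y + z * z) ** 0.5, goal   # pretend cost = distance to the pose

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    goals = [(0.1 * i, 0.2, 0.5 + 0.05 * i) for i in range(16)]   # candidate dodge poses
    chunks = [goals[i::size] for i in range(size)]                # round-robin split
else:
    chunks = None

my_goals = comm.scatter(chunks, root=0)
my_best = min((plan_to_goal(g) for g in my_goals),
              key=lambda c: c[0], default=(float("inf"), None))
all_best = comm.gather(my_best, root=0)

if rank == 0:
    cost, goal = min(all_best, key=lambda c: c[0])
    print(f"best avoidance goal {goal} with cost {cost:.3f}")
```

Run with, e.g., mpirun -n 4 python sharp_sketch.py (the filename is arbitrary); with a single rank the code degenerates to a serial multi-goal search.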


[14] arXiv:2509.20241 (cross-list from cs.LG) [cn-pdf, pdf, other]
Title: Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute
Felipe Oviedo, Fiodar Kazhamiaka, Esha Choukse, Allen Kim, Amy Luers, Melanie Nakagawa, Ricardo Bianchini, Juan M. Lavista Ferres
Comments: A preprint version with DOI is available at Zenodo: https://doi.org/10.5281/zenodo.17188770
Subjects: Machine Learning (cs.LG) ; Distributed, Parallel, and Cluster Computing (cs.DC)

As AI inference scales to billions of queries and emerging reasoning and agentic workflows increase token demand, reliable estimates of per-query energy use are increasingly important for capacity planning, emissions accounting, and efficiency prioritization. Many public estimates are inconsistent and overstate energy use, because they extrapolate from limited benchmarks and fail to reflect efficiency gains achievable at scale. In this perspective, we introduce a bottom-up methodology to estimate the per-query energy of large-scale LLM systems based on token throughput. For models running on an H100 node under realistic workloads, GPU utilization and PUE constraints, we estimate a median energy per query of 0.34 Wh (IQR: 0.18-0.67) for frontier-scale models (>200 billion parameters). These results are consistent with measurements using production-scale configurations and show that non-production estimates and assumptions can overstate energy use by 4-20x. Extending to test-time scaling scenarios with 15x more tokens per typical query, the median energy rises 13x to 4.32 Wh, indicating that targeting efficiency in this regime will deliver the largest fleet-wide savings. We quantify achievable efficiency gains at the model, serving platform, and hardware levels, finding individual median reductions of 1.5-3.5x in energy per query, while combined advances can plausibly deliver 8-20x reductions. To illustrate the system-level impact, we estimate the baseline daily energy use of a deployment serving 1 billion queries to be 0.8 GWh/day. If 10% are long queries, demand could grow to 1.8 GWh/day. With targeted efficiency interventions, it falls to 0.9 GWh/day, similar to the energy footprint of web search at that scale. This echoes how data centers historically tempered energy growth through efficiency gains during the internet and cloud build-up.
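The bottom-up arithmetic behind a per-query figure can be reproduced in a few lines. The formula (node power scaled by PUE, divided by effective token throughput, multiplied by tokens per query) and the specific numbers below are a hedged reconstruction for illustration, not the paper's measured inputs or distributions.

```python
# Back-of-the-envelope per-query energy, illustrating the bottom-up structure of the
# estimate. The specific numbers are placeholders, not the paper's measured inputs.

def energy_per_query_wh(node_power_w, pue, tokens_per_s, utilization, tokens_per_query):
    """Energy (Wh) = facility power per node / effective throughput * tokens served."""
    effective_tps = tokens_per_s * utilization
    joules = node_power_w * pue / effective_tps * tokens_per_query
    return joules / 3600.0

typical = energy_per_query_wh(node_power_w=10_200, pue=1.2, tokens_per_s=18_000,
                              utilization=0.6, tokens_per_query=1_000)
long_query = energy_per_query_wh(node_power_w=10_200, pue=1.2, tokens_per_s=18_000,
                                 utilization=0.6, tokens_per_query=15_000)
print(f"typical query: {typical:.2f} Wh, long (test-time scaling) query: {long_query:.2f} Wh")
```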


Replacement submissions (showing 4 of 4 entries)

[15] arXiv:2504.19516 (replaced) [cn-pdf, pdf, html, other]
Title: Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration
Zejia Lin, Hongxin Xu, Guanyi Chen, Zhiguang Chen, Yutong Lu, Xianwei Zhang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Modern LLM serving systems confront inefficient GPU utilization due to the fundamental mismatch between compute-intensive prefill and memory-bound decode phases. While current practices attempt to address this by organizing these phases into hybrid batches, such solutions create an inefficient tradeoff that sacrifices either throughput or latency, leaving substantial GPU resources underutilized. We identify two key root causes: 1) the prefill phase suffers from suboptimal compute utilization due to wave quantization and attention bottlenecks; 2) hybrid batches disproportionately prioritize latency over throughput, resulting in wasted compute and memory bandwidth. To mitigate the issues, we present Bullet, a novel spatial-temporal orchestration system that eliminates these inefficiencies through precise phase coordination. Bullet enables concurrent execution of prefill and decode phases, while dynamically provisioning GPU resources using real-time performance modeling. By integrating SLO-aware scheduling and adaptive resource allocation, Bullet maximizes utilization without compromising latency targets. Experimental evaluations on real-world workloads demonstrate that Bullet delivers 1.26x average throughput gains (up to 1.55x) over state-of-the-art systems, while consistently meeting latency constraints.


[16] arXiv:2508.13523 (replaced) [cn-pdf, pdf, html, other]
Title: LAMMPS-KOKKOS: Performance Portable Molecular Dynamics Across Exascale Architectures
Anders Johansson, Evan Weinberg, Christian R. Trott, Megan J. McCarthy, Stan G. Moore
Comments: 16 pages, 7 figures
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Performance (cs.PF) ; Computational Physics (physics.comp-ph)

Since its inception in 1995, LAMMPS has grown to be a world-class molecular dynamics code, with thousands of users, over one million lines of code, and multi-scale simulation capabilities. We discuss how LAMMPS has adapted to the modern heterogeneous computing landscape by integrating the Kokkos performance portability library into the existing C++ code. We investigate performance portability of simple pairwise, many-body reactive, and machine-learned force-field interatomic potentials. We present results on GPUs across different vendors and generations, and analyze performance trends, probing FLOPS throughput, memory bandwidths, cache capabilities, and thread-atomic operation performance. Finally, we demonstrate strong scaling on three exascale machines -- OLCF Frontier, ALCF Aurora, and NNSA El Capitan -- as well as on the CSCS Alps supercomputer, for the three potentials.


[17] arXiv:2505.12144 (replaced) [cn-pdf, pdf, html, other]
Title: Proof-of-Social-Capital: A Consensus Protocol Replacing Stake for Social Capital
Juraj Mariani, Ivan Homoliak
Subjects: Cryptography and Security (cs.CR) ; Distributed, Parallel, and Cluster Computing (cs.DC)

Consensus protocols used today in blockchains mostly rely on scarce resources such as computational power or financial stake, favoring wealthy individuals due to a high entry barrier. We propose Proof-of-Social-Capital (PoSC), a new consensus protocol fueled by social capital as a staking resource to ensure fairness and decentralization. Consensus nodes in our system do not require financial or computational resources that are expensive to acquire; instead, they require preexisting social media influence, distributing consensus power not according to wealth but to social capital. Our approach integrates zkSNARK proofs, verifiable credentials with a uniqueness-enforcing mechanism to prevent Sybil attacks, and an incentive scheme that rewards follower engagement with social media content. This work offers a new concept aligned with modern social media lifestyle applied in finance, providing a practical insight for the evolution of decentralized consensus protocols.


[18] arXiv:2506.23906 (replaced) [cn-pdf, pdf, html, other]
Title: Segmented Operations using Matrix Multiplications
Aleksandros Sobczyk, Giuseppe Sorrentino, Anastasios Zouzias
Subjects: Data Structures and Algorithms (cs.DS) ; Computational Complexity (cs.CC) ; Distributed, Parallel, and Cluster Computing (cs.DC)

Specialized computational units that perform small matrix multiplications as primitive operations are typically present in modern AI accelerators. However, these Matrix Multiplication Units (MMUs) are often underutilized for many fundamental deep learning operations besides dense matrix multiplications. At the same time, the lack of a rigorous theoretical model of computation for such architectures obstructs algorithmic design. In this work, we propose MMV-RAM, a computational model which judiciously extends the Vector-RAM model with an additional MMU. We provide a detailed theoretical analysis and carefully balance the computational power between the matrix and vector units, guided by the circuit complexity lower bound that parity is not in AC$^0$. Given MMV-RAM, we proceed to algorithm design, starting with two fundamental parallel operations: segmented scan and sum. By expressing them as compositions of elementary parallel primitives (e.g., segmented sum reduces to: scan, compress, and vector differentiation), we can exploit MMUs to perform speculative blocked computations, ultimately leading to provable theoretical speed-ups against vector-only approaches. These results extend to other ubiquitous AI kernels, including dense matrix product and sparse matrix-vector product. As a case study, we implemented the proposed algorithms on the Ascend 910B AI accelerator, which contains matrix and vector cores. We evaluate these implementations on synthetic and real-world datasets from various applications, including Large Language Models.
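The decomposition named in the abstract, segmented sum as a composition of scan, compress, and vector differentiation, can be checked directly with NumPy on flag-delimited segments. The sketch shows only the primitive decomposition, not the blocked, MMU-accelerated algorithms of MMV-RAM.

```python
import numpy as np

# Segmented sum as scan -> compress -> differentiate, on flag-delimited segments.
# This shows only the primitive decomposition from the abstract, not the blocked
# MMU-accelerated algorithms of MMV-RAM.

values = np.array([3, 1, 4, 1, 5, 9, 2, 6], dtype=np.int64)
flags  = np.array([1, 0, 0, 1, 0, 1, 0, 0])        # 1 marks the start of a segment

scan = np.cumsum(values)                            # inclusive prefix sums (scan)
seg_ends = np.flatnonzero(np.append(flags[1:], 1))  # last index of each segment
ends_scan = scan[seg_ends]                          # compress: keep scan values at segment ends
seg_sums = np.diff(ends_scan, prepend=0)            # vector differentiation recovers per-segment sums

print(seg_sums)                                     # per-segment sums: 8, 6, 17
assert seg_sums.tolist() == [3 + 1 + 4, 1 + 5, 9 + 2 + 6]
```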

