Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC

Wei, Xinming; Zhang, Jiahao; Li, Haoran; Chen, Jiayu; Qu, Rui; Li, Maoliang; Chen, Xiang; Luo, Guojie

arXiv:2506.24045 (cs)

[Submitted on 30 Jun 2025]

Authors: Xinming Wei, Jiahao Zhang, Haoran Li, Jiayu Chen, Rui Qu, Maoliang Li, Xiang Chen, Guojie Luo

Abstract: The proliferation of agentic Large Language Models (LLMs) on personal devices introduces a new class of workloads characterized by a dichotomy of objectives. Reactive tasks, initiated by users, demand immediate, low-latency responses, while proactive tasks operate invisibly and prioritize throughput. Existing on-device LLM engines, designed for isolated inferences, fail to efficiently manage these concurrent and conflicting requests on consumer-grade heterogeneous SoCs with CPU, integrated GPU, and NPU. This paper introduces Agent.xpu, an efficient serving system for agentic LLM workloads on memory-unified heterogeneous SoCs. With dedicated offline profiling, Agent.xpu first constructs a heterogeneous execution graph, which fuses and chunks model kernels for affinity-guided, elastic accelerator mapping with predictive kernel annotation. At runtime, its online scheduler enables fine-grained, kernel-level preemption to guarantee the responsiveness of reactive tasks. To maximize SoC utilization, it adopts slack-aware kernel backfill to opportunistically append proactive tasks, and mitigates NPU-iGPU contention via bandwidth-aware dispatch. Evaluation on an Intel Core Ultra SoC shows that Agent.xpu achieves 4.6$\times$ lower latency for reactive tasks and sustains 1.6$\times$ to 6.8$\times$ higher throughput for proactive tasks compared to state-of-the-art inference engines.
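
The two runtime ideas named in the abstract, kernel-level preemption and slack-aware backfill, can be illustrated with a toy model. The sketch below is not the authors' scheduler: it assumes a single accelerator, assumes predicted kernel durations are already known (standing in for the offline profiling), and every name in it (`simulate`, `backfill`, the millisecond values) is hypothetical.

```python
from collections import deque


def simulate(proactive, reactive, reactive_arrival):
    """Toy single-accelerator timeline (illustrative assumption, not Agent.xpu).

    `proactive` and `reactive` are lists of predicted kernel durations in ms.
    Decisions are made only at kernel boundaries: a running kernel always
    finishes, but an arrived reactive task takes over at the next boundary.
    Returns (timeline, reactive_finish_time).
    """
    pro, rea = deque(proactive), deque(reactive)
    now, timeline, reactive_done = 0.0, [], None
    while pro or rea:
        if rea and now >= reactive_arrival:
            kind, dur = "reactive", rea.popleft()
        elif pro:
            kind, dur = "proactive", pro.popleft()
        else:
            # Only reactive work remains and it has not arrived yet: idle.
            now = reactive_arrival
            continue
        timeline.append((now, kind, dur))
        now += dur
        if kind == "reactive" and not rea:
            reactive_done = now
    return timeline, reactive_done


def backfill(slack_ms, proactive):
    """Greedily pick leading proactive kernels that fit in an idle window.

    Kernel order is preserved, so selection stops at the first kernel whose
    predicted duration exceeds the remaining slack.
    """
    chosen = []
    for dur in proactive:
        if dur > slack_ms:
            break
        chosen.append(dur)
        slack_ms -= dur
    return chosen


timeline, reactive_done = simulate([4, 4, 4], [2, 2], reactive_arrival=5)
print(timeline)
print(reactive_done)
print(backfill(5, [2, 2, 2]))
```

In this toy run the reactive task arrives at 5 ms, waits only for the in-flight proactive kernel to end at 8 ms, and finishes at 12 ms; with task-level scheduling it would wait for all three proactive kernels and finish at 16 ms. The `backfill` helper shows the complementary idea: a 5 ms idle window admits two 2 ms proactive kernels and rejects the third.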

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2506.24045 [cs.DC]
	(or arXiv:2506.24045v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2506.24045

Submission history

From: Xinming Wei
[v1] Mon, 30 Jun 2025 16:50:48 UTC (663 KB)
