VEDA: Efficient LLM Generation Through Voting-based KV Cache Eviction and Dataflow-flexible Accelerator

Wang, Zhican; Fan, Hongxiang; Waris, Haroon; Wang, Gang; Li, Zhenyu; Jiang, Jianfei; Sun, Yanan; He, Guanghui

arXiv:2507.00797 (cs)

[Submitted on 1 Jul 2025]

Abstract: Large Language Models (LLMs) excel in natural language processing tasks but pose significant computational and memory challenges for edge deployment due to their intensive resource demands. This work addresses the efficiency of LLM inference by algorithm-hardware-dataflow tri-optimizations. We propose a novel voting-based KV cache eviction algorithm, balancing hardware efficiency and algorithm accuracy by adaptively identifying unimportant KV vectors. From a dataflow perspective, we introduce a flexible-product dataflow and a runtime reconfigurable PE array for matrix-vector multiplication. The proposed approach effectively handles the diverse dimensional requirements and solves the challenges of incrementally varying sequence lengths. Additionally, an element-serial scheduling scheme is proposed for nonlinear operations, such as softmax and layer normalization (layernorm). Results demonstrate a substantial reduction in latency, accompanied by a significant decrease in hardware complexity, from O(N) to O(1). The proposed solution is realized in a custom-designed accelerator, VEDA, which outperforms existing hardware platforms. This research represents a significant advancement in LLM inference on resource-constrained edge devices, facilitating real-time processing, enhancing data privacy, and enabling model customization.
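
The abstract does not spell out the element-serial scheduling scheme, so the following is only a rough illustration of the general idea of processing a softmax one element at a time with constant-size running state. It uses the well-known online-softmax recurrence (a running maximum and a rescaled running sum), which is an assumption chosen for illustration and not necessarily the scheme used in VEDA. The function names `streaming_softmax_stats` and `softmax_element_serial` are hypothetical.

```python
import math


def streaming_softmax_stats(scores):
    """Consume finite scores one at a time, keeping only two scalars of state.

    Returns (running_max, running_sum), where running_sum equals
    sum(exp(s - running_max)) over all scores seen so far.
    """
    running_max = float("-inf")
    running_sum = 0.0
    for s in scores:
        new_max = max(running_max, s)
        # Rescale the accumulated sum to the new maximum, then add the new term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(s - new_max)
        running_max = new_max
    return running_max, running_sum


def softmax_element_serial(scores):
    """Two passes: one to gather the statistics, one to normalize each element."""
    m, z = streaming_softmax_stats(scores)
    return [math.exp(s - m) / z for s in scores]


if __name__ == "__main__":
    print(softmax_element_serial([1.0, 2.0, 3.0]))
```

In this sketch the per-step state is two scalars regardless of the sequence length, which is why a serial formulation suits a sequence that grows by one token per generation step.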

Comments:	DAC 2025
Subjects:	Hardware Architecture (cs.AR)
Cite as:	arXiv:2507.00797 [cs.AR]
	(or arXiv:2507.00797v1 [cs.AR] for this version)
	https://doi.org/10.48550/arXiv.2507.00797

Submission history

From: Zhican Wang
[v1] Tue, 1 Jul 2025 14:30:31 UTC (885 KB)
