SD-Acc: Accelerating Stable Diffusion through Phase-aware Sampling and Hardware Co-Optimizations

Wang, Zhican; He, Guanghui; Fan, Hongxiang

Computer Science > Hardware Architecture

arXiv:2507.01309 (cs)

[Submitted on 2 Jul 2025]

Title: SD-Acc: Accelerating Stable Diffusion through Phase-aware Sampling and Hardware Co-Optimizations

Authors: Zhican Wang, Guanghui He, Hongxiang Fan

Abstract: The emergence of diffusion models has significantly advanced generative AI, improving the quality, realism, and creativity of image and video generation. Among them, Stable Diffusion (StableDiff) stands out as a key model for text-to-image generation and a foundation for next-generation multi-modal algorithms. However, its high computational and memory demands hinder inference speed and energy efficiency. To address these challenges, we identify three core issues: (1) intensive and often redundant computations, (2) heterogeneous operations involving convolutions and attention mechanisms, and (3) diverse weight and activation sizes. We present SD-Acc, a novel algorithm and hardware co-optimization framework. At the algorithm level, we observe that high-level features in certain denoising phases show significant similarity, enabling approximate computation. Leveraging this, we propose an adaptive, phase-aware sampling strategy that reduces compute and memory loads. This framework automatically balances image quality and complexity based on the StableDiff model and user requirements. At the hardware level, we design an address-centric dataflow to efficiently handle heterogeneous operations within a simple systolic array. We address the bottleneck of nonlinear functions via a two-stage streaming architecture and a reconfigurable vector processing unit. Additionally, we implement adaptive dataflow optimizations by combining dynamic reuse and operator fusion tailored to StableDiff workloads, significantly reducing memory access. Across multiple StableDiff models, our method achieves up to a 3x reduction in computational demand without compromising image quality. Combined with our optimized hardware accelerator, SD-Acc delivers higher speed and energy efficiency than traditional CPU and GPU implementations.
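The abstract's observation that high-level features in some denoising phases are similar enough to approximate can be illustrated with a small sketch. This is not the authors' algorithm: the function names (`cosine_similarity`, `plan_reuse`), the cosine-similarity metric, the 0.95 threshold, and the idea of deriving the plan offline from profiled features are all assumptions made for illustration. The sketch marks a denoising step as "reuse" when its profiled feature is close to the most recently recomputed one.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two feature tensors, flattened to vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def plan_reuse(features, threshold=0.95):
    """Build a per-step plan from features profiled offline (hypothetical setup).

    Returns a list of booleans, one per denoising step:
    True  -> recompute the high-level feature at this step
    False -> reuse the feature cached at the last recomputed step
    """
    recompute = [True]          # the first step always computes its feature
    cached = features[0]
    for feat in features[1:]:
        if cosine_similarity(cached, feat) >= threshold:
            recompute.append(False)   # similar enough: approximate by reuse
        else:
            recompute.append(True)    # feature has drifted: recompute and re-cache
            cached = feat
    return recompute


if __name__ == "__main__":
    # Synthetic stand-ins for profiled features at four denoising steps.
    profiled = [
        np.array([1.0, 0.0]),
        np.array([1.0, 0.01]),
        np.array([0.0, 1.0]),
        np.array([0.01, 1.0]),
    ]
    print(plan_reuse(profiled))
```

Raising the threshold in this sketch recomputes more steps (closer to the original output, less saving), while lowering it reuses more aggressively, which mirrors the quality versus complexity trade-off the abstract describes.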

Comments:	Under review
Subjects:	Hardware Architecture (cs.AR)
Cite as:	arXiv:2507.01309 [cs.AR]
	(or arXiv:2507.01309v1 [cs.AR] for this version)
	https://doi.org/10.48550/arXiv.2507.01309

Submission history

From: Zhican Wang
[v1] Wed, 2 Jul 2025 02:53:43 UTC (2,037 KB)
