Accelerating Transposed Convolutions on FPGA-based Edge Devices

Haris, Jude; Cano, José


arXiv:2507.07683v1 (cs)

[Submitted on 10 Jul 2025]

Title: Accelerating Transposed Convolutions on FPGA-based Edge Devices

Authors: Jude Haris, José Cano

Abstract: Transposed Convolutions (TCONV) enable the up-scaling mechanism within generative Artificial Intelligence (AI) models. However, the predominant Input-Oriented Mapping (IOM) method for implementing TCONV has complex output mapping, overlapping sums, and ineffectual computations. These inefficiencies further exacerbate the performance bottleneck of TCONV and generative models on resource-constrained edge devices. To address this problem, in this paper we propose MM2IM, a hardware-software co-designed accelerator that combines Matrix Multiplication (MatMul) with col2IM to process TCONV layers efficiently on resource-constrained edge devices. Using the SECDA-TFLite design toolkit, we implement MM2IM and evaluate its performance across 261 TCONV problem configurations, achieving an average speedup of 1.9x against a dual-thread ARM Neon optimized CPU baseline. We then evaluate the performance of MM2IM on a range of TCONV layers from well-known generative models, achieving up to 4.2x speedup, and compare it against similar resource-constrained TCONV accelerators, outperforming them by at least 2x in GOPs/DSP. Finally, we evaluate MM2IM on the DCGAN and pix2pix GAN models, achieving up to 3x speedup and 2.4x energy reduction against the CPU baseline.
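To make the MatMul-plus-col2IM idea in the abstract concrete, here is a minimal software sketch of a transposed convolution computed as one matrix product followed by a scatter-add of the product's entries into the output. It is not the MM2IM accelerator or the authors' code: the function names, the tensor layouts, and the restriction to zero padding are all assumptions made for illustration, and no attempt is made to skip ineffectual computations.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def tconv_matmul_col2im(x, w, stride):
    """Transposed convolution with no padding (illustrative layout, an assumption).

    x: input  indexed as [H][W][Cin]
    w: filter indexed as [Kh][Kw][Cout][Cin]
    returns the output indexed as [Ho][Wo][Cout]
    """
    h, wd, cin = len(x), len(x[0]), len(x[0][0])
    kh, kw, cout = len(w), len(w[0]), len(w[0][0])
    ho, wo = (h - 1) * stride + kh, (wd - 1) * stride + kw

    # MatMul step: (H*W x Cin) times (Cin x Kh*Kw*Cout).
    x_mat = [x[i][j] for i in range(h) for j in range(wd)]
    w_mat = [[w[a][b][c][ci]
              for a in range(kh) for b in range(kw) for c in range(cout)]
             for ci in range(cin)]
    cols = matmul(x_mat, w_mat)

    # col2im step: scatter each product entry to its output pixel and
    # accumulate; overlapping sums occur whenever stride < kernel size.
    out = [[[0] * cout for _ in range(wo)] for _ in range(ho)]
    for p, row in enumerate(cols):
        i, j = divmod(p, wd)              # input pixel for this row
        for q, val in enumerate(row):
            a, rem = divmod(q, kw * cout)  # kernel row
            b, c = divmod(rem, cout)       # kernel column, output channel
            out[i * stride + a][j * stride + b][c] += val
    return out
```

For a single-channel 2x2 input `[[1, 2], [3, 4]]` and a 2x2 all-ones filter, stride 1 gives a 3x3 output whose centre value is 10, because all four input pixels overlap there; stride 2 gives a 4x4 output with no overlap at all.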

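Comments:	Accepted to the 35th International Conference on Field-Programmable Logic and Applications (FPL) 2025
Subjects:	Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2507.07683 [cs.AR]
	(or arXiv:2507.07683v1 [cs.AR] for this version)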
	https://doi.org/10.48550/arXiv.2507.07683

Submission history

From: Jude Haris Dr
[v1] Thu, 10 Jul 2025 12:05:33 UTC (1,474 KB)
