JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers

Byung-Ki, Kwon; Dai, Qi; Hyoseok, Lee; Luo, Chong; Oh, Tae-Hyun

arXiv:2505.00482 (cs)

[Submitted on 1 May 2025 (v1), last revised 26 Jun 2025 (this version, v2)]

Title: JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers

Authors: Kwon Byung-Ki, Qi Dai, Lee Hyoseok, Chong Luo, Tae-Hyun Oh

Abstract: We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth. By leveraging the architectural benefit and outstanding image prior of the state-of-the-art diffusion transformer, JointDiT not only generates high-fidelity images but also produces geometrically plausible and accurate depth maps. This solid joint distribution modeling is achieved through two simple yet effective techniques that we propose, i.e., adaptive scheduling weights, which depend on the noise levels of each modality, and the unbalanced timestep sampling strategy. With these techniques, we train our model across all noise levels for each modality, enabling JointDiT to naturally handle various combinatorial generation tasks, including joint generation, depth estimation, and depth-conditioned image generation by simply controlling the timestep of each branch. JointDiT demonstrates outstanding joint generation performance. Furthermore, it achieves comparable results in depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can serve as a replaceable alternative to conditional generation. The project page is available at https://byungki-k.github.io/JointDiT/.
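
The abstract says the three tasks are selected "by simply controlling the timestep of each branch". The following is a minimal sketch of that idea only, not the authors' implementation. The function name `branch_timesteps`, the task labels, the convention that timesteps lie in [0, 1] with 0 meaning a clean (given) modality, and the choice of a shared timestep for joint generation are all assumptions made for illustration.

```python
# Hypothetical illustration: which noise level each branch sees per task.
# Assumed convention: t in [0, 1], where 0.0 = clean input and 1.0 = pure noise.

def branch_timesteps(task: str, t: float) -> tuple[float, float]:
    """Return (t_rgb, t_depth) for one denoising step at noise level t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if task == "joint_generation":
        # Both modalities are generated, so both branches are noisy
        # (a shared timestep is an assumption of this sketch).
        return (t, t)
    if task == "depth_estimation":
        # The RGB image is given (kept clean); only depth is denoised.
        return (0.0, t)
    if task == "depth_conditioned_image_generation":
        # The depth map is given (kept clean); only RGB is denoised.
        return (t, 0.0)
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    for name in ("joint_generation", "depth_estimation",
                 "depth_conditioned_image_generation"):
        print(name, branch_timesteps(name, 0.7))
```

Under these assumptions, a single model trained across all noise levels of each modality can cover all three tasks, because the conditioning modality is simply the branch whose timestep is held at the clean end.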

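Comments: Accepted to the IEEE/CVF International Conference on Computer Vision (ICCV) 2025. Project page: https://byungki-k.github.io/JointDiT/ Code: https://github.com/ByungKi-K/JointDiT-code
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as: arXiv:2505.00482 [cs.CV]
	(or arXiv:2505.00482v2 [cs.CV] for this version)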
	https://doi.org/10.48550/arXiv.2505.00482

Submission history

From: Byung-Ki Kwon
[v1] Thu, 1 May 2025 12:21:23 UTC (27,033 KB)
[v2] Thu, 26 Jun 2025 06:21:40 UTC (19,905 KB)
