Tenma: Robust Cross-Embodiment Robot Manipulation with Diffusion Transformer

Davies, Travis; Huang, Yiqi; Liu, Yunxin; Chen, Xiang; Liu, Huxian; Hu, Luhui

Computer Science > Robotics

arXiv:2509.11865 (cs)

[Submitted on 15 Sep 2025]

Abstract: Scaling Transformer policies and diffusion models has advanced robotic manipulation, yet combining these techniques in lightweight, cross-embodiment learning settings remains challenging. We study the design choices that most affect stability and performance for diffusion-transformer policies trained on heterogeneous, multimodal robot data, and introduce Tenma, a lightweight diffusion transformer for bimanual arm control. Tenma integrates multiview RGB, proprioception, and language via three components: a cross-embodiment normalizer that maps disparate state/action spaces into a shared latent space; a Joint State-Time encoder for temporally aligned observation learning with faster inference; and a diffusion action decoder optimized for training stability and learning capacity. Across benchmarks and under matched compute, Tenma achieves an average in-distribution success rate of 88.95% and maintains strong performance under object and scene shifts, substantially exceeding baseline policies, whose best in-distribution average is 18.12%. Despite using a moderate data scale, Tenma delivers robust manipulation and generalization, indicating the potential of multimodal and cross-embodiment learning strategies to further augment the capacity of transformer-based imitation learning policies.
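
The abstract does not say how the cross-embodiment normalizer is built, so the following is only a minimal illustrative sketch of the general idea of bringing state/action vectors of different sizes into one common format. It assumes per-embodiment z-score normalization followed by zero-padding with a validity mask; the class name `EmbodimentNormalizer`, the `shared_dim` parameter, and the embodiment label `arm_a` are hypothetical and not taken from the paper, whose normalizer maps into a learned latent space and may differ substantially.

```python
from statistics import fmean, pstdev


class EmbodimentNormalizer:
    """Hypothetical sketch: per-embodiment z-scoring plus zero-padding.

    This is NOT the Tenma implementation; it only illustrates how vectors
    of different lengths and units can be placed in one fixed-size format.
    """

    def __init__(self, shared_dim=16, eps=1e-6):
        self.shared_dim = shared_dim  # assumed size of the shared vector
        self.eps = eps                # floor for the std to avoid division by zero
        self.stats = {}               # embodiment name -> (means, stds)

    def fit(self, name, samples):
        """Estimate per-dimension mean and std from equal-length vectors."""
        dims = list(zip(*samples))  # transpose: one tuple per dimension
        if len(dims) > self.shared_dim:
            raise ValueError("embodiment has more dimensions than shared_dim")
        means = [fmean(d) for d in dims]
        stds = [max(pstdev(d), self.eps) for d in dims]
        self.stats[name] = (means, stds)

    def normalize(self, name, vec):
        """Return (padded normalized vector, mask marking real dimensions)."""
        means, stds = self.stats[name]
        z = [(v - m) / s for v, m, s in zip(vec, means, stds)]
        pad = self.shared_dim - len(z)
        return z + [0.0] * pad, [1] * len(z) + [0] * pad

    def denormalize(self, name, padded):
        """Invert normalize(): drop the padding and restore original units."""
        means, stds = self.stats[name]
        return [p * s + m for p, m, s in zip(padded, means, stds)]


# Usage: a two-dimensional embodiment placed in a four-dimensional shared vector.
norm = EmbodimentNormalizer(shared_dim=4)
norm.fit("arm_a", [[0.0, 10.0], [2.0, 30.0]])
vec, mask = norm.normalize("arm_a", [2.0, 30.0])
restored = norm.denormalize("arm_a", vec)
```

The mask lets a downstream model ignore padded dimensions, and `denormalize` converts predicted actions back into the units of the originating robot. In a learned variant, the padding step would typically be replaced by a trained projection.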

Comments:	8 pages, 4 figures
Subjects:	Robotics (cs.RO); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2509.11865 [cs.RO]
	(or arXiv:2509.11865v1 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2509.11865

Submission history

From: Luhui Hu
[v1] Mon, 15 Sep 2025 12:39:15 UTC (4,432 KB)
