Bi-Manual Joint Camera Calibration and Scene Representation

Haozhan Tang1, Tianyi Zhang1, Matthew Johnson-Roberson1,2, Weiming Zhi1
1 Robotics Institute, Carnegie Mellon University, Pittsburgh, USA. 2 College of Connected Computing, Vanderbilt University, USA. Email: wzhi@andrew.cmu.edu.
Project page: https://tomtang502.github.io/bijcr_web/
Abstract

Robot manipulation, especially bimanual manipulation, often requires setting up multiple cameras on multiple robot manipulators. Before robot manipulators can generate motion or even build representations of their environments, the cameras rigidly mounted to the robots need to be calibrated. Camera calibration is a cumbersome process that involves collecting a set of images, each capturing a pre-determined marker. In this work, we introduce the Bi-Manual Joint Calibration and Representation Framework (Bi-JCR). Bi-JCR enables multiple robot manipulators, each with a rigidly mounted camera, to circumvent taking images of calibration markers. By leveraging 3D foundation models for dense, marker-free multi-view correspondence, Bi-JCR jointly estimates: (i) the extrinsic transformation from each camera to its end-effector, (ii) the inter-arm relative poses between manipulators, and (iii) a unified, scale-consistent 3D representation of the shared workspace, all from the same captured RGB image sets. The representation, jointly constructed from images captured by cameras on both manipulators, lives in a common coordinate frame and supports collision checking and semantic segmentation to facilitate downstream bimanual coordination tasks. We empirically evaluate the robustness of Bi-JCR on a variety of tabletop environments and demonstrate its applicability to downstream tasks.

I INTRODUCTION

Robot manipulators with wrist-mounted cameras generally need to be meticulously calibrated offline to enable perceived objects to be transformed into the robot’s coordinate frame. This is done via a procedure known as hand-eye calibration, where the manipulator is moved through a set of poses and takes images of a known calibration marker, such as a checkerboard or AprilTag [1]. Traditional hand-eye calibration methods focus on a single “eye-in-hand” camera and rely on external markers to compute the rigid transform between camera and end-effector. When extended to two independently moving arms, these approaches must be repeated separately for each arm, and a secondary step is then required to fuse the two coordinate frames. In this work, we tackle the problem of calibrating dual manipulators with wrist-mounted cameras without using any calibration markers. Here, we assume that the poses of the cameras relative to the end-effectors, along with the relative poses of the manipulator bases, are unknown and require estimation.

Figure 1: We tackle a bi-manual setup, where the extrinsics of both cameras and the relative poses of the robot bases to one another are unknown. Bi-JCR recovers all three transformations.

Here, we propose a framework called Bi-manual Joint Representation and Calibration (Bi-JCR) that simultaneously builds a representation of the environment and calibrates both cameras for dual manipulators with wrist-mounted cameras. Bi-JCR uses the same set of images for both calibrating the cameras and constructing the environment representation, completely avoiding the need to calibrate offline with markers. Bi-JCR leverages modern 3D foundation models to efficiently estimate an unscaled representation along with unscaled camera poses from a set of images captured by the dual manipulators. Then, by considering the forward kinematics of each arm, we formulate a joint scale recovery and dual calibration problem, which can subsequently be solved via gradient descent on the manifold of transformation matrices. By optimizing a single calibration problem defined using images from both arms, Bi-JCR simultaneously solves for each hand-eye transform, aligns the two robot base frames, recovers the missing scale factor, and directly yields the rigid transform between the two manipulators. This enables immediate fusion of visual data across both viewpoints for bimanual manipulation, without reliance on external markers or depth sensors.

We empirically evaluate Bi-JCR and demonstrate its ability to accurately calibrate the cameras on both manipulators and produce a dense, size-accurate representation of the environment that can be transformed into the workspace coordinate frame. We leverage the representation for downstream manipulation, executing successful grasps and bi-manual hand-overs. Concretely, our contributions include:

  • The Bi-manual Joint Calibration and Representation (Bi-JCR) method, which leverages 3D foundation models to build an environment representation from images captured by wrist-mounted cameras while calibrating the cameras;

  • The formulation of a novel optimization problem that recovers the camera transformations on both manipulators, the relative pose between the manipulators, and a scale factor that brings the representation to metric scale;

  • Rigorous evaluation on real-world data, including an ablation over state-of-the-art 3D foundation models integrated into Bi-JCR.

II Related Work

Figure 2: The transformations between each arm, along with their cameras, are shown here. The transformations $T^{E_1}_{C_1}$, $T^{E_2}_{C_2}$, and $P^{b_1}_{b_2}$ are all unknown and will be recovered via Bi-JCR.

Hand–Eye Calibration: Decades-old closed-form solvers [2, 3, 4] remain the de facto standard because they are fast and provably correct under noise-free, marker-based conditions. Yet in modern labs, the very assumptions they rely on, such as a static checkerboard, are routinely violated. Recent learning-based variants regress the transform directly from images [5], but demand gripper visibility or prior CAD models and often degrade sharply outside the synthetic domain in which they are trained. Other end-to-end learning-based methods bypass calibration entirely, mapping pixels to torques [6, 7], but at the cost of losing an explicit transform that downstream planners and safety monitors still require. Foundation models for calibration are also explored in [8, 9], but have not been extended to the bi-manual setting.

Bi-manual Calibration: Extending single-manipulator hand-eye calibration to the bi-manual setup is not trivial, as it requires finding the pose of the secondary manipulator in the primary manipulator’s frame. Previously, some methods relied on the secondary manipulator holding a checkerboard to perform bi-manual calibration [10, 11]. A graph-based method that uses external markers to calibrate multiple manipulators simultaneously has also been explored in [12].

Scene Representation: Bimanual manipulation requires reasoning about shared workspaces where two end-effectors and several movable objects compete for space. Classical metric maps such as occupancy grids [13] and signed distance fields [14, 15] give fast binary or distance queries for collision checking, yet they discretize space and struggle to capture the fine contact geometry of small parts. Continuous alternatives, such as Gaussian process maps [16], kernel regressors [17, 18], and neural implicit surfaces [19], offer sub-voxel accuracy. Other learning-based methods directly consume point clouds [20, 21] or integrate them into trajectory optimization [22]. In the computer vision community, NeRF models [23, 24, 25] produce photorealistic reconstructions, often at the expense of accurate geometry.

Foundation Models: Large-scale models trained on web corpora, such as LLMs in NLP [26] and multimodal encoders in vision [27], have recently been explored for 3D perception [28, 29, 30, 31]. In robotics, these large pre-trained deep learning models are referred to as “foundation models” and are increasingly used as plug-and-play modules for downstream tasks [32, 33]. In our work, we leverage 3D foundation models as components in our pipeline.


Figure 3: Left: We have cameras (outlined in red) rigidly attached to dual manipulators; Right: With the cameras calibrated, we can bring a 3D reconstruction directly into the frame of the robot, and even inject it into a simulator (visualised in PyBullet [34]).

III Bi-manual Joint Representation and Calibration

The proposed Bi-manual Joint Representation and Calibration (Bi-JCR) framework aims to solve the eye-to-hand calibration problem for both manipulators and, in the process, also recover the relative poses of the robot bases. At the same time, we recover a dense 3D representation of the tabletop scene, which can facilitate downstream manipulation. This is done without relying on any camera pose information from external markers, such as checkerboards or AprilTags [1].

III-A Problem Setup:

We consider two manipulators, with one designated as the primary manipulator and the other as the secondary, each equipped with an end-effector-mounted low-cost RGB camera. A set of objects is arranged on a tabletop within their shared workspace. The rigid transformations from each camera to its corresponding end-effector are unknown and must be estimated. To collect data, we command each manipulator through a sequence of $N$ distinct end-effector poses, capturing an RGB image at each pose. We denote the primary manipulator's poses by $\{E_{1,1}, E_{1,2}, \cdots, E_{1,N}\}$ with $N$ corresponding RGB images $\{I_{1,1}, I_{1,2}, \cdots, I_{1,N}\}$. We denote the $N$ poses for the secondary manipulator as $\{E_{2,1}, E_{2,2}, \cdots, E_{2,N}\}$, with corresponding RGB images $\{I_{2,1}, I_{2,2}, \cdots, I_{2,N}\}$.

Using only these end-effector poses and captured images, Bi-JCR recovers all of the following:

  • The rigid transformation $T^{E_1}_{C_1}$ from Camera 1 to the primary end-effector;

  • The rigid transformation $T^{E_2}_{C_2}$ from Camera 2 to the secondary end-effector;

  • The scale factor $\lambda$ aligning the foundation model's output frame with the real-world metric frame;

  • The pose of the secondary base frame $b_2$ relative to the primary base frame $b_1$, denoted $P^{b_1}_{b_2}$;

  • The transformation $T^{b_1}_{w}$ from the foundation model's output frame $w$ to the primary base frame $b_1$;

  • A metric-scale 3D reconstruction of the scene in the primary base frame $b_1$.

The transformations between the dual manipulators, along with their attached cameras, are illustrated in Fig. 2. Here, only the forward kinematics of each manipulator, i.e., the transformation from its base to its end-effector, is known. The relative pose of the two manipulators is also initially unknown. Bi-JCR solves for all of the unknown transformations.
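As an illustration of the inputs and outputs described above, a minimal sketch of how the collected data and the recovered quantities could be organized is given below; the container and field names are illustrative placeholders rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class ArmCapture:
    """Data collected from one manipulator: N end-effector poses (4x4, in that
    arm's own base frame, from forward kinematics) and the N RGB images
    captured at those poses."""
    ee_poses: List[np.ndarray]
    images: List[np.ndarray]

@dataclass
class BiJCRResult:
    """Quantities recovered by Bi-JCR from the two captures."""
    T_cam_to_ee: List[np.ndarray]  # T^{E_1}_{C_1}, T^{E_2}_{C_2}
    scale: float                   # lambda: foundation-model frame -> metric scale
    T_w_to_b1: np.ndarray          # foundation output frame w -> primary base b1
    P_b2_in_b1: np.ndarray         # secondary base expressed in the primary base frame
    points_b1: np.ndarray          # metric-scale point cloud in frame b1
```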

Figure 4: A 3D foundation model takes a set of RGB images and outputs 3D point sets, camera poses, and confidence maps.

III-B 3D Foundation Models in the Pipeline

The two image sets can be fed into a 3D foundation model to obtain a reconstruction of the scene in an arbitrary coordinate frame and scale. We recover the relative camera poses

$\{P_{1,1},\dots,P_{1,N},\ P_{2,1},\dots,P_{2,N}\}$   (1)

along with corresponding point sets containing the reconstruction,

$\{\hat{X}_{1,1},\dots,\hat{X}_{1,N},\ \hat{X}_{2,1},\dots,\hat{X}_{2,N}\},$   (2)

where each point in a set corresponds to a pixel in the associated input image, and confidence values for each pixel can also be recovered. Pre-trained 3D foundation models are often used to extract structure from images of indoor scenes and building structures, and their application to tabletop scenes has been under-explored. Example outputs from one such model, DUSt3R [28], are given in Fig. 4. Because the foundation model recovers geometry only up to an unknown scale, neither the estimated camera poses nor the aggregated point cloud is expressed in real-world metric units. To resolve this scale ambiguity, we introduce a scale factor $\lambda$ so that the pose of camera $i$ on manipulator $m$ in the scaled foundation output frame $w$ becomes

$P^{w}_{m,i}(\lambda) = \begin{bmatrix} R_{m,i} & \lambda\, t_{m,i} \\ 0 & 1 \end{bmatrix} \in \mathrm{SE}(3),$   (3)

where $i=1,\dots,N$ and the index $m\in\{1,2\}$ denotes the manipulator. In this expression, $R_{m,i}\in\mathrm{SO}(3)$ denotes the rotation and $t_{m,i}\in\mathbb{R}^{3}$ is the translation estimated by the foundation model. Next, defining the transform from the scaled foundation frame $w$ to each manipulator's base frame $b_m$ as $T^{b_m}_{w}$, the camera poses in the real-world base frames are

$P^{b_m}_{m,i}(\lambda) = T^{b_m}_{w}\, P^{w}_{m,i}(\lambda), \quad i=1,\dots,N.$   (4)
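To make these quantities concrete, the following Python/NumPy sketch shows the shape of the expected foundation-model output and the scale application of Eqs. 3-4. The names are illustrative placeholders (they are not DUSt3R's actual API), and the wrapper around the chosen model is deliberately left unimplemented.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class ReconstructionOutput:
    """Per-image outputs of a 3D foundation model, in an arbitrary frame w and scale."""
    poses: List[np.ndarray]       # 4x4 camera poses P_{m,i} with unscaled translations
    point_maps: List[np.ndarray]  # HxWx3 point sets, one 3D point per pixel
    conf_maps: List[np.ndarray]   # HxW per-pixel confidence values

def run_foundation_model(images: List[np.ndarray]) -> ReconstructionOutput:
    """Placeholder: wrap the chosen 3D foundation model (e.g. DUSt3R) here."""
    raise NotImplementedError

def apply_scale(pose_w: np.ndarray, lam: float) -> np.ndarray:
    """P^w_{m,i}(lambda) of Eq. 3: keep the rotation, scale only the translation."""
    scaled = pose_w.copy()
    scaled[:3, 3] *= lam
    return scaled

def to_base_frame(pose_w: np.ndarray, T_w_to_base: np.ndarray, lam: float) -> np.ndarray:
    """Camera pose expressed in a manipulator base frame, Eq. 4."""
    return T_w_to_base @ apply_scale(pose_w, lam)
```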

III-C Solving for Initial Calibration Solution

In Bi-JCR, we seek to simultaneously solve for $\lambda$, $T^{b_1}_{w}$, and $T^{b_2}_{w}$ in the process of solving the bi-manual hand-eye calibration for the two camera-to-end-effector transformations $T^{E_1}_{C_1}$ and $T^{E_2}_{C_2}$. During the sequence of manipulator motions, the relationship between scaled camera poses and end-effector poses for manipulator $m\in\{1,2\}$ can be formulated as the classical hand-eye calibration equations [2]:

$E_{m,i}^{-1} E_{m,i+1}\, T^{E_m}_{C_m} = T^{E_m}_{C_m}\, {P^{b_m}_{m,i}}^{-1}(\lambda)\, P^{b_m}_{m,i+1}(\lambda)$   (5)

Now, we denote the transformations between consecutive poses by

$T^{E_{m,i+1}}_{E_{m,i}} = E_{m,i}^{-1} E_{m,i+1},$   (6)
$T^{P^{w}_{m,i+1}}_{P^{w}_{m,i}}(\lambda) = {P^{w}_{m,i}}^{-1}(\lambda)\, P^{w}_{m,i+1}(\lambda).$   (7)

Then, the hand-eye equations can be formulated as

$T^{E_{m,i+1}}_{E_{m,i}}\, T^{E_m}_{C_m} = T^{E_m}_{C_m}\, T^{P^{w}_{m,i+1}}_{P^{w}_{m,i}}(\lambda).$   (8)

Here we observe that Eq. 8 admits the $AX=XB$ form of the classical hand-eye calibration problem, with the right-hand side dependent on the scale factor $\lambda$.

The first phase of Bi-JCR aims to obtain an initial solution for the scale and the desired transformations, which we denote as $\lambda'$, ${T^{E_1}_{C_1}}'$, and ${T^{E_2}_{C_2}}'$. Since the rotation components lie in $\mathbf{SO}(3)$, we can first solve for the rotation components of the transformations, which are invariant to the scale factor. This can be achieved via linear algebra on the manifold of rotation matrices, following [4]. We map the rotation components of $T^{E_{m,i+1}}_{E_{m,i}}$ and $T^{P^{w}_{m,i+1}}_{P^{w}_{m,i}}$ into the Lie algebra $\mathfrak{so}(3)$ via the logarithm map of $\mathbf{SO}(3)$; for some $R\in\mathbf{SO}(3)$, this gives

$\omega = \arccos\!\left(\tfrac{\mathrm{Tr}(R)-1}{2}\right),$   (9)
$\mathrm{LogMap}(R) := \dfrac{\omega}{2\sin(\omega)} \begin{bmatrix} R_{3,2}-R_{2,3} \\ R_{1,3}-R_{3,1} \\ R_{2,1}-R_{1,2} \end{bmatrix} \in \mathfrak{so}(3),$   (10)

where $\mathrm{Tr}(\cdot)$ denotes the trace operator and the subscripts indicate element indices in $R$. We can then find the best-fit rotational components via:

${R^{E_m}_{C_m}}' = (M_m^{\top} M_m)^{-\frac{1}{2}} M_m^{\top}, \quad \text{where } M_m = \sum_{i=1}^{N-1} \mathrm{LogMap}\bigl(R^{E_{m,i+1}}_{E_{m,i}}\bigr) \otimes \mathrm{LogMap}\bigl(R^{P^{w}_{m,i+1}}_{P^{w}_{m,i}}\bigr).$   (11)

Here, $\otimes$ denotes the outer product, and the matrix inverse square root can be computed efficiently via singular value decomposition.
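As a concrete illustration of this rotation-fitting step, a minimal NumPy sketch of the log map (Eqs. 9-10) and the closed-form best-fit rotation (Eq. 11) might look as follows; it assumes the relative rotations have already been extracted from consecutive pose pairs, and it is a sketch rather than the authors' exact implementation.

```python
import numpy as np

def log_map(R: np.ndarray) -> np.ndarray:
    """Axis-angle vector of a rotation matrix (Eqs. 9-10).

    Only the small-angle case is guarded here; a full implementation would
    also handle rotation angles near pi."""
    omega = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return np.zeros(3)
    return omega / (2.0 * np.sin(omega)) * np.array(
        [R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])

def fit_rotation(R_ee_rel, R_cam_rel) -> np.ndarray:
    """Best-fit hand-eye rotation R^{E_m}_{C_m} from relative end-effector and
    relative camera rotations (Eq. 11)."""
    M = np.zeros((3, 3))
    for Re, Rc in zip(R_ee_rel, R_cam_rel):
        M += np.outer(log_map(Re), log_map(Rc))
    # (M^T M)^{-1/2} via SVD, then multiply by M^T.
    U, S, Vt = np.linalg.svd(M.T @ M)
    inv_sqrt = U @ np.diag(1.0 / np.sqrt(S)) @ Vt
    return inv_sqrt @ M.T
```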

Next, we solve for the translation components jointly with the scale factor, by minimizing, for each arm, the residuals of a scale recovery problem similar to that formulated in [8], under the assumption that the scale factor is shared between the two arms:

SRP: $\arg\min_{{t^{E_m}_{C_m}}',\, \lambda'} \sum_{i=1}^{N-1} \bigl\lVert Q_i\, {t^{E_m}_{C_m}}' - \mathbf{d}_i(\lambda') \bigr\rVert_2^2,$   (12)
where $Q_i = I - R^{E_{m,i+1}}_{E_{m,i}},$   (13)
and $\mathbf{d}_i(\lambda') = t^{E_{m,i+1}}_{E_{m,i}} - {R^{E_m}_{C_m}}'\, \lambda'\, t^{P^{w}_{m,i+1}}_{P^{w}_{m,i}}.$   (14)

Equation 12 can be solved via least-squares, and we obtain our solutions $\lambda'$, ${T^{E_1}_{C_1}}'$ and ${T^{E_2}_{C_2}}'$, which can be further refined via gradient-based optimisation.
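A compact sketch of how Eq. 12 can be assembled and solved as one linear least-squares system over both arms, with the unknown vector $[{t^{E_1}_{C_1}}'; {t^{E_2}_{C_2}}'; \lambda']$, is given below; the variable names and the stacking scheme are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def solve_translations_and_scale(R_ee_rel, t_ee_rel, t_cam_rel, R_handeye):
    """Jointly solve for t^{E_1}_{C_1}, t^{E_2}_{C_2} and the shared scale lambda'.

    Per-arm lists (m = 0, 1): R_ee_rel[m][i], t_ee_rel[m][i] are relative
    end-effector rotations/translations; t_cam_rel[m][i] the unscaled relative
    camera translations; R_handeye[m] the rotation solved in Eq. 11.
    """
    rows, rhs = [], []
    for m in range(2):
        for Re, te, tc in zip(R_ee_rel[m], t_ee_rel[m], t_cam_rel[m]):
            Q = np.eye(3) - Re                      # Eq. 13
            block = np.zeros((3, 7))
            block[:, 3 * m:3 * m + 3] = Q           # unknown translation of arm m
            block[:, 6] = R_handeye[m] @ tc         # lambda' scales the camera translation
            rows.append(block)
            rhs.append(te)                          # Eq. 14, rearranged
    x, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(rhs), rcond=None)
    return x[0:3], x[3:6], x[6]
```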

III-D Refine Calibration through Gradient-based Optimization

We further refine the solutions via gradient descent to improve estimation accuracy. Here, we first rearrange Equation 8 into

$T^{E_{m,i+1}}_{E_{m,i}}\, T^{E_m}_{C_m} - T^{E_m}_{C_m}\, T^{P^{w}_{m,i+1}}_{P^{w}_{m,i}} = 0, \quad \text{for } i\in\{1,\cdots,N-1\},\ m\in\{1,2\}.$   (15)

We can then solve the calibration problem by minimizing the difference between the transformation matrices $T^{E_{m,i+1}}_{E_{m,i}} T^{E_m}_{C_m}$ and $T^{E_m}_{C_m} T^{P^{w}_{m,i+1}}_{P^{w}_{m,i}}$. Specifically, we define a cost function as

$\ell(\lambda, T^{E_1}_{C_1}, T^{E_2}_{C_2}) = \sum_{m\in\{1,2\}} \frac{1}{N-1} \sum_{i=1}^{N-1} \bigl( \alpha\, D_R(T^{E_m}_{C_m}) + (1-\alpha)\, D_t(\lambda, T^{E_m}_{C_m}) \bigr),$   (16)

where $D_R(T^{E_m}_{C_m}) = \arccos\!\bigl(\tfrac{\mathrm{tr}({R^{E}_{m,i}}^{\top} R^{C}_{m,i})-1}{2}\bigr)$ and $D_t(\lambda, T^{E_m}_{C_m}) = \bigl\lVert t^{E}_{m,i} - t^{C}_{m,i} \bigr\rVert_2$, with $R^{E}_{m,i}, t^{E}_{m,i}$ and $R^{C}_{m,i}, t^{C}_{m,i}$ denoting the rotation and translation components of the left- and right-hand sides of Eq. 15, respectively,

Here, $\mathrm{tr}(\cdot)$ is the trace operator. We can then minimize the cost function via gradient descent [35] while constraining the rotations to remain on the $\mathbf{SO}(3)$ manifold, using the results from the previous section as the initial solution. To keep the rotational components in $\mathbf{SO}(3)$ during the backpropagation process, we follow [36]: we first pull each rotation into the Lie algebra with the logarithm map, then perform the gradient update on the resulting axis-angle vector in $\mathbb{R}^3$, and push it back onto the manifold via the exponential map. With the solutions that minimize the cost function, we can then obtain the world-to-base transformations $T^{b_1}_{w}$ and $T^{b_2}_{w}$ via

$T^{b_m}_{w} = \operatorname*{AVG_{SE3}}_{i\in\{1,\cdots,N-1\}} \Bigl( T^{E_{m,i+1}}_{E_{m,i}}\, \bigl(T^{E_m}_{C_m} T^{P^{w}_{m,i+1}}_{P^{w}_{m,i}}\bigr)^{-1} \Bigr)$   (17)

where $\operatorname*{AVG_{SE3}}$ is the average over a set of transformation matrices on $\mathbf{SE}(3)$, computed by averaging the rotation and translation components separately.
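The sketch below illustrates one way this refinement and the averaging could be realized: each hand-eye rotation is parameterized by an axis-angle vector, re-materialized with the exponential map at every step, and a simplified version of the cost in Eq. 16 is minimized with a generic optimizer (Adam is an assumption here); the AVG_SE3 helper uses an arithmetic mean of translations and an SVD-projected mean of rotations. It is a sketch under those assumptions, not the authors' released code.

```python
import numpy as np
import torch

def hat(w: torch.Tensor) -> torch.Tensor:
    """Skew-symmetric matrix of a 3-vector (kept differentiable)."""
    zero = torch.zeros((), dtype=w.dtype)
    return torch.stack([torch.stack([zero, -w[2], w[1]]),
                        torch.stack([w[2], zero, -w[0]]),
                        torch.stack([-w[1], w[0], zero])])

def exp_map(w: torch.Tensor) -> torch.Tensor:
    """Exponential map so(3) -> SO(3) via the Rodrigues formula."""
    theta = torch.linalg.norm(w) + 1e-12
    K = hat(w / theta)
    return torch.eye(3, dtype=w.dtype) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def refine(ee_rel, cam_rel, w0, t0, lam0, alpha=0.5, iters=500, lr=1e-2):
    """Refine both hand-eye transforms and the shared scale (cf. Eqs. 15-16).

    ee_rel[m][i], cam_rel[m][i]: 4x4 relative end-effector / (unscaled) camera
    transforms for arm m; w0[m], t0[m]: initial axis-angle and translation from
    the closed-form stage; lam0: initial scale estimate."""
    w = [torch.tensor(x, dtype=torch.float64, requires_grad=True) for x in w0]
    t = [torch.tensor(x, dtype=torch.float64, requires_grad=True) for x in t0]
    lam = torch.tensor(float(lam0), dtype=torch.float64, requires_grad=True)
    opt = torch.optim.Adam(w + t + [lam], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = torch.zeros((), dtype=torch.float64)
        for m in range(2):
            for A_np, B_np in zip(ee_rel[m], cam_rel[m]):
                A = torch.as_tensor(A_np, dtype=torch.float64)
                B = torch.as_tensor(B_np, dtype=torch.float64)
                R_x = exp_map(w[m])
                # Left side of Eq. 15: T^E_rel T_X ; right side: T_X T^P_rel(lambda).
                R_l, t_l = A[:3, :3] @ R_x, A[:3, :3] @ t[m] + A[:3, 3]
                R_r, t_r = R_x @ B[:3, :3], R_x @ (lam * B[:3, 3]) + t[m]
                cos = ((R_l.T @ R_r).diagonal().sum() - 1.0) / 2.0
                D_R = torch.arccos(torch.clamp(cos, -1 + 1e-7, 1 - 1e-7))
                loss = loss + alpha * D_R + (1 - alpha) * torch.linalg.norm(t_l - t_r)
        loss.backward()
        opt.step()
    return ([exp_map(wi).detach().numpy() for wi in w],
            [ti.detach().numpy() for ti in t], lam.item())

def avg_se3(transforms):
    """AVG_SE3 of Eq. 17: mean translation, SVD-projected mean rotation."""
    Ts = np.stack(transforms)
    U, _, Vt = np.linalg.svd(Ts[:, :3, :3].sum(axis=0))
    R_mean = U @ Vt
    if np.linalg.det(R_mean) < 0:       # enforce a proper rotation
        U[:, -1] *= -1
        R_mean = U @ Vt
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R_mean, Ts[:, :3, 3].mean(axis=0)
    return T
```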

III-E Obtaining Metric-Scale 3D Representation

Here, we seek to build a real-world, metric-scale 3D representation of the environment in the primary manipulator's base frame. Following Equation 3, we first scale the camera poses by $\lambda$ to get $\{P^{w}_{1,1},\cdots,P^{w}_{1,N}\}$ and $\{P^{w}_{2,1},\cdots,P^{w}_{2,N}\}$. We can then obtain the calibrated metric-scale camera poses by

$P^{b_1}_{m,i} = T^{b_1}_{w} P^{w}_{m,i}, \quad \text{for } m\in\{1,2\},\ i\in\{1,\cdots,N\}.$   (18)

Next, we also use the point sets from each arm associated with each input image, $\{\hat{X}_{1,1},\cdots,\hat{X}_{1,N}\}$ and $\{\hat{X}_{2,1},\cdots,\hat{X}_{2,N}\}$, together with their associated confidence maps $\{C_{1,1},\cdots,C_{1,N}\}$ and $\{C_{2,1},\cdots,C_{2,N}\}$ from the output of the foundation model, to recover a rich and high-quality representation of the environment. We first apply a confidence threshold to filter out points below this threshold in each $\hat{X}_{m,i}$. Then, we transform the points from the filtered point sets, $\{\mathbf{x}_i\}_{i=1}^{N_{pc}}$, to obtain a point cloud at real-world metric scale in the primary manipulator's base frame, $\{\mathbf{x}^{b_1}_i\}_{i=1}^{N_{pc}}$, through

$\{\mathbf{x}^{b_1}_{i} = T^{b_1}_{w}(\lambda \mathbf{x}_{i}), \ \text{for } i\in\{1,\cdots,N_{pc}\}\}.$   (19)

Furthermore, the pose of the secondary manipulator’s base in the primary manipulator’s base frame can be computed as

$P^{b_1}_{b_2} = T^{b_1}_{w}\, {T^{b_2}_{w}}^{-1}.$   (20)

The pose of the secondary manipulator's base in the primary manipulator's base frame enables us to compute end-effector poses for downstream tasks of both manipulators in a single unified frame. This facilitates downstream processes such as object segmentation and grasp generation.
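A short sketch of this final assembly step, confidence filtering, scaling by $\lambda$, and mapping into the primary base frame (Eqs. 18-20), is given below; the confidence threshold value is an arbitrary placeholder and the array layouts are assumptions.

```python
import numpy as np

def fuse_pointcloud(point_sets, conf_maps, lam, T_w_b1, conf_thresh=3.0):
    """Filter foundation-model points by confidence, scale them by lambda, and
    express them in the primary base frame b1 (Eqs. 18-19)."""
    fused = []
    for X, C in zip(point_sets, conf_maps):
        pts = X.reshape(-1, 3)[C.reshape(-1) > conf_thresh]     # drop low-confidence points
        pts_h = np.hstack([lam * pts, np.ones((len(pts), 1))])  # scale, then homogenize
        fused.append((T_w_b1 @ pts_h.T).T[:, :3])
    return np.concatenate(fused, axis=0)

def base_to_base(T_w_b1, T_w_b2):
    """Pose of the secondary base in the primary base frame (Eq. 20)."""
    return T_w_b1 @ np.linalg.inv(T_w_b2)
```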

| Method | Metric | Scene A, 4 img | Scene A, 7 img | Scene A, 9 img | Scene B, 4 img | Scene B, 7 img | Scene B, 9 img | Scene C, 4 img | Scene C, 7 img | Scene C, 9 img |
|---|---|---|---|---|---|---|---|---|---|---|
| Bi-JCR (Ours) | Converged | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Residual δR | 0.0769 | 0.0724 | 0.0668 | 0.0740 | 0.0617 | 0.0569 | 0.0743 | 0.0634 | 0.0612 |
| | Residual δt | 0.0461 | 0.0391 | 0.0378 | 0.0587 | 0.0351 | 0.0340 | 0.0424 | 0.0341 | 0.0373 |
| | No. of Poses (Left) | 4 | 7 | 9 | 4 | 7 | 9 | 4 | 7 | 9 |
| | No. of Poses (Right) | 4 | 7 | 9 | 4 | 7 | 9 | 4 | 7 | 9 |
| COLMAP [37] + Calibration | Converged | × | × | × | ✓ | ✓ | ✓ | × | ✓ | ✓ |
| | Residual δR | NA | NA | NA | 0.0865 | 0.0583 | 0.0584 | NA | 0.0591 | 0.0586 |
| | Residual δt | NA | NA | NA | 0.0317 | 0.0269 | 0.0274 | NA | 0.0277 | 0.0271 |
| | No. of Poses (Left) | 0 | 4 | 2 | 4 | 7 | 9 | 0 | 7 | 9 |
| | No. of Poses (Right) | 4 | 0 | 0 | 4 | 7 | 9 | 4 | 7 | 9 |
| Ray Diffusion [38] + Calibration | Residual δR | 0.4845 | 0.4288 | 0.7951 | 0.4948 | 0.3839 | 0.3634 | 0.4504 | 0.2610 | 0.2223 |
| | Residual δt | 0.2118 | 0.1416 | 0.1309 | 0.2067 | 0.2051 | 0.1637 | 0.2125 | 0.1858 | 0.1760 |
| | No. of Poses (Left) | 4 | 7 | 9 | 4 | 7 | 9 | 4 | 7 | 9 |
| | No. of Poses (Right) | 4 | 7 | 9 | 4 | 7 | 9 | 4 | 7 | 9 |

TABLE I: Quantitative evaluation of Bi-JCR's calibration residual errors ($\delta_R$ and $\delta_t$) against baseline methods. Columns denote the scene and the number of images per manipulator. Lower residuals indicate more accurate calibrations.

IV Empirical Evaluation

In this section, we rigorously evaluate our proposed Bi-Manual Joint Calibration and Representation (Bi-JCR) method. Our bi-manual setup consists of two AgileX Piper 6 degree-of-freedom manipulators, each with a low-cost USB webcam mounted on the gripper. We seek to answer the following questions:

  1. Can Bi-JCR produce correct hand-eye calibration for both arms, even when the number of images provided is low?

  2. Can Bi-JCR recover the scale accurately, such that our representation's sizes match the physical world?

  3. Can high-quality 3D environment representations, in the correct coordinate frame, be built?

  4. How do different 3D foundation models change the calibration accuracy?

  5. Does refinement via additional gradient descent improve calibration accuracy?

  6. Does Bi-JCR facilitate downstream bi-manual manipulation tasks?

IV-A Eyes-to-Hands Calibration with Bi-JCR

Baselines: COLMAP-based pose estimation and Ray Diffusion. To assess the calibration quality of Bi-JCR, we compare against two alternatives. First, in the absence of high-contrast markers such as checkerboards or AprilTags [1], we use SfM via COLMAP [37] and apply the Park–Martin algorithm [4] to compute eye-to-hand transformations for each manipulator. Second, we evaluate Ray Diffusion [38], a sparse-view diffusion model trained on large datasets [39] that directly regresses camera poses.

Task and Metrics: We take images in three different environments: a scene under darker lighting conditions with 9 items (scene A), and two scenes under brighter lighting conditions with different sets of 9 and 10 items, respectively (scenes B and C). Bi-JCR and the two baseline methods are evaluated with an increasing number of input images (4, 7, and 9 images per manipulator). We then check whether the calibration has converged correctly by computing residual losses via Equation 15, with ground truth obtained via AprilTags [1] on the right-hand side. Lower residual values indicate a higher degree of consistency.
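For reference, a minimal sketch of one way such residuals $\delta_R$ and $\delta_t$ could be computed from Eq. 15 is shown below, assuming the ground-truth relative camera motions from the AprilTag calibration are available as 4x4 matrices; averaging over pose pairs is an assumption, as the exact aggregation is not spelled out here.

```python
import numpy as np

def calibration_residuals(ee_rel, cam_rel_gt, T_handeye):
    """Mean rotational (radians) and translational residuals of Eq. 15, with
    ground-truth relative camera motions on the right-hand side."""
    d_rot, d_trans = [], []
    for A, B in zip(ee_rel, cam_rel_gt):
        left = A @ T_handeye            # T^{E_{m,i+1}}_{E_{m,i}} T^{E_m}_{C_m}
        right = T_handeye @ B           # T^{E_m}_{C_m} T^{P^w_{m,i+1}}_{P^w_{m,i}}
        cos = (np.trace(left[:3, :3].T @ right[:3, :3]) - 1.0) / 2.0
        d_rot.append(np.arccos(np.clip(cos, -1.0, 1.0)))
        d_trans.append(np.linalg.norm(left[:3, 3] - right[:3, 3]))
    return float(np.mean(d_rot)), float(np.mean(d_trans))
```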

Results: We tabulate our results in Table I. We observe that COLMAP often results in diverged calibration, since images for which correspondences cannot be found are discarded. Whether the calibration converged, and the number of images used from each manipulator, are also shown in Table I. Although Ray Diffusion registered all images, it registered them inconsistently in this tabletop setup of cluttered, partially visible objects, causing the calibration optimizer to accumulate large errors. Our Bi-JCR method consistently produces smaller residuals for both lower and higher numbers of views on both manipulators, showing remarkable image efficiency. We also observe that Bi-JCR's residual loss decreases as the number of images increases, in contrast to the fluctuating residuals of the two baseline methods, showing a reliable precision gain with an increasing number of views.

Figure 5: Visualization of the objects used for scale validation in Table II: (a) Spoon, (b) Tea Lid, (c) Tape, (d) Battery, (e) Toolbox, (f) Joystick. For each object, the top image shows its real-world appearance and the bottom shows its reconstruction. The blue and yellow dots mark the length measured for scale validation.

| Object | Spoon | Tea Lid | Tape | Battery | Toolbox | Joystick |
|---|---|---|---|---|---|---|
| 5 Images | 4.90% | 4.3% | 2.73% | 1.59% | 10.90% | 7.73% |
| 8 Images | 1.61% | 0.26% | 1.34% | 1.10% | 2.53% | 2.98% |

TABLE II: Percentage error between the dimensions of real-world objects and their reconstructions. With 8 images per manipulator, the percentage errors of reconstructed sizes are at most 2.98%, indicating accurate scale recovery.

Figure 6: Qualitative evaluation of the camera calibration in the three scenes. We observe that the cameras, indicated as cones, are aligned with the end-effector poses, indicating accurate calibration. The end-effector and camera poses of the left manipulator are colored in shades of red and those of the right in blue.

Figure 7: Qualitative evaluation of the recovered 3D reconstructions of scene A, scene B, and scene C, from left to right. The reconstructions are dense and geometrically accurate.

Here, we also visualize the aligned camera and end-effector poses after calibration via Bi-JCR in Figure 6. The end-effectors are outlined as U-shapes and the cameras are represented by cones. Both end-effector poses and camera poses are transformed into the primary manipulator's base frame using the base-to-base transformation estimated by Bi-JCR. The primary manipulator's end-effector poses and its eye-to-hand-transformed camera poses are colored in red, and the secondary manipulator's in blue. As shown in Figure 6, Bi-JCR successfully recovers eye-to-hand transformations that consistently align camera and end-effector poses for both the primary and secondary manipulator across all scenes.

IV-B Accurate Metric Scale Recovery with Bi-JCR

Bi-JCR also reconstructs a dense 3D point cloud of the real-world environment at metric scale. Here, we evaluate the accuracy of scale recovery by comparing the side lengths of real-world objects against those of their reconstructions, using 5 and 8 images collected from each manipulator, as shown in Figure 5. The error, computed by

\[
\operatorname{err}_{\mathrm{obj}} = \frac{|s_{\text{reconstructed}} - s_{\text{real world}}|}{s_{\text{real world}}},
\tag{21}
\]

is reported in Table II. With only 8 images per manipulator, Bi-JCR reduces the error to a median of 1.48% and a maximum of 2.98%, marking precise recovery of real-world metric scale.
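A minimal sketch of the measurement behind Equation 21: pick the two annotated points on the reconstructed object (the blue and yellow dots in Figure 5), compute the reconstructed length, and compare it against the hand-measured real-world length. The point indices and the 0.185 m length below are made-up example values.

```python
import numpy as np

def scale_error(points, idx_a, idx_b, real_length_m):
    """Percentage error between a reconstructed length and its real-world counterpart."""
    s_rec = np.linalg.norm(points[idx_a] - points[idx_b])
    return abs(s_rec - real_length_m) / real_length_m

# Example with placeholder data: a random stand-in cloud and a hypothetical 0.185 m edge.
points = np.random.rand(1000, 3)
print(f"scale error: {scale_error(points, 10, 500, 0.185) * 100:.2f}%")
```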

                          Darker Light Condition   Brighter Light Condition   Brighter Light Condition
                          (9 items)                (8 items)                  (7 items)
Images per Manipulator    4       7       9        4       7       9          4       7       9
DUSt3R [28] + Bi-JCR
  Residual δ_R            0.0769  0.0724  0.0668   0.0740  0.0617  0.0569     0.0743  0.0634  0.0612
  Residual δ_t            0.0461  0.0391  0.0378   0.0587  0.0351  0.0340     0.0424  0.0341  0.0373
MASt3R [30] + Bi-JCR
  Residual δ_R            0.2613  0.2404  0.2020   0.2636  0.1541  0.1302     0.2350  0.1482  0.1312
  Residual δ_t            0.1639  0.1378  0.1219   0.1538  0.1012  0.0811     0.1514  0.1009  0.0896
VGGT [31] + Bi-JCR
  Residual δ_R            0.3763  0.3728  0.5358   0.3793  0.3561  0.5186     0.5274  0.3492  0.3524
  Residual δ_t            0.1241  0.1255  0.1438   0.1290  0.1676  0.1818     0.2448  0.1692  0.1649

TABLE III: Quantitative results on selecting the best foundation model for Bi-JCR. The foundation model used by Bi-JCR, DUSt3R [28], is compared against MASt3R [30] and VGGT [31] in terms of the rotational and translational residual losses from Equation 15. In all three scenarios, DUSt3R outperforms both MASt3R and VGGT for every number of views per manipulator.

Figure 8: Visualization of the recovered base-to-base transformation and of the recovered transformation from the foundation model’s output frame to the primary manipulator’s base frame, for scene A (top left), scene B (top right), and scene C (bottom left), alongside the real-world bi-manual base setup (bottom right).

IV-C 3D Representation in Primary Manipulator’s Base Frame

Besides scale, the quality of the 3D representation built by Bi-JCR is critical to downstream tasks. Here, we qualitatively assess the reconstructed 3D point cloud by visualizing it against images of the real-world environment in Figure 7. We observe that Bi-JCR correctly reconstructs the relative positions and orientations of objects in the environment, and that the shape of each object is well preserved. We further investigate whether the representation can be accurately transformed into the robot’s coordinate frame, and whether the secondary manipulator’s base is correctly placed in the primary manipulator’s base frame. We inject the reconstruction, along with the manipulator poses, into the PyBullet simulator [34]. As shown in Figure 8, the relative pose of the two manipulators closely resembles that of the two manipulators in the real world, indicating a correct estimate of the secondary manipulator’s pose in the base frame of the primary manipulator. The reconstructed table remains parallel to the manipulator bases in simulation, and object orientations and positions are visually correct relative to the bases, indicating both a high-quality 3D reconstruction and its accurate transformation into the primary manipulator’s frame.
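As a sketch of how such a fused result can be injected into PyBullet [34], one can place the secondary arm using the recovered base-to-base transform and overlay the metric point cloud as debug points. The file names and URDFs below are placeholders, and addUserDebugPoints assumes a recent PyBullet release.

```python
import numpy as np
import pybullet as p
from scipy.spatial.transform import Rotation as R

# Placeholder inputs: the metric point cloud (in the primary base frame) and the
# base-to-base transform estimated by Bi-JCR, both stored as numpy arrays.
points = np.load("scene_points_base1.npy")   # (N, 3)
colors = np.load("scene_colors.npy")         # (N, 3), RGB in [0, 1]
T_b1_b2 = np.load("T_base1_base2.npy")       # (4, 4)

p.connect(p.GUI)

# Primary manipulator sits at the origin of its own base frame.
p.loadURDF("primary_arm.urdf", basePosition=[0, 0, 0], useFixedBase=True)

# Secondary manipulator is placed with the recovered base-to-base transform.
pos2 = T_b1_b2[:3, 3].tolist()
quat2 = R.from_matrix(T_b1_b2[:3, :3]).as_quat().tolist()  # (x, y, z, w), PyBullet convention
p.loadURDF("secondary_arm.urdf", basePosition=pos2, baseOrientation=quat2, useFixedBase=True)

# Overlay the reconstruction for visual inspection.
p.addUserDebugPoints(points.tolist(), colors.tolist(), pointSize=2)
```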

IV-D Ablation Study on Different Foundation Models

With the recent development of 3D reconstruction via structure from motion (SfM), many new foundation models have appeared that outperform DUSt3R [28], the foundation model chosen here, on various task benchmarks, such as MASt3R [30] and VGGT [31]. We therefore evaluate the performance of Bi-JCR with different foundation models at 4, 7, and 9 views per manipulator, using the complete mode to find correspondences between all pairs of views for MASt3R, and report the residual losses in Table III. Both MASt3R and DUSt3R attain lower residual losses than VGGT at higher numbers of views per manipulator, likely because MASt3R and DUSt3R use an ICP process to retain consistency in camera pose estimation. Furthermore, DUSt3R produces camera poses with the best overall residual calibration loss, so DUSt3R remains the primary choice for Bi-JCR.
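For reference, the sketch below shows how unscaled camera poses and point maps can be obtained from DUSt3R, following its publicly documented interface; the checkpoint name, image paths, and optimizer settings are assumptions and may differ from our actual configuration.

```python
import torch
from dust3r.model import AsymmetricCroCo3DStereo
from dust3r.utils.image import load_images
from dust3r.image_pairs import make_pairs
from dust3r.inference import inference
from dust3r.cloud_opt import global_aligner, GlobalAlignerMode

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AsymmetricCroCo3DStereo.from_pretrained(
    "naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt").to(device)

# Images captured by the wrist cameras of both manipulators (paths are placeholders).
imgs = load_images(["arm1_view0.png", "arm1_view1.png",
                    "arm2_view0.png", "arm2_view1.png"], size=512)

# "complete" scene graph: correspondences between all pairs of views.
pairs = make_pairs(imgs, scene_graph="complete", prefilter=None, symmetrize=True)
out = inference(pairs, model, device, batch_size=1)

# Global alignment yields per-view camera poses and point maps up to an unknown scale.
scene = global_aligner(out, device=device, mode=GlobalAlignerMode.PointCloudOptimizer)
scene.compute_global_alignment(init="mst", niter=300, schedule="cosine", lr=0.01)
unscaled_poses = scene.get_im_poses()  # camera-to-world poses, up to scale
pointmaps = scene.get_pts3d()          # dense 3D points per view, up to scale
```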

IV-E Ablation Study on Bi-JCR’s Gradient Descent

Bi-JCR leverages gradient descent to further refine initial solutions. We experimentally evaluate the benefit of this refinement by comparing the residual losses of Bi-JCR with and without refinement via gradient descent [40]. As shown in Table IV, gradient descent refinement yields a marked improvement in rotational loss with fewer views. When the number of views per manipulator is doubled from 4 to 8, Bi-JCR with gradient descent refinement still outperforms Bi-JCR without it. We observe that gradient descent refinement is a critical component of Bi-JCR under sparse-view setups, but becomes less impactful as the number of images per manipulator increases.
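To illustrate the refinement step, the sketch below performs gradient descent on a simplified single-arm variant of the objective: it refines a hand-eye transform X and a scale s so that relative end-effector motions A_i and scaled relative camera motions B_i satisfy A_i X ≈ X B_i. This is only a stand-in for Bi-JCR’s full joint objective (which also couples the two arms and the base-to-base transform), and the SO(3) exponential-map parameterization is one possible choice.

```python
import torch

def hat(w):
    # so(3) hat operator: maps a 3-vector to a skew-symmetric matrix.
    O = torch.zeros((), dtype=w.dtype)
    return torch.stack([
        torch.stack([O, -w[2], w[1]]),
        torch.stack([w[2], O, -w[0]]),
        torch.stack([-w[1], w[0], O]),
    ])

def refine_hand_eye(A_list, B_list, iters=500, lr=1e-2):
    """A_list, B_list: lists of 4x4 torch tensors (relative end-effector / camera motions)."""
    w = torch.zeros(3, requires_grad=True)       # axis-angle rotation of X
    t = torch.zeros(3, requires_grad=True)       # translation of X
    log_s = torch.zeros(1, requires_grad=True)   # log of the scale factor, keeps s > 0
    opt = torch.optim.Adam([w, t, log_s], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        R_x = torch.linalg.matrix_exp(hat(w))    # exponential map keeps R_x on SO(3)
        s = log_s.exp()
        loss = 0.0
        for A, B in zip(A_list, B_list):
            R_a, t_a = A[:3, :3], A[:3, 3]
            R_b, t_b = B[:3, :3], B[:3, 3]
            dR = R_a @ R_x - R_x @ R_b                      # rotational consistency residual
            dt = R_a @ t + t_a - (R_x @ (s * t_b) + t)      # translational consistency residual
            loss = loss + (dR ** 2).sum() + (dt ** 2).sum()
        loss.backward()
        opt.step()
    with torch.no_grad():
        X = torch.eye(4)
        X[:3, :3] = torch.linalg.matrix_exp(hat(w))
        X[:3, 3] = t
    return X, log_s.exp().item()
```

In practice the optimization is warm-started from the closed-form initial solution rather than from the identity used above.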

                          Darker Condition    Lighter Condition
Images per Manipulator    4        8          4        8
Bi-JCR w/ GD
  Residual δ_R            0.0785   0.0698     0.0743   0.0619
  Residual δ_t            0.0478   0.0390     0.0424   0.0372
Bi-JCR w/o GD
  Residual δ_R            0.0920   0.0704     0.0927   0.0628
  Residual δ_t            0.0477   0.0391     0.0427   0.0374

TABLE IV: Quantitative results on the effect of gradient descent (GD) [35] in Bi-JCR. Under sparse-image conditions, GD significantly improves the rotational residual error, and Bi-JCR with GD also slightly outperforms Bi-JCR without GD when a larger number of images is used.
Figure 9: Visualization of running the segmentation algorithm on the real-world, metric-scale 3D representation reconstructed by Bi-JCR: (a) the representation in the robot’s frame; (b) the segmented tabletop. This enables various bi-manual downstream tasks such as joint grasping and passing.

Figure 10: Execution of bi-manual joint grasping on heavier objects in the scene: (a) heavier objects are selected from the scene and segmented out; (b) bi-manual grasps computed from the representation produced by Bi-JCR are executed in the real world. Background blurred for greater clarity.

Figure 11: Top: selected objects (wrench, spoon, balance meter, and tape) after segmentation. Bottom: generated robot end-effector grasping poses for manipulator hand-overs.

Figure 12: We demonstrate passing of the objects: wrench, spoon, balance meter, and tape, from top to bottom.

IV-F Bi-JCR Enables Bi-manual Downstream Tasks

To assess both the relative‐pose estimation between the two manipulators’ bases and the fidelity of the reconstructed 3D scene, we conduct two real‐world bimanual tasks: (1) joint grasping of large objects and (2) passing of small objects. We begin by running Bi‐JCR to recover the base‐to‐base transformation and reconstruct the 3D environment in simulation (Fig. 9(a)). Next, we apply the 3D point‐cloud segmentation algorithm from [41] to isolate each object on the tabletop (Fig. 9(b)).

Joint grasping: We first select the clusters corresponding to the large objects: a toolbox, a controller, a battery pack, and a box, as illustrated in Fig. 10. A grasp pose for the primary manipulator is computed in the primary manipulator’s base frame using [42], and likewise for the secondary manipulator. The secondary grasp must then be transformed into the secondary manipulator’s base frame. Finally, both end‐effector poses are executed via inverse kinematics and joint control to perform the coordinated grasp.
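Since both grasp poses are computed in the frame of the shared reconstruction, mapping the secondary grasp into the secondary manipulator’s base frame reduces to one rigid-transform composition. A minimal sketch (variable names ours) is shown below, with an optional conversion to the position-and-quaternion form expected by typical IK solvers.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def grasp_in_secondary_base(T_b1_grasp, T_b1_b2):
    """Map a grasp pose from the primary base frame into the secondary base frame."""
    return np.linalg.inv(T_b1_b2) @ T_b1_grasp

def to_pos_quat(T):
    """Convert a 4x4 homogeneous transform into (position, quaternion (x, y, z, w))."""
    return T[:3, 3], R.from_matrix(T[:3, :3]).as_quat()
```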

Bimanual passing: We then focus on the small objects: a wrench, a spoon, a balance meter, and a tape, shown in Fig. 11. After choosing a target transfer location in the primary manipulator’s base frame, we translate the segmented point cloud to that location and generate precise poses for both arms to enable object passing. Again, the secondary end‐effector pose is reprojected into the secondary manipulator’s base frame. The primary manipulator loads the object onto its gripper and then successfully hands it off to the secondary manipulator at the specified location (Fig. 12). The successful passing of the objects highlights that the transformation between the local coordinate frames of the two robots is accurately recovered. These experiments demonstrate that Bi‐JCR reliably enables complex bimanual operations by accurately calibrating the cameras on both manipulators and constructing a high‐quality dense 3D reconstruction.

IV-G Summary of Empirical Results

From our empirical experiments, we demonstrate that:

  1. Bi-JCR’s calibration produces highly accurate eye-to-hand transformations.
  2. The recovered scale factor successfully converts the foundation model’s output into real-world metric units.
  3. Bi-JCR generates a high-quality dense 3D point cloud with correct object geometry that can be correctly transformed into the robot’s frame.
  4. Among 3D foundation models, DUSt3R [28] consistently delivers the best performance for Bi-JCR.
  5. Gradient-descent refinement is effective under sparse-view conditions, although its marginal benefit diminishes as view density increases.
  6. The accurate transformations and dense 3D representations produced by Bi-JCR enable various downstream bimanual tasks, including coordinated grasping and object transfer between manipulators.

V Conclusion and Future work

We introduced Bi-JCR, a unified framework for joint calibration and 3D representation in bimanual robotic systems with wrist-mounted cameras. Leveraging large 3D foundation models, Bi-JCR removes the need for calibration markers and simultaneously estimates camera extrinsics, inter-arm relative poses, and a shared, metric-consistent scene representation. Our approach unifies the calibration and perception processes using only RGB images, and facilitates downstream bi-manual tasks such as grasping and object handover. Extensive real-world evaluations demonstrate Bi-JCR’s performance across diverse environments. Future work will leverage confidence masks from 3D foundation models to actively guide novel image collection, continuously complete the reconstructed scene, refine calibration, and then generate motion [43, 44].

References

  • [1] E. Olson, “Apriltag: A robust and flexible visual fiducial system,” in IEEE International Conference on Robotics and Automation, 2011.
  • [2] R. Y. Tsai and R. K. Lenz, “A new technique for fully autonomous and efficient 3d robotics hand/eye calibration,” IEEE Trans. Robotics Autom., 1988.
  • [3] R. Horaud and F. Dornaika, “Hand-eye calibration,” I. J. Robotic Res., 1995.
  • [4] F. Park and B. Martin, “Robot Sensor Calibration: Solving AX = XB on the Euclidean Group,” IEEE Transactions on Robotics and Automation, 1994.
  • [5] E. Valassakis, K. Dreczkowski, and E. Johns, “Learning eye-in-hand camera calibration from a single image,” in Proceedings of the 5th Conference on Robot Learning, 2022.
  • [6] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, “Decision transformer: Reinforcement learning via sequence modeling,” in Advances in Neural Information Processing Systems, 2021.
  • [7] P. Florence, C. Lynch, A. Zeng, O. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mordatch, and J. Tompson, “Implicit behavioral cloning,” in Conference on Robot Learning (CoRL), 2021.
  • [8] W. Zhi, H. Tang, T. Zhang, and M. Johnson-Roberson, “Unifying representation and calibration with 3d foundation models,” IEEE Robotics and Automation Letters, 2024.
  • [9] W. Zhi, H. Tang, T. Zhang, and M. Johnson-Roberson, “3d foundation models enable simultaneous geometry and pose estimation of grasped objects,” arXiv preprint arXiv:2407.10331, 2024.
  • [10] L. Wu, J. Wang, L. Qi, K. Wu, H. Ren, and M. Q.-H. Meng, “Simultaneous hand–eye, tool–flange, and robot–robot calibration for comanipulation by solving the AXB=YCZ problem,” IEEE Transactions on Robotics, vol. 32, no. 2, pp. 413–428, 2016.
  • [11] Z. Fu, J. Pan, E. Spyrakos-Papastavridis, X. Chen, and M. Li, “A dual quaternion-based approach for coordinate calibration of dual robots in collaborative motion,” IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4086–4093, 2020.
  • [12] Z. Zhou, L. Ma, X. Liu, Z. Cao, and J. Yu, “Simultaneously calibration of multi hand–eye robot system based on graph,” IEEE Transactions on Industrial Electronics, vol. 71, no. 5, pp. 5010–5020, 2023.
  • [13] A. Elfes, “Sonar-based real-world mapping and navigation,” IEEE Journal on Robotics and Automation, 1987.
  • [14] R. Malladi, J. A. Sethian, and B. C. Vemuri, “Shape modeling with front propagation: a level set approach,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995.
  • [15] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon, “Kinectfusion: Real-time dense surface mapping and tracking,” in 2011 10th IEEE International Symposium on Mixed and Augmented Reality, 2011.
  • [16] S. O’Callaghan, F. T. Ramos, and H. Durrant-Whyte, “Contextual occupancy maps using gaussian processes,” in 2009 IEEE International Conference on Robotics and Automation, 2009.
  • [17] W. Zhi, L. Ott, R. Senanayake, and F. Ramos, “Continuous occupancy map fusion with fast bayesian hilbert maps,” in International Conference on Robotics and Automation (ICRA), 2019.
  • [18] W. Zhi, R. Senanayake, L. Ott, and F. Ramos, “Spatiotemporal learning of directional uncertainty in urban environments with kernel recurrent mixture density networks,” IEEE Robotics and Automation Letters, 2019.
  • [19] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, “Deepsdf: Learning continuous signed distance functions for shape representation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [20] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [21] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: deep hierarchical feature learning on point sets in a metric space,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
  • [22] W. Zhi, I. Akinola, K. van Wyk, N. Ratliff, and F. Ramos, “Global and reactive motion generation with geometric fabric command sequences,” in IEEE International Conference on Robotics and Automation, ICRA, 2023.
  • [23] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in ECCV, 2020.
  • [24] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Trans. Graph., 2022.
  • [25] T. Zhang, K. Huang, W. Zhi, and M. Johnson-Roberson, “Darkgs: Learning neural illumination and 3d gaussians relighting for robotic exploration in the dark,” 2024.
  • [26] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” CoRR, 2023.
  • [27] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” CoRR, 2021.
  • [28] S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” in CVPR, 2024.
  • [29] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • [30] V. Leroy, Y. Cabon, and J. Revaud, “Grounding image matching in 3d with mast3r,” 2024.
  • [31] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
  • [32] R. Bommasani and et al., “On the opportunities and risks of foundation models,” CoRR, 2021.
  • [33] R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y. Zhu, S. Song, A. Kapoor, K. Hausman, B. Ichter, D. Driess, J. Wu, C. Lu, and M. Schwager, “Foundation models in robotics: Applications, challenges, and the future,” CoRR, 2023.
  • [34] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning.” http://pybullet.org, 2016–2019.
  • [35] C. M. Bishop and N. M. Nasrabadi, Pattern recognition and machine learning, vol. 4. Springer, 2006.
  • [36] R. Brégier, “Deep regression on manifolds: a 3D rotation case study,” 2021.
  • [37] J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [38] J. Y. Zhang, A. Lin, M. Kumar, T.-H. Yang, D. Ramanan, and S. Tulsiani, “Cameras as rays: Pose estimation via ray diffusion,” in International Conference on Learning Representations (ICLR), 2024.
  • [39] J. Reizenstein, R. Shapovalov, P. Henzler, L. Sbordone, P. Labatut, and D. Novotny, “Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction,” in International Conference on Computer Vision, 2021.
  • [40] C. M. Bishop, Pattern recognition and machine learning, 5th Edition. Information science and statistics, Springer, 2007.
  • [41] H. Tang, T. Zhang, O. Kroemer, M. Johnson-Roberson, and W. Zhi, “Graphseg: Segmented 3d representations via graph edge addition and contraction,” arXiv preprint arXiv:2504.03129, 2025.
  • [42] A. Ten Pas, M. Gualtieri, K. Saenko, and R. Platt, “Grasp pose detection in point clouds,” The International Journal of Robotics Research, vol. 36, no. 13-14, pp. 1455–1473, 2017.
  • [43] W. Zhi, T. Lai, L. Ott, and F. Ramos, “Diffeomorphic transforms for generalised imitation learning,” in Learning for Dynamics and Control Conference, L4DC, 2022.
  • [44] W. Zhi, T. Zhang, and M. Johnson-Roberson, “Instructing robots by sketching: Learning from demonstration via probabilistic diagrammatic teaching,” in IEEE International Conference on Robotics and Automation, 2024.