Bi-Manual Joint Camera Calibration and Scene Representation

Haozhan Tang1, Tianyi Zhang1, Matthew Johnson-Roberson1,2, Weiming Zhi1
1 Robotics Institute, Carnegie Mellon University, Pittsburgh, USA. 2 College of Connected Computing, Vanderbilt University, USA. Email: wzhi@andrew.cmu.edu.
Project page: https://tomtang502.github.io/bijcr_web/
Abstract

Robot manipulation, especially bimanual manipulation, often requires setting up multiple cameras on multiple robot manipulators. Before robot manipulators can generate motion or even build representations of their environments, the cameras rigidly mounted to the robots need to be calibrated. Camera calibration is a cumbersome process that involves collecting a set of images, each capturing a pre-determined marker. In this work, we introduce the Bi-Manual Joint Calibration and Representation Framework (Bi-JCR). Bi-JCR enables multiple robot manipulators, each with a rigidly mounted camera, to circumvent taking images of calibration markers. By leveraging 3D foundation models for dense, marker-free multi-view correspondence, Bi-JCR jointly estimates: (i) the extrinsic transformation from each camera to its end-effector, (ii) the inter-arm relative poses between manipulators, and (iii) a unified, scale-consistent 3D representation of the shared workspace, all from the same captured RGB image sets. The representation, jointly constructed from images captured by cameras on both manipulators, lives in a common coordinate frame and supports collision checking and semantic segmentation to facilitate downstream bimanual coordination tasks. We empirically evaluate the robustness of Bi-JCR on a variety of tabletop environments and demonstrate its applicability to downstream tasks.

I INTRODUCTION

Robot manipulators with wrist-mounted cameras generally need to be meticulously calibrated offline to enable perceived objects to be transformed into the robot’s coordinate frame. This is done via a procedure known as hand-eye calibration, where the manipulator is moved through a set of poses and takes images of a known calibration marker, such as a checkerboard or AprilTag [1]. Traditional hand-eye calibration methods focus on a single “eye-in-hand” camera and rely on external markers to compute the rigid transform between camera and end-effector. When extended to two independently moving arms, these approaches must be repeated separately for each arm, and a secondary step is then required to fuse the two coordinate frames. In this work, we tackle the problem of calibrating dual manipulators with wrist-mounted cameras without using any calibration markers. Here, we assume that the poses of the cameras relative to the end-effectors, along with the relative poses of the manipulator bases, are unknown and require estimation.

Figure 1: We tackle a bi-manual setup, where the extrinsics of both cameras and the relative poses of the robot bases to one another are unknown. Bi-JCR recovers all three transformations.

Here, we propose a framework called Bi-manual Joint Representation and Calibration (Bi-JCR) that simultaneously builds a representation of the environment and calibrates both cameras for dual manipulators with wrist-mounted cameras. Bi-JCR uses the same set of images for both calibrating the cameras and constructing the environment representation, completely avoiding the need to calibrate offline with markers. Bi-JCR leverages modern 3D foundation models to efficiently estimate an unscaled representation along with unscaled camera poses from a set of images captured by the dual manipulators. Then, by considering the forward kinematics of each arm, we formulate a joint scale recovery and dual calibration problem, which can subsequently be solved via gradient descent on the manifold of transformation matrices. By optimizing a single calibration problem defined using images from both arms, Bi-JCR simultaneously solves for each hand-eye transform, aligns the two robot base frames, recovers the missing scale factor, and directly yields the rigid transform between the two manipulators. This enables immediate fusion of visual data across both viewpoints for bimanual manipulation, without reliance on external markers or depth sensors.

We empirically evaluate Bi-JCR and demonstrate its ability to accurately calibrate the cameras on both manipulators and produce a dense, size-accurate representation of the environment that can be transformed into the workspace coordinate frame. We leverage the representation for downstream manipulation, executing successful grasps and bi-manual hand-overs. Concretely, our contributions include:

  • The Bi-manual Joint Calibration and Representation (Bi-JCR) method, which leverages 3D foundation models to build an environment representation from images captured by wrist-mounted cameras while calibrating the cameras;

  • The formulation of a novel optimization problem that recovers the camera transformations on both manipulators, the relative pose between the manipulators, and a scale factor that brings the representation to metric scale;

  • Rigorous evaluation on real-world data, including an ablation over state-of-the-art 3D foundation models integrated into Bi-JCR.

II Related Work

Figure 2: The transformations between each arm, along with their cameras, are shown here. The transformations $T^{E_1}_{C_1}$, $T^{E_2}_{C_2}$, and $P^{b_1}_{b_2}$ are all unknown and will be recovered via Bi-JCR.

Hand–Eye Calibration: Decades-old closed-form solvers [2, 3, 4] remain the de facto standard because they are fast and provably correct under noise-free, marker-based conditions. Yet in modern labs, the very assumptions they rely on, such as a static checkerboard, are routinely violated. Recent learning-based variants regress the transform directly from images [5], but demand gripper visibility or prior CAD models and often degrade sharply outside the synthetic domain in which they are trained. Other end-to-end learning-based methods bypass calibration entirely, mapping pixels to torques [6, 7], but at the cost of losing an explicit transform that downstream planners and safety monitors still require. Foundation models for calibration are also explored in [8, 9], but have not been extended to the bi-manual setting.

Bi-manual Calibration: Extending single-manipulator hand-eye calibration to the bi-manual setup is not trivial, as it requires finding the pose of the secondary manipulator in the primary manipulator’s frame. Previously, some methods relied on the secondary manipulator holding a checkerboard to perform bi-manual calibration [10, 11]. A graph-based method that uses external markers to calibrate multiple manipulators simultaneously has also been explored in [12].

Scene Representation: Bimanual manipulation requires reasoning about shared workspaces where two end-effectors and several movable objects compete for space. Classical metric maps such as occupancy grids [13] and signed distance fields [14, 15] give fast binary or distance queries for collision checking, yet they discretize space and struggle to capture the fine contact geometry of small parts. Continuous alternatives, such as Gaussian process maps [16], kernel regressors [17, 18], and neural implicit surfaces [19], offer sub-voxel accuracy. Other learning-based methods directly consume point clouds [20, 21] or integrate them into trajectory optimization [22]. In the computer vision community, NeRF models [23, 24, 25] produce photorealistic reconstructions, often at the expense of accurate geometry.

Foundation Models: Large-scale models trained on web corpora, such as LLMs in NLP [26] and multimodal encoders in vision [27], have recently been explored for 3D perception [28, 29, 30, 31]. In robotics, these large pre-trained deep learning models are referred to as “foundation models” and are increasingly used as plug-and-play modules for downstream tasks [32, 33]. In our work, we leverage 3D foundation models as components in our pipeline.


Figure 3: Left: We have cameras (outlined in red) rigidly attached to dual manipulators; Right: With the cameras calibrated, we can bring a 3D reconstruction directly into the frame of the robot, and even inject it into a simulator (visualised in PyBullet [34]).

III Bi-manual Joint Representation and Calibration

The proposed Bi-manual Joint Representation and Calibration (Bi-JCR) framework aims to solve the eye-to-hand calibration problem for both manipulators and, in the process, also recover the relative poses of the robot bases. At the same time, we recover a dense 3D representation of the tabletop scene, which can facilitate downstream manipulation. This is done without relying on any camera pose information from external markers, such as checkerboards or AprilTags [1].

III-A Problem Setup:

We consider two manipulators, with one designated as the primary manipulator and the other as the secondary, each equipped with an end-effector-mounted low-cost RGB camera. A set of objects is arranged on a tabletop within their shared workspace. The rigid transformations from each camera to its corresponding end-effector are unknown and must be estimated. To collect data, we command each manipulator through a sequence of $N$ distinct end-effector poses, capturing an RGB image at each pose. We denote the primary manipulator's poses by $\{E_{1,1}, E_{1,2}, \cdots, E_{1,N}\}$ with $N$ corresponding RGB images $\{I_{1,1}, I_{1,2}, \cdots, I_{1,N}\}$. We denote the $N$ poses for the secondary manipulator as $\{E_{2,1}, E_{2,2}, \cdots, E_{2,N}\}$, with corresponding RGB images $\{I_{2,1}, I_{2,2}, \cdots, I_{2,N}\}$.

Using only these end-effector poses and captured images, Bi-JCR recovers all of the following:

  • The rigid transformation $T^{E_1}_{C_1}$ from Camera 1 to the primary end-effector;

  • The rigid transformation $T^{E_2}_{C_2}$ from Camera 2 to the secondary end-effector;

  • The scale factor $\lambda$ aligning the foundation model's output frame with the real-world metric frame;

  • The pose of the secondary base frame $b_2$ relative to the primary base frame $b_1$, denoted $P^{b_1}_{b_2}$;

  • The transformation $T^{b_1}_{w}$ from the foundation model's output frame $w$ to the primary base frame $b_1$;

  • A metric-scale 3D reconstruction of the scene in the primary base frame $b_1$.

The transformations between the dual manipulators, along with their attached cameras, are illustrated in Fig. 2. Here, only the forward kinematics of each manipulator, i.e., the transformation from its base to its end-effector, is known. The relative pose of the two manipulators is also initially unknown. Bi-JCR solves for all of the unknown transformations.
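As an illustration of the inputs and outputs described above, a minimal sketch of how the collected data and the recovered quantities could be organized is given below; the container and field names are illustrative placeholders rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class ArmCapture:
    """Data collected from one manipulator: N end-effector poses (4x4, in that
    arm's own base frame, from forward kinematics) and the N RGB images
    captured at those poses."""
    ee_poses: List[np.ndarray]
    images: List[np.ndarray]

@dataclass
class BiJCRResult:
    """Quantities recovered by Bi-JCR from the two captures."""
    T_cam_to_ee: List[np.ndarray]  # T^{E_1}_{C_1}, T^{E_2}_{C_2}
    scale: float                   # lambda: foundation-model frame -> metric scale
    T_w_to_b1: np.ndarray          # foundation output frame w -> primary base b1
    P_b2_in_b1: np.ndarray         # secondary base expressed in the primary base frame
    points_b1: np.ndarray          # metric-scale point cloud in frame b1
```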

Figure 4: A 3D foundation model takes a set of RGB images and outputs 3D point sets, camera poses, and confidence maps.

III-B 3D Foundation Models in the Pipeline

The two image sets can be fed into a 3D foundation model to obtain a reconstruction of the scene in an arbitrary coordinate frame and scale. We recover the relative camera poses

$\{P_{1,1},\dots,P_{1,N},\ P_{2,1},\dots,P_{2,N}\}$   (1)

along with corresponding point sets containing the reconstruction,

$\{\hat{X}_{1,1},\dots,\hat{X}_{1,N},\ \hat{X}_{2,1},\dots,\hat{X}_{2,N}\},$   (2)

where each point in a set corresponds to a pixel in the associated input image, and confidence values for each pixel can also be recovered. Pre-trained 3D foundation models are often used to extract structure from images of indoor scenes and building structures, and their application to tabletop scenes has been under-explored. Example outputs from one such model, DUSt3R [28], are given in Fig. 4. Because the foundation model recovers geometry only up to an unknown scale, neither the estimated camera poses nor the aggregated point cloud is expressed in real-world metric units. To resolve this scale ambiguity, we introduce a scale factor $\lambda$ so that the pose of camera $i$ on manipulator $m$ in the scaled foundation output frame $w$ becomes

$P^{w}_{m,i}(\lambda) = \begin{bmatrix} R_{m,i} & \lambda\, t_{m,i} \\ 0 & 1 \end{bmatrix} \in \mathrm{SE}(3),$   (3)

where $i=1,\dots,N$ and the index $m\in\{1,2\}$ denotes the manipulator. In this expression, $R_{m,i}\in\mathrm{SO}(3)$ denotes the rotation and $t_{m,i}\in\mathbb{R}^{3}$ is the translation estimated by the foundation model. Next, defining the transform from the scaled foundation frame $w$ to each manipulator's base frame $b_m$ as $T^{b_m}_{w}$, the camera poses in the real-world base frames are

$P^{b_m}_{m,i}(\lambda) = T^{b_m}_{w}\, P^{w}_{m,i}(\lambda), \quad i=1,\dots,N.$   (4)
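To make these quantities concrete, the following Python/NumPy sketch shows the shape of the expected foundation-model output and the scale application of Eqs. 3-4. The names are illustrative placeholders (they are not DUSt3R's actual API), and the wrapper around the chosen model is deliberately left unimplemented.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class ReconstructionOutput:
    """Per-image outputs of a 3D foundation model, in an arbitrary frame w and scale."""
    poses: List[np.ndarray]       # 4x4 camera poses P_{m,i} with unscaled translations
    point_maps: List[np.ndarray]  # HxWx3 point sets, one 3D point per pixel
    conf_maps: List[np.ndarray]   # HxW per-pixel confidence values

def run_foundation_model(images: List[np.ndarray]) -> ReconstructionOutput:
    """Placeholder: wrap the chosen 3D foundation model (e.g. DUSt3R) here."""
    raise NotImplementedError

def apply_scale(pose_w: np.ndarray, lam: float) -> np.ndarray:
    """P^w_{m,i}(lambda) of Eq. 3: keep the rotation, scale only the translation."""
    scaled = pose_w.copy()
    scaled[:3, 3] *= lam
    return scaled

def to_base_frame(pose_w: np.ndarray, T_w_to_base: np.ndarray, lam: float) -> np.ndarray:
    """Camera pose expressed in a manipulator base frame, Eq. 4."""
    return T_w_to_base @ apply_scale(pose_w, lam)
```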

III-C Solving for Initial Calibration Solution

In Bi-JCR, we seek to simultaneously solve for $\lambda$, $T^{b_1}_{w}$, and $T^{b_2}_{w}$ in the process of solving the bi-manual hand-eye calibration for the two camera-to-end-effector transformations $T^{E_1}_{C_1}$ and $T^{E_2}_{C_2}$. During the sequence of manipulator motions, the relationship between scaled camera poses and end-effector poses for manipulator $m\in\{1,2\}$ can be formulated as the classical hand-eye calibration equations [2]:

$E_{m,i}^{-1} E_{m,i+1}\, T^{E_m}_{C_m} = T^{E_m}_{C_m}\, {P^{b_m}_{m,i}}^{-1}(\lambda)\, P^{b_m}_{m,i+1}(\lambda)$   (5)

Now, we denote the transformations between consecutive poses by

$T^{E_{m,i+1}}_{E_{m,i}} = E_{m,i}^{-1} E_{m,i+1},$   (6)
$T^{P^{w}_{m,i+1}}_{P^{w}_{m,i}}(\lambda) = {P^{w}_{m,i}}^{-1}(\lambda)\, P^{w}_{m,i+1}(\lambda).$   (7)

Then, the hand-eye equations can be formulated as

$T^{E_{m,i+1}}_{E_{m,i}}\, T^{E_m}_{C_m} = T^{E_m}_{C_m}\, T^{P^{w}_{m,i+1}}_{P^{w}_{m,i}}(\lambda).$   (8)

Here we observe that Eq. 8 admits the $AX=XB$ form of the classical hand-eye calibration problem, with the right-hand side dependent on the scale factor $\lambda$.

The first phase of Bi-JCR aims to obtain an initial solution for the scale and the desired transformations, which we denote as $\lambda'$, ${T^{E_1}_{C_1}}'$, and ${T^{E_2}_{C_2}}'$. Since the rotation components lie in $\mathbf{SO}(3)$, we can first solve for the rotation components of the transformations, which are invariant to the scale factor. This can be achieved via linear algebra on the manifold of rotation matrices, following [4]. We map the rotation components of $T^{E_{m,i+1}}_{E_{m,i}}$ and $T^{P^{w}_{m,i+1}}_{P^{w}_{m,i}}$ into the Lie algebra $\mathfrak{so}(3)$ via the logarithm map of $\mathbf{SO}(3)$; for some $R\in\mathbf{SO}(3)$, this gives

$\omega = \arccos\!\left(\tfrac{\mathrm{Tr}(R)-1}{2}\right),$   (9)
$\mathrm{LogMap}(R) := \dfrac{\omega}{2\sin(\omega)} \begin{bmatrix} R_{3,2}-R_{2,3} \\ R_{1,3}-R_{3,1} \\ R_{2,1}-R_{1,2} \end{bmatrix} \in \mathfrak{so}(3),$   (10)

where $\mathrm{Tr}(\cdot)$ denotes the trace operator and the subscripts indicate element indices in $R$. We can then find the best-fit rotational components via:

${R^{E_m}_{C_m}}' = (M_m^{\top} M_m)^{-\frac{1}{2}} M_m^{\top}, \quad \text{where } M_m = \sum_{i=1}^{N-1} \mathrm{LogMap}\bigl(R^{E_{m,i+1}}_{E_{m,i}}\bigr) \otimes \mathrm{LogMap}\bigl(R^{P^{w}_{m,i+1}}_{P^{w}_{m,i}}\bigr).$   (11)

Here, $\otimes$ denotes the outer product, and the matrix inverse square root can be computed efficiently via singular value decomposition.
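As a concrete illustration of this rotation-fitting step, a minimal NumPy sketch of the log map (Eqs. 9-10) and the closed-form best-fit rotation (Eq. 11) might look as follows; it assumes the relative rotations have already been extracted from consecutive pose pairs, and it is a sketch rather than the authors' exact implementation.

```python
import numpy as np

def log_map(R: np.ndarray) -> np.ndarray:
    """Axis-angle vector of a rotation matrix (Eqs. 9-10).

    Only the small-angle case is guarded here; a full implementation would
    also handle rotation angles near pi."""
    omega = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return np.zeros(3)
    return omega / (2.0 * np.sin(omega)) * np.array(
        [R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])

def fit_rotation(R_ee_rel, R_cam_rel) -> np.ndarray:
    """Best-fit hand-eye rotation R^{E_m}_{C_m} from relative end-effector and
    relative camera rotations (Eq. 11)."""
    M = np.zeros((3, 3))
    for Re, Rc in zip(R_ee_rel, R_cam_rel):
        M += np.outer(log_map(Re), log_map(Rc))
    # (M^T M)^{-1/2} via SVD, then multiply by M^T.
    U, S, Vt = np.linalg.svd(M.T @ M)
    inv_sqrt = U @ np.diag(1.0 / np.sqrt(S)) @ Vt
    return inv_sqrt @ M.T
```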

Next, we solve for the translation components jointly with the scale factor, by minimizing, for each arm, the residuals of a scale recovery problem similar to that formulated in [8], under the assumption that the scale factor is shared between the two arms:

SRP: $\arg\min_{{t^{E_m}_{C_m}}',\, \lambda'} \sum_{i=1}^{N-1} \bigl\lVert Q_i\, {t^{E_m}_{C_m}}' - \mathbf{d}_i(\lambda') \bigr\rVert_2^2,$   (12)
where $Q_i = I - R^{E_{m,i+1}}_{E_{m,i}},$   (13)
and $\mathbf{d}_i(\lambda') = t^{E_{m,i+1}}_{E_{m,i}} - {R^{E_m}_{C_m}}'\, \lambda'\, t^{P^{w}_{m,i+1}}_{P^{w}_{m,i}}.$   (14)

Equation 12 can be solved via least-squares, and we obtain our solutions $\lambda'$, ${T^{E_1}_{C_1}}'$ and ${T^{E_2}_{C_2}}'$, which can be further refined via gradient-based optimisation.
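A compact sketch of how Eq. 12 can be assembled and solved as one linear least-squares system over both arms, with the unknown vector $[{t^{E_1}_{C_1}}'; {t^{E_2}_{C_2}}'; \lambda']$, is given below; the variable names and the stacking scheme are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def solve_translations_and_scale(R_ee_rel, t_ee_rel, t_cam_rel, R_handeye):
    """Jointly solve for t^{E_1}_{C_1}, t^{E_2}_{C_2} and the shared scale lambda'.

    Per-arm lists (m = 0, 1): R_ee_rel[m][i], t_ee_rel[m][i] are relative
    end-effector rotations/translations; t_cam_rel[m][i] the unscaled relative
    camera translations; R_handeye[m] the rotation solved in Eq. 11.
    """
    rows, rhs = [], []
    for m in range(2):
        for Re, te, tc in zip(R_ee_rel[m], t_ee_rel[m], t_cam_rel[m]):
            Q = np.eye(3) - Re                      # Eq. 13
            block = np.zeros((3, 7))
            block[:, 3 * m:3 * m + 3] = Q           # unknown translation of arm m
            block[:, 6] = R_handeye[m] @ tc         # lambda' scales the camera translation
            rows.append(block)
            rhs.append(te)                          # Eq. 14, rearranged
    x, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(rhs), rcond=None)
    return x[0:3], x[3:6], x[6]
```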

III-D Refine Calibration through Gradient-based Optimization

We further refine the solutions via gradient descent to improve estimation accuracy. Here, we first rearrange Equation 8 into

$T^{E_{m,i+1}}_{E_{m,i}}\, T^{E_m}_{C_m} - T^{E_m}_{C_m}\, T^{P^{w}_{m,i+1}}_{P^{w}_{m,i}} = 0, \quad \text{for } i\in\{1,\cdots,N-1\},\ m\in\{1,2\}.$   (15)

We can then solve the calibration problem by minimizing the difference between the transformation matrices $T^{E_{m,i+1}}_{E_{m,i}} T^{E_m}_{C_m}$ and $T^{E_m}_{C_m} T^{P^{w}_{m,i+1}}_{P^{w}_{m,i}}$. Specifically, we define a cost function as

$\ell(\lambda, T^{E_1}_{C_1}, T^{E_2}_{C_2}) = \sum_{m\in\{1,2\}} \frac{1}{N-1} \sum_{i=1}^{N-1} \bigl( \alpha\, D_R(T^{E_m}_{C_m}) + (1-\alpha)\, D_t(\lambda, T^{E_m}_{C_m}) \bigr),$   (16)

where $D_R(T^{E_m}_{C_m}) = \arccos\!\bigl(\tfrac{\mathrm{tr}({R^{E}_{m,i}}^{\top} R^{C}_{m,i})-1}{2}\bigr)$ and $D_t(\lambda, T^{E_m}_{C_m}) = \bigl\lVert t^{E}_{m,i} - t^{C}_{m,i} \bigr\rVert_2$, with $R^{E}_{m,i}, t^{E}_{m,i}$ and $R^{C}_{m,i}, t^{C}_{m,i}$ denoting the rotation and translation components of the left- and right-hand sides of Eq. 15, respectively,

Here, $\mathrm{tr}(\cdot)$ is the trace operator. We can then minimize the cost function via gradient descent [35] while constraining the rotations to remain on the $\mathbf{SO}(3)$ manifold, using the results from the previous section as the initial solution. To keep the rotational components in $\mathbf{SO}(3)$ during the backpropagation process, we follow [36]: we first pull each rotation into the Lie algebra with the logarithm map, then perform the gradient update on the resulting axis-angle vector in $\mathbb{R}^3$, and push it back onto the manifold via the exponential map. With the solutions that minimize the cost function, we can then obtain the world-to-base transformations $T^{b_1}_{w}$ and $T^{b_2}_{w}$ via

$T^{b_m}_{w} = \operatorname*{AVG_{SE3}}_{i\in\{1,\cdots,N-1\}} \Bigl( T^{E_{m,i+1}}_{E_{m,i}}\, \bigl(T^{E_m}_{C_m} T^{P^{w}_{m,i+1}}_{P^{w}_{m,i}}\bigr)^{-1} \Bigr)$   (17)

where $\operatorname*{AVG_{SE3}}$ is the average over a set of transformation matrices on $\mathbf{SE}(3)$, computed by averaging the rotation and translation components separately.
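The sketch below illustrates one way this refinement and the averaging could be realized: each hand-eye rotation is parameterized by an axis-angle vector, re-materialized with the exponential map at every step, and a simplified version of the cost in Eq. 16 is minimized with a generic optimizer (Adam is an assumption here); the AVG_SE3 helper uses an arithmetic mean of translations and an SVD-projected mean of rotations. It is a sketch under those assumptions, not the authors' released code.

```python
import numpy as np
import torch

def hat(w: torch.Tensor) -> torch.Tensor:
    """Skew-symmetric matrix of a 3-vector (kept differentiable)."""
    zero = torch.zeros((), dtype=w.dtype)
    return torch.stack([torch.stack([zero, -w[2], w[1]]),
                        torch.stack([w[2], zero, -w[0]]),
                        torch.stack([-w[1], w[0], zero])])

def exp_map(w: torch.Tensor) -> torch.Tensor:
    """Exponential map so(3) -> SO(3) via the Rodrigues formula."""
    theta = torch.linalg.norm(w) + 1e-12
    K = hat(w / theta)
    return torch.eye(3, dtype=w.dtype) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def refine(ee_rel, cam_rel, w0, t0, lam0, alpha=0.5, iters=500, lr=1e-2):
    """Refine both hand-eye transforms and the shared scale (cf. Eqs. 15-16).

    ee_rel[m][i], cam_rel[m][i]: 4x4 relative end-effector / (unscaled) camera
    transforms for arm m; w0[m], t0[m]: initial axis-angle and translation from
    the closed-form stage; lam0: initial scale estimate."""
    w = [torch.tensor(x, dtype=torch.float64, requires_grad=True) for x in w0]
    t = [torch.tensor(x, dtype=torch.float64, requires_grad=True) for x in t0]
    lam = torch.tensor(float(lam0), dtype=torch.float64, requires_grad=True)
    opt = torch.optim.Adam(w + t + [lam], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = torch.zeros((), dtype=torch.float64)
        for m in range(2):
            for A_np, B_np in zip(ee_rel[m], cam_rel[m]):
                A = torch.as_tensor(A_np, dtype=torch.float64)
                B = torch.as_tensor(B_np, dtype=torch.float64)
                R_x = exp_map(w[m])
                # Left side of Eq. 15: T^E_rel T_X ; right side: T_X T^P_rel(lambda).
                R_l, t_l = A[:3, :3] @ R_x, A[:3, :3] @ t[m] + A[:3, 3]
                R_r, t_r = R_x @ B[:3, :3], R_x @ (lam * B[:3, 3]) + t[m]
                cos = ((R_l.T @ R_r).diagonal().sum() - 1.0) / 2.0
                D_R = torch.arccos(torch.clamp(cos, -1 + 1e-7, 1 - 1e-7))
                loss = loss + alpha * D_R + (1 - alpha) * torch.linalg.norm(t_l - t_r)
        loss.backward()
        opt.step()
    return ([exp_map(wi).detach().numpy() for wi in w],
            [ti.detach().numpy() for ti in t], lam.item())

def avg_se3(transforms):
    """AVG_SE3 of Eq. 17: mean translation, SVD-projected mean rotation."""
    Ts = np.stack(transforms)
    U, _, Vt = np.linalg.svd(Ts[:, :3, :3].sum(axis=0))
    R_mean = U @ Vt
    if np.linalg.det(R_mean) < 0:       # enforce a proper rotation
        U[:, -1] *= -1
        R_mean = U @ Vt
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R_mean, Ts[:, :3, 3].mean(axis=0)
    return T
```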

III-E Obtaining Metric-Scale 3D Representation

Here, we seek to build a real-world, metric-scale 3D representation of the environment in the primary manipulator's base frame. Following Equation 3, we first scale the camera poses by $\lambda$ to get $\{P^{w}_{1,1},\cdots,P^{w}_{1,N}\}$ and $\{P^{w}_{2,1},\cdots,P^{w}_{2,N}\}$. We can then obtain the calibrated metric-scale camera poses by

$P^{b_1}_{m,i} = T^{b_1}_{w} P^{w}_{m,i}, \quad \text{for } m\in\{1,2\},\ i\in\{1,\cdots,N\}.$   (18)

Next, we also use the point sets from each arm associated with each input image, $\{\hat{X}_{1,1},\cdots,\hat{X}_{1,N}\}$ and $\{\hat{X}_{2,1},\cdots,\hat{X}_{2,N}\}$, together with their associated confidence maps $\{C_{1,1},\cdots,C_{1,N}\}$ and $\{C_{2,1},\cdots,C_{2,N}\}$ from the output of the foundation model, to recover a rich and high-quality representation of the environment. We first apply a confidence threshold to filter out points below this threshold in each $\hat{X}_{m,i}$. Then, we transform the points from the filtered point sets, $\{\mathbf{x}_i\}_{i=1}^{N_{pc}}$, to obtain a point cloud at real-world metric scale in the primary manipulator's base frame, $\{\mathbf{x}^{b_1}_i\}_{i=1}^{N_{pc}}$, through

$\{\mathbf{x}^{b_1}_{i} = T^{b_1}_{w}(\lambda \mathbf{x}_{i}), \ \text{for } i\in\{1,\cdots,N_{pc}\}\}.$   (19)

Furthermore, the pose of the secondary manipulator’s base in the primary manipulator’s base frame can be computed as

$P^{b_1}_{b_2} = T^{b_1}_{w}\, {T^{b_2}_{w}}^{-1}.$   (20)

The pose of the secondary manipulator's base in the primary manipulator's base frame enables us to compute end-effector poses for downstream tasks of both manipulators in a single unified frame. This facilitates downstream processes such as object segmentation and grasp generation.
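A short sketch of this final assembly step, confidence filtering, scaling by $\lambda$, and mapping into the primary base frame (Eqs. 18-20), is given below; the confidence threshold value is an arbitrary placeholder and the array layouts are assumptions.

```python
import numpy as np

def fuse_pointcloud(point_sets, conf_maps, lam, T_w_b1, conf_thresh=3.0):
    """Filter foundation-model points by confidence, scale them by lambda, and
    express them in the primary base frame b1 (Eqs. 18-19)."""
    fused = []
    for X, C in zip(point_sets, conf_maps):
        pts = X.reshape(-1, 3)[C.reshape(-1) > conf_thresh]     # drop low-confidence points
        pts_h = np.hstack([lam * pts, np.ones((len(pts), 1))])  # scale, then homogenize
        fused.append((T_w_b1 @ pts_h.T).T[:, :3])
    return np.concatenate(fused, axis=0)

def base_to_base(T_w_b1, T_w_b2):
    """Pose of the secondary base in the primary base frame (Eq. 20)."""
    return T_w_b1 @ np.linalg.inv(T_w_b2)
```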

| Method | Metric | Scene A, 4 img | Scene A, 7 img | Scene A, 9 img | Scene B, 4 img | Scene B, 7 img | Scene B, 9 img | Scene C, 4 img | Scene C, 7 img | Scene C, 9 img |
|---|---|---|---|---|---|---|---|---|---|---|
| Bi-JCR (Ours) | Converged | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Residual δR | 0.0769 | 0.0724 | 0.0668 | 0.0740 | 0.0617 | 0.0569 | 0.0743 | 0.0634 | 0.0612 |
| | Residual δt | 0.0461 | 0.0391 | 0.0378 | 0.0587 | 0.0351 | 0.0340 | 0.0424 | 0.0341 | 0.0373 |
| | No. of Poses (Left) | 4 | 7 | 9 | 4 | 7 | 9 | 4 | 7 | 9 |
| | No. of Poses (Right) | 4 | 7 | 9 | 4 | 7 | 9 | 4 | 7 | 9 |
| COLMAP [37] + Calibration | Converged | × | × | × | ✓ | ✓ | ✓ | × | ✓ | ✓ |
| | Residual δR | NA | NA | NA | 0.0865 | 0.0583 | 0.0584 | NA | 0.0591 | 0.0586 |
| | Residual δt | NA | NA | NA | 0.0317 | 0.0269 | 0.0274 | NA | 0.0277 | 0.0271 |
| | No. of Poses (Left) | 0 | 4 | 2 | 4 | 7 | 9 | 0 | 7 | 9 |
| | No. of Poses (Right) | 4 | 0 | 0 | 4 | 7 | 9 | 4 | 7 | 9 |
| Ray Diffusion [38] + Calibration | Residual δR | 0.4845 | 0.4288 | 0.7951 | 0.4948 | 0.3839 | 0.3634 | 0.4504 | 0.2610 | 0.2223 |
| | Residual δt | 0.2118 | 0.1416 | 0.1309 | 0.2067 | 0.2051 | 0.1637 | 0.2125 | 0.1858 | 0.1760 |
| | No. of Poses (Left) | 4 | 7 | 9 | 4 | 7 | 9 | 4 | 7 | 9 |
| | No. of Poses (Right) | 4 | 7 | 9 | 4 | 7 | 9 | 4 | 7 | 9 |

TABLE I: Quantitative evaluation of Bi-JCR's calibration residual errors ($\delta_R$ and $\delta_t$) against baseline methods. Columns denote the scene and the number of images per manipulator. Lower residuals indicate more accurate calibrations.

IV Empirical Evaluation

In this section, we rigorously evaluate our proposed Bi-Manual Joint Calibration and Representation (Bi-JCR) method. Our bi-manual setup consists of two AgileX Piper 6 degree-of-freedom manipulators, each with a low-cost USB webcam mounted on the gripper. We seek to answer the following questions:

  1. Can Bi-JCR produce correct hand-eye calibration for both arms, even when the number of images provided is low?

  2. Can Bi-JCR recover the scale accurately, such that our representation's sizes match the physical world?

  3. Can high-quality 3D environment representations, in the correct coordinate frame, be built?

  4. How do different 3D foundation models change the calibration accuracy?

  5. Does refinement via additional gradient descent improve calibration accuracy?

  6. Does Bi-JCR facilitate downstream bi-manual manipulation tasks?

IV-A Eyes-to-Hands Calibration with Bi-JCR

Baselines: COLMAP-based pose estimation and Ray Diffusion. To assess the calibration quality of Bi-JCR, we compare against two alternatives. First, in the absence of high-contrast markers such as checkerboards or AprilTags [1], we use SfM via COLMAP [37] and apply the Park–Martin algorithm [4] to compute eye-to-hand transformations for each manipulator. Second, we evaluate Ray Diffusion [38], a sparse-view diffusion model trained on large datasets [39] that directly regresses camera poses.

Task and Metrics: We take images in three different environments: a scene under darker lighting conditions with 9 items (scene A), and two scenes under brighter lighting conditions with different sets of 9 and 10 items, respectively (scenes B and C). Bi-JCR and the two baseline methods are evaluated with an increasing number of input images (4, 7, and 9 images per manipulator). We then check whether the calibration has converged correctly by computing residual losses via Equation 15, with ground truth obtained via AprilTags [1] on the right-hand side. Lower residual values indicate a higher degree of consistency.
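For reference, a minimal sketch of one way such residuals $\delta_R$ and $\delta_t$ could be computed from Eq. 15 is shown below, assuming the ground-truth relative camera motions from the AprilTag calibration are available as 4x4 matrices; averaging over pose pairs is an assumption, as the exact aggregation is not spelled out here.

```python
import numpy as np

def calibration_residuals(ee_rel, cam_rel_gt, T_handeye):
    """Mean rotational (radians) and translational residuals of Eq. 15, with
    ground-truth relative camera motions on the right-hand side."""
    d_rot, d_trans = [], []
    for A, B in zip(ee_rel, cam_rel_gt):
        left = A @ T_handeye            # T^{E_{m,i+1}}_{E_{m,i}} T^{E_m}_{C_m}
        right = T_handeye @ B           # T^{E_m}_{C_m} T^{P^w_{m,i+1}}_{P^w_{m,i}}
        cos = (np.trace(left[:3, :3].T @ right[:3, :3]) - 1.0) / 2.0
        d_rot.append(np.arccos(np.clip(cos, -1.0, 1.0)))
        d_trans.append(np.linalg.norm(left[:3, 3] - right[:3, 3]))
    return float(np.mean(d_rot)), float(np.mean(d_trans))
```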

Results: We tabulate our results in Table I. We observe that COLMAP often results in diverged calibration, since images for which correspondences cannot be found are discarded. Whether the calibration converged, and the number of images used from each manipulator, are also shown in Table I. Although Ray Diffusion registered all images, it registered them inconsistently in this tabletop setup of cluttered, partially visible objects, causing the calibration optimizer to accumulate large errors. Our Bi-JCR method consistently produces smaller residuals for both lower and higher numbers of views on both manipulators, showing remarkable image efficiency. We also observe that Bi-JCR's residual loss decreases as the number of images increases, in contrast to the fluctuating residuals of the two baseline methods, showing a reliable precision gain with an increasing number of views.

Figure 5: Visualization of the objects used for scale validation in Table II: (a) Spoon, (b) Tea Lid, (c) Tape, (d) Battery, (e) Toolbox, (f) Joystick. For each object, the top image shows its real-world appearance and the bottom shows its reconstruction. The blue and yellow dots mark the length measured for scale validation.

| Object | Spoon | Tea Lid | Tape | Battery | Toolbox | Joystick |
|---|---|---|---|---|---|---|
| 5 Images | 4.90% | 4.3% | 2.73% | 1.59% | 10.90% | 7.73% |
| 8 Images | 1.61% | 0.26% | 1.34% | 1.10% | 2.53% | 2.98% |

TABLE II: Percentage error between the dimensions of real-world objects and their reconstructions. With 8 images per manipulator, the percentage errors of reconstructed sizes are at most 2.98%, indicating accurate scale recovery.

Figure 6: Qualitative evaluation of the camera calibration in the three scenes. We observe that the cameras, indicated as cones, are aligned with the end-effector poses, indicating accurate calibration. The end-effector and camera poses of the left manipulator are colored in shades of red and those of the right in blue.

Figure 7: Qualitative evaluation of the recovered 3D reconstructions of scene A, scene B, and scene C, from left to right. The reconstructions are dense and geometrically accurate.

Here, we also visualize the aligned camera and end-effector poses after calibration via Bi-JCR in Figure 6. The end-effectors are outlined as U-shapes and the cameras are represented by cones. Both end-effector poses and camera poses are transformed into the primary manipulator's base frame using the base-to-base transformation estimated by Bi-JCR. The primary manipulator's end-effector poses and its eye-to-hand-transformed camera poses are colored in red, and the secondary manipulator's in blue. As shown in Figure 6, Bi-JCR successfully recovers eye-to-hand transformations that consistently align camera and end-effector poses for both the primary and secondary manipulator across all scenes.

IV-B Accurate Metric Scale Recovery with Bi-JCR

Bi-JCR also reconstructs a dense 3D point cloud of the real-world environment at metric scale. Here, we evaluate the accuracy of scale recovery by comparing the side lengths of real-world objects against those of their reconstructions, using 5 and 8 images collected from each manipulator, as shown in Figure 5. The error, computed by

\[
\operatorname{err}_{\mathrm{obj}} = \frac{|s_{\text{reconstructed}} - s_{\text{real world}}|}{s_{\text{real world}}},
\tag{21}
\]

is reported in Table II. With only 8 images per manipulator, Bi-JCR reduces the error to a median of 1.48% and a maximum of 2.98%, marking precise recovery of real-world metric scale.
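A minimal sketch of the measurement behind Equation 21: pick the two annotated points on the reconstructed object (the blue and yellow dots in Figure 5), compute the reconstructed length, and compare it against the hand-measured real-world length. The point indices and the 0.185 m length below are made-up example values.

```python
import numpy as np

def scale_error(points, idx_a, idx_b, real_length_m):
    """Percentage error between a reconstructed length and its real-world counterpart."""
    s_rec = np.linalg.norm(points[idx_a] - points[idx_b])
    return abs(s_rec - real_length_m) / real_length_m

# Example with placeholder data: a random stand-in cloud and a hypothetical 0.185 m edge.
points = np.random.rand(1000, 3)
print(f"scale error: {scale_error(points, 10, 500, 0.185) * 100:.2f}%")
```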

                          Darker Light Condition   Brighter Light Condition   Brighter Light Condition
                          (9 items)                (8 items)                  (7 items)
Images per Manipulator    4       7       9        4       7       9          4       7       9
DUSt3R [28] + Bi-JCR
  Residual δ_R            0.0769  0.0724  0.0668   0.0740  0.0617  0.0569     0.0743  0.0634  0.0612
  Residual δ_t            0.0461  0.0391  0.0378   0.0587  0.0351  0.0340     0.0424  0.0341  0.0373
MASt3R [30] + Bi-JCR
  Residual δ_R            0.2613  0.2404  0.2020   0.2636  0.1541  0.1302     0.2350  0.1482  0.1312
  Residual δ_t            0.1639  0.1378  0.1219   0.1538  0.1012  0.0811     0.1514  0.1009  0.0896
VGGT [31] + Bi-JCR
  Residual δ_R            0.3763  0.3728  0.5358   0.3793  0.3561  0.5186     0.5274  0.3492  0.3524
  Residual δ_t            0.1241  0.1255  0.1438   0.1290  0.1676  0.1818     0.2448  0.1692  0.1649

TABLE III: Quantitative results on selecting the best foundation model for Bi-JCR. The foundation model used by Bi-JCR, DUSt3R [28], is compared against MASt3R [30] and VGGT [31] in terms of the rotational and translational residual losses from Equation 15. In all three scenarios, DUSt3R outperforms both MASt3R and VGGT for every number of views per manipulator.

Figure 8: Visualization of the recovered base-to-base transformation and of the recovered transformation from the foundation model’s output frame to the primary manipulator’s base frame, for scene A (top left), scene B (top right), and scene C (bottom left), alongside the real-world bi-manual base setup (bottom right).

IV-C 3D Representation in Primary Manipulator’s Base Frame

Besides scale, the quality of the 3D representation built by Bi-JCR is critical to downstream tasks. Here, we qualitatively assess the reconstructed 3D point cloud by visualizing it against images of the real-world environment in Figure 7. We observe that Bi-JCR correctly reconstructs the relative positions and orientations of objects in the environment, and that the shape of each object is well preserved. We further investigate whether the representation can be accurately transformed into the robot’s coordinate frame, and whether the secondary manipulator’s base is correctly placed in the primary manipulator’s base frame. We inject the reconstruction, along with the manipulator poses, into the PyBullet simulator [34]. As shown in Figure 8, the relative pose of the two manipulators closely resembles that of the two manipulators in the real world, indicating a correct estimate of the secondary manipulator’s pose in the base frame of the primary manipulator. The reconstructed table remains parallel to the manipulator bases in simulation, and object orientations and positions are visually correct relative to the bases, indicating both a high-quality 3D reconstruction and its accurate transformation into the primary manipulator’s frame.
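As a sketch of how such a fused result can be injected into PyBullet [34], one can place the secondary arm using the recovered base-to-base transform and overlay the metric point cloud as debug points. The file names and URDFs below are placeholders, and addUserDebugPoints assumes a recent PyBullet release.

```python
import numpy as np
import pybullet as p
from scipy.spatial.transform import Rotation as R

# Placeholder inputs: the metric point cloud (in the primary base frame) and the
# base-to-base transform estimated by Bi-JCR, both stored as numpy arrays.
points = np.load("scene_points_base1.npy")   # (N, 3)
colors = np.load("scene_colors.npy")         # (N, 3), RGB in [0, 1]
T_b1_b2 = np.load("T_base1_base2.npy")       # (4, 4)

p.connect(p.GUI)

# Primary manipulator sits at the origin of its own base frame.
p.loadURDF("primary_arm.urdf", basePosition=[0, 0, 0], useFixedBase=True)

# Secondary manipulator is placed with the recovered base-to-base transform.
pos2 = T_b1_b2[:3, 3].tolist()
quat2 = R.from_matrix(T_b1_b2[:3, :3]).as_quat().tolist()  # (x, y, z, w), PyBullet convention
p.loadURDF("secondary_arm.urdf", basePosition=pos2, baseOrientation=quat2, useFixedBase=True)

# Overlay the reconstruction for visual inspection.
p.addUserDebugPoints(points.tolist(), colors.tolist(), pointSize=2)
```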

IV-D Ablation Study on Different Foundation Models

With the recent development of 3D reconstruction via structure from motion (SfM), many new foundation models have appeared that outperform DUSt3R [28], the foundation model chosen here, on various task benchmarks, such as MASt3R [30] and VGGT [31]. We therefore evaluate the performance of Bi-JCR with different foundation models at 4, 7, and 9 views per manipulator, using the complete mode to find correspondences between all pairs of views for MASt3R, and report the residual losses in Table III. Both MASt3R and DUSt3R attain lower residual losses than VGGT at higher numbers of views per manipulator, likely because MASt3R and DUSt3R use an ICP process to retain consistency in camera pose estimation. Furthermore, DUSt3R produces camera poses with the best overall residual calibration loss, so DUSt3R remains the primary choice for Bi-JCR.
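For reference, the sketch below shows how unscaled camera poses and point maps can be obtained from DUSt3R, following its publicly documented interface; the checkpoint name, image paths, and optimizer settings are assumptions and may differ from our actual configuration.

```python
import torch
from dust3r.model import AsymmetricCroCo3DStereo
from dust3r.utils.image import load_images
from dust3r.image_pairs import make_pairs
from dust3r.inference import inference
from dust3r.cloud_opt import global_aligner, GlobalAlignerMode

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AsymmetricCroCo3DStereo.from_pretrained(
    "naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt").to(device)

# Images captured by the wrist cameras of both manipulators (paths are placeholders).
imgs = load_images(["arm1_view0.png", "arm1_view1.png",
                    "arm2_view0.png", "arm2_view1.png"], size=512)

# "complete" scene graph: correspondences between all pairs of views.
pairs = make_pairs(imgs, scene_graph="complete", prefilter=None, symmetrize=True)
out = inference(pairs, model, device, batch_size=1)

# Global alignment yields per-view camera poses and point maps up to an unknown scale.
scene = global_aligner(out, device=device, mode=GlobalAlignerMode.PointCloudOptimizer)
scene.compute_global_alignment(init="mst", niter=300, schedule="cosine", lr=0.01)
unscaled_poses = scene.get_im_poses()  # camera-to-world poses, up to scale
pointmaps = scene.get_pts3d()          # dense 3D points per view, up to scale
```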

IV-E Ablation Study on Bi-JCR’s Gradient Descent

Bi-JCR leverages gradient descent to further refine initial solutions. We experimentally evaluate the benefit of this refinement by comparing the residual losses of Bi-JCR with and without refinement via gradient descent [40]. As shown in Table IV, gradient descent refinement yields a marked improvement in rotational loss with fewer views. When the number of views per manipulator is doubled from 4 to 8, Bi-JCR with gradient descent refinement still outperforms Bi-JCR without it. We observe that gradient descent refinement is a critical component of Bi-JCR under sparse-view setups, but becomes less impactful as the number of images per manipulator increases.
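To illustrate the refinement step, the sketch below performs gradient descent on a simplified single-arm variant of the objective: it refines a hand-eye transform X and a scale s so that relative end-effector motions A_i and scaled relative camera motions B_i satisfy A_i X ≈ X B_i. This is only a stand-in for Bi-JCR’s full joint objective (which also couples the two arms and the base-to-base transform), and the SO(3) exponential-map parameterization is one possible choice.

```python
import torch

def hat(w):
    # so(3) hat operator: maps a 3-vector to a skew-symmetric matrix.
    O = torch.zeros((), dtype=w.dtype)
    return torch.stack([
        torch.stack([O, -w[2], w[1]]),
        torch.stack([w[2], O, -w[0]]),
        torch.stack([-w[1], w[0], O]),
    ])

def refine_hand_eye(A_list, B_list, iters=500, lr=1e-2):
    """A_list, B_list: lists of 4x4 torch tensors (relative end-effector / camera motions)."""
    w = torch.zeros(3, requires_grad=True)       # axis-angle rotation of X
    t = torch.zeros(3, requires_grad=True)       # translation of X
    log_s = torch.zeros(1, requires_grad=True)   # log of the scale factor, keeps s > 0
    opt = torch.optim.Adam([w, t, log_s], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        R_x = torch.linalg.matrix_exp(hat(w))    # exponential map keeps R_x on SO(3)
        s = log_s.exp()
        loss = 0.0
        for A, B in zip(A_list, B_list):
            R_a, t_a = A[:3, :3], A[:3, 3]
            R_b, t_b = B[:3, :3], B[:3, 3]
            dR = R_a @ R_x - R_x @ R_b                      # rotational consistency residual
            dt = R_a @ t + t_a - (R_x @ (s * t_b) + t)      # translational consistency residual
            loss = loss + (dR ** 2).sum() + (dt ** 2).sum()
        loss.backward()
        opt.step()
    with torch.no_grad():
        X = torch.eye(4)
        X[:3, :3] = torch.linalg.matrix_exp(hat(w))
        X[:3, 3] = t
    return X, log_s.exp().item()
```

In practice the optimization is warm-started from the closed-form initial solution rather than from the identity used above.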

                          Darker Condition    Lighter Condition
Images per Manipulator    4        8          4        8
Bi-JCR w/ GD
  Residual δ_R            0.0785   0.0698     0.0743   0.0619
  Residual δ_t            0.0478   0.0390     0.0424   0.0372
Bi-JCR w/o GD
  Residual δ_R            0.0920   0.0704     0.0927   0.0628
  Residual δ_t            0.0477   0.0391     0.0427   0.0374

TABLE IV: Quantitative results on the effect of gradient descent (GD) [35] in Bi-JCR. Under sparse-image conditions, GD significantly improves the rotational residual error, and Bi-JCR with GD also slightly outperforms Bi-JCR without GD when a larger number of images is used.
Figure 9: Visualization of running the segmentation algorithm on the real-world, metric-scale 3D representation reconstructed by Bi-JCR: (a) the representation in the robot’s frame; (b) the segmented tabletop. This enables various bi-manual downstream tasks such as joint grasping and passing.

Figure 10: Execution of bi-manual joint grasping on heavier objects in the scene: (a) heavier objects are selected from the scene and segmented out; (b) bi-manual grasps computed from the representation produced by Bi-JCR are executed in the real world. Background blurred for greater clarity.

Figure 11: Top: selected objects (wrench, spoon, balance meter, and tape) after segmentation. Bottom: generated robot end-effector grasping poses for manipulator hand-overs.

Figure 12: We demonstrate passing of the objects: wrench, spoon, balance meter, and tape, from top to bottom.

IV-F Bi-JCR Enables Bi-manual Downstream Tasks

To assess both the relative‐pose estimation between the two manipulators’ bases and the fidelity of the reconstructed 3D scene, we conduct two real‐world bimanual tasks: (1) joint grasping of large objects and (2) passing of small objects. We begin by running Bi‐JCR to recover the base‐to‐base transformation and reconstruct the 3D environment in simulation (Fig. 9(a)). Next, we apply the 3D point‐cloud segmentation algorithm from [41] to isolate each object on the tabletop (Fig. 9(b)).

Joint grasping: We first select the clusters corresponding to the large objects: a toolbox, a controller, a battery pack, and a box, as illustrated in Fig. 10. A grasp pose for the primary manipulator is computed in the primary manipulator’s base frame using [42], and likewise for the secondary manipulator. The secondary grasp must then be transformed into the secondary manipulator’s base frame. Finally, both end‐effector poses are executed via inverse kinematics and joint control to perform the coordinated grasp.
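Since both grasp poses are computed in the frame of the shared reconstruction, mapping the secondary grasp into the secondary manipulator’s base frame reduces to one rigid-transform composition. A minimal sketch (variable names ours) is shown below, with an optional conversion to the position-and-quaternion form expected by typical IK solvers.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def grasp_in_secondary_base(T_b1_grasp, T_b1_b2):
    """Map a grasp pose from the primary base frame into the secondary base frame."""
    return np.linalg.inv(T_b1_b2) @ T_b1_grasp

def to_pos_quat(T):
    """Convert a 4x4 homogeneous transform into (position, quaternion (x, y, z, w))."""
    return T[:3, 3], R.from_matrix(T[:3, :3]).as_quat()
```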

Bimanual passing: We then focus on the small objects: a wrench, a spoon, a balance meter, and a tape, shown in Fig. 11. After choosing a target transfer location in the primary manipulator’s base frame, we translate the segmented point cloud to that location and generate precise poses for both arms to enable object passing. Again, the secondary end‐effector pose is reprojected into the secondary manipulator’s base frame. The primary manipulator loads the object onto its gripper and then successfully hands it off to the secondary manipulator at the specified location (Fig. 12). The successful passing of the objects highlights that the transformation between the local coordinate frames of the two robots is accurately recovered. These experiments demonstrate that Bi‐JCR reliably enables complex bimanual operations by accurately calibrating the cameras on both manipulators and constructing a high‐quality dense 3D reconstruction.

IV-G Summary of Empirical Results

From our empirical experiments, we demonstrate that:

  1. Bi-JCR’s calibration produces highly accurate eye-to-hand transformations.
  2. The recovered scale factor successfully converts the foundation model’s output into real-world metric units.
  3. Bi-JCR generates a high-quality dense 3D point cloud with correct object geometry that can be correctly transformed into the robot’s frame.
  4. Among 3D foundation models, DUSt3R [28] consistently delivers the best performance for Bi-JCR.
  5. Gradient-descent refinement is effective under sparse-view conditions, although its marginal benefit diminishes as view density increases.
  6. The accurate transformations and dense 3D representations produced by Bi-JCR enable various downstream bimanual tasks, including coordinated grasping and object transfer between manipulators.

V Conclusion and Future work

We introduced Bi-JCR, a unified framework for joint calibration and 3D representation in bimanual robotic systems with wrist-mounted cameras. Leveraging large 3D foundation models, Bi-JCR removes the need for calibration markers and simultaneously estimates camera extrinsics, inter-arm relative poses, and a shared, metric-consistent scene representation. Our approach unifies the calibration and perception processes using only RGB images, and facilitates downstream bi-manual tasks such as grasping and object handover. Extensive real-world evaluations demonstrate Bi-JCR’s performance across diverse environments. Future work will leverage confidence masks from 3D foundation models to actively guide novel image collection, continuously complete the reconstructed scene, refine calibration, and then generate motion [43, 44].

References

  • [1] E. Olson, “Apriltag: A robust and flexible visual fiducial system,” in IEEE International Conference on Robotics and Automation, 2011.
  • [2] R. Y. Tsai and R. K. Lenz, “A new technique for fully autonomous and efficient 3d robotics hand/eye calibration,” IEEE Trans. Robotics Autom., 1988.
  • [3] R. Horaud and F. Dornaika, “Hand-eye calibration,” I. J. Robotic Res., 1995.
  • [4] F. Park and B. Martin, “Robot Sensor Calibration: Solving AX = XB on the Euclidean Group,” IEEE Transactions on Robotics and Automation, 1994.
  • [5] E. Valassakis, K. Dreczkowski, and E. Johns, “Learning eye-in-hand camera calibration from a single image,” in Proceedings of the 5th Conference on Robot Learning, 2022.
  • [6] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, “Decision transformer: Reinforcement learning via sequence modeling,” in Advances in Neural Information Processing Systems, 2021.
  • [7] P. Florence, C. Lynch, A. Zeng, O. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mordatch, and J. Tompson, “Implicit behavioral cloning,” in Conference on Robot Learning (CoRL), 2021.
  • [8] W. Zhi, H. Tang, T. Zhang, and M. Johnson-Roberson, “Unifying representation and calibration with 3d foundation models,” IEEE Robotics and Automation Letters, 2024.
  • [9] W. Zhi, H. Tang, T. Zhang, and M. Johnson-Roberson, “3d foundation models enable simultaneous geometry and pose estimation of grasped objects,” arXiv preprint arXiv:2407.10331, 2024.
  • [10] L. Wu, J. Wang, L. Qi, K. Wu, H. Ren, and M. Q.-H. Meng, “Simultaneous hand–eye, tool–flange, and robot–robot calibration for comanipulation by solving the AXB=YCZ problem,” IEEE Transactions on Robotics, vol. 32, no. 2, pp. 413–428, 2016.
  • [11] Z. Fu, J. Pan, E. Spyrakos-Papastavridis, X. Chen, and M. Li, “A dual quaternion-based approach for coordinate calibration of dual robots in collaborative motion,” IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4086–4093, 2020.
  • [12] Z. Zhou, L. Ma, X. Liu, Z. Cao, and J. Yu, “Simultaneously calibration of multi hand–eye robot system based on graph,” IEEE Transactions on Industrial Electronics, vol. 71, no. 5, pp. 5010–5020, 2023.
  • [13] A. Elfes, “Sonar-based real-world mapping and navigation,” IEEE Journal on Robotics and Automation, 1987.
  • [14] R. Malladi, J. A. Sethian, and B. C. Vemuri, “Shape modeling with front propagation: a level set approach,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995.
  • [15] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon, “Kinectfusion: Real-time dense surface mapping and tracking,” in 2011 10th IEEE International Symposium on Mixed and Augmented Reality, 2011.
  • [16] S. O’Callaghan, F. T. Ramos, and H. Durrant-Whyte, “Contextual occupancy maps using gaussian processes,” in 2009 IEEE International Conference on Robotics and Automation, 2009.
  • [17] W. Zhi, L. Ott, R. Senanayake, and F. Ramos, “Continuous occupancy map fusion with fast bayesian hilbert maps,” in International Conference on Robotics and Automation (ICRA), 2019.
  • [18] W. Zhi, R. Senanayake, L. Ott, and F. Ramos, “Spatiotemporal learning of directional uncertainty in urban environments with kernel recurrent mixture density networks,” IEEE Robotics and Automation Letters, 2019.
  • [19] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, “Deepsdf: Learning continuous signed distance functions for shape representation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [20] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [21] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: deep hierarchical feature learning on point sets in a metric space,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
  • [22] W. Zhi, I. Akinola, K. van Wyk, N. Ratliff, and F. Ramos, “Global and reactive motion generation with geometric fabric command sequences,” in IEEE International Conference on Robotics and Automation, ICRA, 2023.
  • [23] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in ECCV, 2020.
  • [24] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Trans. Graph., 2022.
  • [25] T. Zhang, K. Huang, W. Zhi, and M. Johnson-Roberson, “Darkgs: Learning neural illumination and 3d gaussians relighting for robotic exploration in the dark,” 2024.
  • [26] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” CoRR, 2023.
  • [27] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” CoRR, 2021.
  • [28] S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” in CVPR, 2024.
  • [29] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • [30] V. Leroy, Y. Cabon, and J. Revaud, “Grounding image matching in 3d with mast3r,” 2024.
  • [31] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
  • [32] R. Bommasani and et al., “On the opportunities and risks of foundation models,” CoRR, 2021.
  • [33] R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y. Zhu, S. Song, A. Kapoor, K. Hausman, B. Ichter, D. Driess, J. Wu, C. Lu, and M. Schwager, “Foundation models in robotics: Applications, challenges, and the future,” CoRR, 2023.
  • [34] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning.” http://pybullet.org, 2016–2019.
  • [35] C. M. Bishop and N. M. Nasrabadi, Pattern recognition and machine learning, vol. 4. Springer, 2006.
  • [36] R. Brégier, “Deep regression on manifolds: a 3D rotation case study,” 2021.
  • [37] J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [38] J. Y. Zhang, A. Lin, M. Kumar, T.-H. Yang, D. Ramanan, and S. Tulsiani, “Cameras as rays: Pose estimation via ray diffusion,” in International Conference on Learning Representations (ICLR), 2024.
  • [39] J. Reizenstein, R. Shapovalov, P. Henzler, L. Sbordone, P. Labatut, and D. Novotny, “Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction,” in International Conference on Computer Vision, 2021.
  • [40] C. M. Bishop, Pattern recognition and machine learning, 5th Edition. Information science and statistics, Springer, 2007.
  • [41] H. Tang, T. Zhang, O. Kroemer, M. Johnson-Roberson, and W. Zhi, “Graphseg: Segmented 3d representations via graph edge addition and contraction,” arXiv preprint arXiv:2504.03129, 2025.
  • [42] A. Ten Pas, M. Gualtieri, K. Saenko, and R. Platt, “Grasp pose detection in point clouds,” The International Journal of Robotics Research, vol. 36, no. 13-14, pp. 1455–1473, 2017.
  • [43] W. Zhi, T. Lai, L. Ott, and F. Ramos, “Diffeomorphic transforms for generalised imitation learning,” in Learning for Dynamics and Control Conference, L4DC, 2022.
  • [44] W. Zhi, T. Zhang, and M. Johnson-Roberson, “Instructing robots by sketching: Learning from demonstration via probabilistic diagrammatic teaching,” in IEEE International Conference on Robotics and Automation, 2024.