Abstract
Robotic manipulation requires accurate perception of the
environment, which poses a significant challenge due to the
environment's inherent complexity and constantly changing nature.
In this context, RGB images and point-cloud observations
are two commonly used modalities in vision-based robotic
manipulation, but each has its own limitations. Point-cloud
observations from commercial sensors often suffer from
sparse sampling and noisy output due to the limits of the
emission-reception imaging principle. RGB images, on the
other hand, are rich in texture information but lack the
depth and 3D information crucial for robotic manipulation.
To mitigate these challenges, we propose an
image-only robotic manipulation framework that leverages
an eye-on-hand monocular camera installed on the robot’s
parallel gripper. Moving with the gripper, the camera can
actively perceive the object from multiple perspectives
during manipulation. This enables
the estimation of 6D object poses, which can be utilized for
manipulation. While obtaining images from more diverse
viewpoints typically improves pose estimation, it also increases
manipulation time. To address this trade-off, we employ a
reinforcement learning policy to synchronize the manipulation
strategy with active perception, achieving a balance between 6D
pose accuracy and manipulation efficiency. Our experimental
results in both simulated and real-world environments demonstrate
the state-of-the-art performance of our approach. We believe
that our method will inspire further research on real-world-oriented
robotic manipulation.