Abstract
        
Robotic manipulation requires accurate perception of the environment, which is challenging due to the environment's inherent complexity and constantly changing nature.
In this context, RGB images and point clouds are two commonly used modalities in vision-based robotic manipulation, but each has its own limitations. Point clouds from commercial sensors often suffer from sparse sampling and noisy output due to the limits of the emission-reception imaging principle. RGB images, on the other hand, are rich in texture but lack the depth and 3D geometric information crucial for robotic manipulation. To mitigate these challenges, we propose an
          image-only robotic manipulation framework that leverages
          an eye-on-hand monocular camera installed on the robot’s
          parallel gripper. By moving with the robot gripper, this camera
can actively perceive objects from multiple perspectives during the manipulation process. This enables
          the estimation of 6D object poses, which can be utilized for
manipulation. While capturing images from more, and more diverse, viewpoints typically improves pose estimation, it also increases manipulation time. To address this trade-off, we employ a
          reinforcement learning policy to synchronize the manipulation
          strategy with active perception, achieving a balance between 6D
pose accuracy and manipulation efficiency. Our experimental results in both simulated and real-world environments demonstrate the state-of-the-art performance of our approach. We believe
          that our method will inspire further research on real-world-oriented
          robotic manipulation.