Web Data to Real-World Action: Enabling Robots to Master Unseen Tasks


To bring the vision of robot manipulators assisting with everyday activities in cluttered environments like living rooms, offices, and kitchens closer to reality, it’s essential to create robot policies that can generalize to new tasks in unfamiliar settings.

In a new paper Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation, a research team from Google DeepMind, Carnegie Mellon University, and Stanford University presents Gen2Act, a language-conditioned robot manipulation framework. The system generalizes to unseen tasks by leveraging publicly available web data, eliminating the need to collect robot data specific to every new task.

The core idea behind Gen2Act is to leverage zero-shot human video generation from web-trained models as a general-purpose way of predicting motion. By tapping into advances in video generation, the researchers design a robot policy conditioned on these generated videos, enabling the robot to perform tasks that never appear in its own interaction data.

The approach taken by Gen2Act involves two key steps: generating a human video of the task conditioned on language, and then translating that video into robot actions with a closed-loop policy. Instead of directly generating robot videos, the team opted for human videos, since web-trained video models can already produce plausible human motion in novel scenarios without additional training. To train the translation from video to action, a human video is generated for each robot demonstration from the first frame of the trajectory and a language description of the task, and the policy learns to map that generated video to the demonstrated actions. The resulting closed-loop policy is conditioned not only on the generated human video but also on the history of the robot's visual observations, allowing it to adapt its behavior dynamically to the scene.
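To make the conditioning concrete, here is a minimal sketch of what such a video-conditioned, closed-loop policy could look like in PyTorch. The module names, the simple convolutional frame encoder, the transformer fusion, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VideoConditionedPolicy(nn.Module):
    """Illustrative sketch: maps a generated human video plus the robot's
    recent observation history to an action. Architecture details are
    assumptions, not the Gen2Act implementation."""

    def __init__(self, feat_dim=256, action_dim=7, n_heads=8, n_layers=4):
        super().__init__()
        # Shared frame encoder for video frames and robot observations (assumed CNN).
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Transformer that fuses generated-video tokens with observation-history tokens.
        layer = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(feat_dim, action_dim)

    def encode_frames(self, frames):
        # frames: (B, T, 3, H, W) -> (B, T, feat_dim)
        b, t = frames.shape[:2]
        feats = self.frame_encoder(frames.flatten(0, 1))
        return feats.view(b, t, -1)

    def forward(self, human_video, obs_history):
        video_tokens = self.encode_frames(human_video)  # conditioning from the generated video
        obs_tokens = self.encode_frames(obs_history)    # robot's recent visual observations
        fused = self.fusion(torch.cat([video_tokens, obs_tokens], dim=1))
        # Predict the next action from the final (most recent observation) token.
        return self.action_head(fused[:, -1])
```

In a closed-loop rollout, the generated human video stays fixed while the observation history is refreshed at every control step, so the policy can react to how the scene actually evolves.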

To enhance the robot's ability to understand and replicate motion, the team extracts point tracks from both the generated human videos and the robot's own observation videos. During training, an auxiliary track prediction loss encourages the policy's latent features to capture these motion cues, and it is combined with the standard behavior cloning loss, grounding the policy's understanding of object motion in the scene while improving action prediction.
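A hedged sketch of how such a combined objective could be written: a behavior cloning term on predicted actions plus a weighted auxiliary term that regresses point tracks from the policy's latents. The track head, the loss weighting, the use of mean squared error, and the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def gen2act_style_loss(policy_latents, pred_actions, expert_actions,
                       track_head, gt_tracks, track_weight=0.5):
    """Illustrative combined objective (assumed form):
    behavior cloning loss + weighted point-track prediction loss.

    policy_latents: (B, D) latent features from the policy
    pred_actions / expert_actions: (B, action_dim)
    gt_tracks: (B, num_points, T, 2) ground-truth point tracks
    track_head: module mapping latents to a flat track prediction
    """
    # Standard behavior cloning term: imitate the demonstrated actions.
    bc_loss = F.mse_loss(pred_actions, expert_actions)

    # Auxiliary term: force the latents to carry motion information by
    # predicting how tracked points move across the video.
    pred_tracks = track_head(policy_latents).view_as(gt_tracks)
    track_loss = F.mse_loss(pred_tracks, gt_tracks)

    return bc_loss + track_weight * track_loss
```

In a setup like this, the track head would typically serve only as a training-time signal; at execution time the policy just predicts actions.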

The Gen2Act system demonstrated strong results across a wide range of real-world tasks. Its ability to infer motion from generated human videos and its point track prediction objective allowed it to solve manipulation tasks never encountered during training. When tested on object types and motion types outside its training data, Gen2Act achieved around 30% higher absolute success rates than competitive baselines. The system can also handle long-horizon activities, such as "making coffee," by sequencing multiple sub-tasks.
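One simple way such sequencing could be orchestrated is sketched below: each sub-instruction triggers a fresh human video generation from the current scene, followed by a closed-loop rollout of the video-conditioned policy. The sub-task list, the `env`, `video_generator`, and `policy` interfaces, and the termination check are all hypothetical, shown only to illustrate the chaining idea.

```python
# Hypothetical sub-instructions for a long-horizon "making coffee" task.
SUBTASKS = ["pick up the mug", "place it under the dispenser", "press the brew button"]

def run_long_horizon(env, video_generator, policy, subtasks=SUBTASKS, max_steps=200):
    """For each language sub-instruction: generate a human video from the
    current scene image, then roll out the video-conditioned policy."""
    for instruction in subtasks:
        first_frame = env.get_image()                      # current scene observation (assumed API)
        human_video = video_generator(first_frame, instruction)
        for _ in range(max_steps):
            obs_history = env.get_recent_images()          # robot's recent visual history (assumed API)
            action = policy(human_video, obs_history)
            env.step(action)
            if env.subtask_done(instruction):              # assumed success check
                break
```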

In summary, this research illustrates how models trained on non-robotic datasets like web videos can significantly improve a robot’s ability to generalize its manipulation skills to unseen tasks, all without the need for extensive robot-specific data collection for each new task.

The paper Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation is on arXiv.

Author: Hecate He | Editor: Chain Zhang

