A robotic learning framework that enables robots to learn manipulation tasks directly from human hand demonstrations.
Overview
YOLH is a robot learning project that aims to learn manipulation policies directly from human RGB video demonstrations, without requiring robot-collected data.
Traditional approaches rely on hand-held grippers or robot teleoperation, which are expensive and difficult to scale. In contrast, human videos are abundant and easy to collect, but they introduce a key challenge known as the embodiment gap: the difference between human motion and robot control.
To address this, YOLH leverages a 3D voxel-based representation that aligns visual observations with the robot's action space, enabling the model to infer executable robot actions from human demonstrations.
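As a rough illustration of the voxel idea (the grid size, workspace bounds, and binary occupancy encoding below are assumptions for the sketch, not YOLH's actual design), a point cloud from the scene can be quantized into a fixed 3D occupancy grid that both the human demonstration and the robot observation can share:

```python
import numpy as np

def voxelize(points: np.ndarray, grid_size: int = 32, bounds=(-0.5, 0.5)) -> np.ndarray:
    """Quantize an (N, 3) point cloud into a dense binary occupancy grid.

    `grid_size` and `bounds` are illustrative placeholders.
    """
    lo, hi = bounds
    grid = np.zeros((grid_size,) * 3, dtype=np.float32)
    # Map world coordinates into integer voxel indices.
    idx = np.floor((points - lo) / (hi - lo) * grid_size).astype(int)
    # Discard points that fall outside the workspace bounds.
    valid = np.all((idx >= 0) & (idx < grid_size), axis=1)
    idx = idx[valid]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid
```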
Step 1: Mask the hand in the video using SAM2.
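A minimal sketch of the hand-masking step with the SAM2 video predictor, assuming the `sam2` package and a downloaded checkpoint are available; the config/checkpoint paths, frame directory, and the point prompt on the hand are placeholders, not values from this project:

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder paths; substitute your local SAM2 config and checkpoint.
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)

with torch.inference_mode():
    state = predictor.init_state(video_path="demo_frames/")  # directory of JPEG frames
    # Prompt the first frame with a positive click on the hand (placeholder pixel).
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[320, 240]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )
    # Propagate the hand mask through the rest of the video.
    hand_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        hand_masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```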
Step 2: Infer the gripper state with the Hand Object Detector, based on the detected hand contact state:
F/P (contact with a fixed or portable object) → CLOSE
otherwise → OPEN
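A sketch of that mapping (the contact labels follow the Hand Object Detector's convention of N/S/O/P/F for no contact, self, other person, portable object, and fixed object; the mapping function itself is an assumed formulation, not confirmed project code):

```python
def gripper_state(contact_label: str) -> str:
    """Map a per-frame hand contact label to a binary gripper command."""
    # F (fixed object) or P (portable object) means the hand is grasping something.
    return "CLOSE" if contact_label in ("F", "P") else "OPEN"
```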
Step 3: Recover the 3D hand mesh with HaMeR (Hand Mesh Recovery).
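One hypothetical way to retarget the recovered hand to a parallel-jaw gripper (this rule and the joint indices are assumptions for illustration, not YOLH's published retargeting; indices follow the common 21-keypoint hand layout with 0 = wrist, 4 = thumb tip, 8 = index fingertip, so verify against HaMeR's actual keypoint ordering):

```python
import numpy as np

def hand_to_gripper(joints_3d: np.ndarray):
    """Convert (21, 3) hand keypoints into a coarse gripper pose and opening width."""
    wrist, thumb_tip, index_tip = joints_3d[0], joints_3d[4], joints_3d[8]
    # Gripper center: midpoint between the two fingertips.
    center = (thumb_tip + index_tip) / 2.0
    # Gripper opening: fingertip separation.
    width = float(np.linalg.norm(thumb_tip - index_tip))
    # Approach direction: from the wrist toward the fingertip midpoint.
    approach = center - wrist
    approach /= np.linalg.norm(approach)
    return center, approach, width
```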
Fig. 3D voxel grid and point cloud for the gripper.