YOLH
Imitation Learning

A robot learning framework that enables robots to learn manipulation tasks directly from human hand demonstrations

Fig. YOLH pipeline overview.

Overview

YOLH is a robot learning project that aims to learn manipulation policies directly from human RGB video demonstrations, without requiring robot-collected data.

Traditional approaches collect demonstrations with handheld grippers or directly on robots, which is expensive and difficult to scale. In contrast, human videos are abundant and easy to collect, but they introduce a key challenge known as the embodiment gap: the difference between human hand motion and robot control.

To address this, YOLH leverages a 3D voxel-based representation that aligns visual observations with the robot's action space, enabling the model to infer executable robot actions from human demonstrations.
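As a rough illustration of what such a shared 3D representation can look like, the sketch below scatters an RGB point cloud into a fixed voxel grid over the workspace. The grid resolution, workspace bounds, and feature-averaging scheme are assumptions for illustration, not the project's actual design.

```python
# Illustrative voxelization sketch (assumed bounds/resolution, not YOLH's exact scheme).
import numpy as np

def voxelize_point_cloud(points, features, bounds, resolution=64):
    """Scatter an (N, 3) point cloud with per-point features into a dense voxel grid.

    points:   (N, 3) xyz coordinates in the robot/world frame
    features: (N, C) per-point features, e.g. RGB values
    bounds:   ((xmin, ymin, zmin), (xmax, ymax, zmax)) workspace limits
    """
    lo, hi = np.asarray(bounds[0]), np.asarray(bounds[1])
    # Normalize points into the workspace, then convert to integer voxel indices.
    idx = ((points - lo) / (hi - lo) * resolution).astype(int)
    inside = np.all((idx >= 0) & (idx < resolution), axis=1)
    idx, feats = idx[inside], features[inside]

    grid = np.zeros((resolution, resolution, resolution, features.shape[1]))
    count = np.zeros((resolution, resolution, resolution, 1))
    # Average the features of all points that fall into the same voxel.
    np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), feats)
    np.add.at(count, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
    return grid / np.maximum(count, 1.0)
```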

Step 1


Mask the hand in the video

SAM2
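A minimal sketch of this step using the SAM2 video-predictor interface from its public example (entry points may differ across versions). The checkpoint path, config name, and the click prompt used to select the hand are assumptions.

```python
# Sketch of Step 1: segment the hand across a clip with SAM2 (paths and prompt are assumed).
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",   # assumed config name
    "checkpoints/sam2.1_hiera_large.pt",    # assumed checkpoint path
)

with torch.inference_mode():
    state = predictor.init_state(video_path="demo_frames/")  # directory of JPEG frames

    # Prompt SAM2 with one positive click on the hand in the first frame
    # (in practice the prompt could come from a hand detector instead).
    predictor.add_new_points_or_box(
        inference_state=state, frame_idx=0, obj_id=1,
        points=np.array([[420, 260]], dtype=np.float32),  # assumed pixel location
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the hand mask through the whole clip.
    hand_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        hand_masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```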

Step 2


Hand Object Detector

Contact state F or P → gripper CLOSE

Otherwise → gripper OPEN
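A small sketch of how this rule could look in code, assuming a 100DOH-style hand-object detector whose per-frame contact labels include P (portable object) and F (stationary object); the function name and the example sequence are illustrative.

```python
# Sketch of Step 2: map per-frame hand contact labels to a binary gripper command.
def gripper_command(contact_state: str) -> str:
    """F (stationary object) or P (portable object) means the hand is holding
    something, so the gripper should CLOSE; any other state
    (N: no contact, S: self contact, O: other person) maps to OPEN."""
    return "CLOSE" if contact_state in {"F", "P"} else "OPEN"

# Example on a short per-frame sequence of contact labels.
frames = ["N", "N", "P", "P", "P", "N", "F", "F"]
commands = [gripper_command(s) for s in frames]
# -> ['OPEN', 'OPEN', 'CLOSE', 'CLOSE', 'CLOSE', 'OPEN', 'CLOSE', 'CLOSE']
```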

Step 3


HaMeR (Hand Mesh Recovery)


Fig. 3D voxel grid and point cloud for the gripper.
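As a hedged sketch of how the recovered hand could be retargeted to the gripper, the snippet below derives a parallel-jaw grasp pose from 3D hand keypoints (e.g. HaMeR output). The 21-joint indexing follows the common MANO/OpenPose convention, and the retargeting rule itself is an illustrative assumption rather than the project's method.

```python
# Sketch of Step 3: retarget 3D hand keypoints to a parallel-jaw gripper pose.
import numpy as np

def hand_to_gripper(joints_3d: np.ndarray):
    """joints_3d: (21, 3) hand keypoints in the camera/world frame
    (0 = wrist, 4 = thumb tip, 8 = index tip, per the common convention).

    Returns (position, approach_dir, grasp_width):
      position     - midpoint between thumb and index fingertips
      approach_dir - unit vector from the wrist toward the grasp point
      grasp_width  - thumb-to-index distance, a proxy for gripper opening
    """
    wrist, thumb_tip, index_tip = joints_3d[0], joints_3d[4], joints_3d[8]
    position = 0.5 * (thumb_tip + index_tip)
    approach = position - wrist
    approach_dir = approach / np.linalg.norm(approach)
    grasp_width = float(np.linalg.norm(thumb_tip - index_tip))
    return position, approach_dir, grasp_width
```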


Exciting content is coming soon

Timeline

Milestone 1
Literature review, system design, and environment setup.

Finished

Milestone 2
Implementation of data processing pipeline.

Finished

Milestone 3
Implementation of embodiment alignment
(hand replacement + arm masking).

Finished

Milestone 5
Evaluation and final report writing.