A robotic learning framework that enables robots to learn manipulation tasks directly from human hand demonstrations.
Overview
YOLH is a robot learning project that aims to learn manipulation policies directly from human RGB video demonstrations, without requiring robot-collected data.
Traditional approaches rely on hand-held grippers or robot teleoperation, which are expensive and difficult to scale. In contrast, human videos are abundant and easy to collect, but they introduce a key challenge known as the embodiment gap: the difference between human motion and robot control.
To address this, YOLH leverages a 3D voxel-based representation that aligns visual observations with the robot's action space, enabling the model to infer executable robot actions from human demonstrations.
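As a rough illustration of the voxel idea (the grid size, workspace bounds, and binary occupancy encoding below are assumptions for the sketch, not YOLH's actual design), a point cloud from the scene can be quantized into a fixed 3D occupancy grid that both the human demonstration and the robot observation can share:

```python
import numpy as np

def voxelize(points: np.ndarray, grid_size: int = 32, bounds=(-0.5, 0.5)) -> np.ndarray:
    """Quantize an (N, 3) point cloud into a dense binary occupancy grid.

    `grid_size` and `bounds` are illustrative placeholders.
    """
    lo, hi = bounds
    grid = np.zeros((grid_size,) * 3, dtype=np.float32)
    # Map world coordinates into integer voxel indices.
    idx = np.floor((points - lo) / (hi - lo) * grid_size).astype(int)
    # Discard points that fall outside the workspace bounds.
    valid = np.all((idx >= 0) & (idx < grid_size), axis=1)
    idx = idx[valid]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid
```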
Step 1: Mask the hand in the video using SAM2.
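A minimal sketch of the hand-masking step with the SAM2 video predictor, assuming the `sam2` package and a downloaded checkpoint are available; the config/checkpoint paths, frame directory, and the point prompt on the hand are placeholders, not values from this project:

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder paths; substitute your local SAM2 config and checkpoint.
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)

with torch.inference_mode():
    state = predictor.init_state(video_path="demo_frames/")  # directory of JPEG frames
    # Prompt the first frame with a positive click on the hand (placeholder pixel).
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[320, 240]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )
    # Propagate the hand mask through the rest of the video.
    hand_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        hand_masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```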
Step 2: Infer the gripper state with the Hand Object Detector, based on the detected hand contact state:
F/P (contact with a fixed or portable object) → CLOSE
otherwise → OPEN
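A sketch of that mapping (the contact labels follow the Hand Object Detector's convention of N/S/O/P/F for no contact, self, other person, portable object, and fixed object; the mapping function itself is an assumed formulation, not confirmed project code):

```python
def gripper_state(contact_label: str) -> str:
    """Map a per-frame hand contact label to a binary gripper command."""
    # F (fixed object) or P (portable object) means the hand is grasping something.
    return "CLOSE" if contact_label in ("F", "P") else "OPEN"
```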
Step 3: Recover the 3D hand mesh with HaMeR (Hand Mesh Recovery).
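One hypothetical way to retarget the recovered hand to a parallel-jaw gripper (this rule and the joint indices are assumptions for illustration, not YOLH's published retargeting; indices follow the common 21-keypoint hand layout with 0 = wrist, 4 = thumb tip, 8 = index fingertip, so verify against HaMeR's actual keypoint ordering):

```python
import numpy as np

def hand_to_gripper(joints_3d: np.ndarray):
    """Convert (21, 3) hand keypoints into a coarse gripper pose and opening width."""
    wrist, thumb_tip, index_tip = joints_3d[0], joints_3d[4], joints_3d[8]
    # Gripper center: midpoint between the two fingertips.
    center = (thumb_tip + index_tip) / 2.0
    # Gripper opening: fingertip separation.
    width = float(np.linalg.norm(thumb_tip - index_tip))
    # Approach direction: from the wrist toward the fingertip midpoint.
    approach = center - wrist
    approach /= np.linalg.norm(approach)
    return center, approach, width
```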
Fig. 3D voxel grid and point cloud for the gripper.