Robots could learn complex skills just by watching human videos. Researchers from Tsinghua, MIT, and Astribot present CLAP. Their new method aligns video frames with robot movement data, creating a shared "action dictionary" that translates human actions into executable robot commands. It outperforms existing models in transferring skills from human videos to robots, enabling better instruction following and precise manipulation.

CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos

Paper:
Project:
Our report: 📬

#PapersAccepted by Jiqizhixin
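
For readers curious how such a contrastive alignment might look in practice, below is a minimal sketch of a CLIP-style objective that pulls paired video-clip and robot-action embeddings together in a shared latent space. The encoder projections, dimensions, temperature, and InfoNCE formulation here are illustrative assumptions, not CLAP's actual architecture or loss.

```python
# Minimal sketch of a CLIP-style contrastive alignment between video-clip
# embeddings and robot action-sequence embeddings in a shared latent space.
# Dimensions, projections, and temperature are illustrative assumptions,
# not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveLatentActionAligner(nn.Module):
    def __init__(self, video_dim=768, action_dim=64, latent_dim=256, temperature=0.07):
        super().__init__()
        # Project each modality into the shared latent "action dictionary" space.
        self.video_proj = nn.Linear(video_dim, latent_dim)
        self.action_proj = nn.Linear(action_dim, latent_dim)
        self.temperature = temperature

    def forward(self, video_feats, action_feats):
        # video_feats:  (B, video_dim)  pooled features for a video clip
        # action_feats: (B, action_dim) pooled features for the paired action sequence
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        a = F.normalize(self.action_proj(action_feats), dim=-1)

        # Similarity matrix: matching (video, action) pairs lie on the diagonal.
        logits = v @ a.t() / self.temperature
        targets = torch.arange(v.size(0), device=v.device)

        # Symmetric InfoNCE: pull paired embeddings together, push mismatched ones apart.
        loss_v2a = F.cross_entropy(logits, targets)
        loss_a2v = F.cross_entropy(logits.t(), targets)
        return (loss_v2a + loss_a2v) / 2


if __name__ == "__main__":
    model = ContrastiveLatentActionAligner()
    video = torch.randn(32, 768)   # e.g., pooled visual features per clip (assumed)
    actions = torch.randn(32, 64)  # e.g., pooled trajectory features (assumed)
    loss = model(video, actions)
    print(f"contrastive alignment loss: {loss.item():.4f}")
```

Under this assumed setup, each batch pairs a video clip with its corresponding action sequence, and the symmetric loss shapes the shared latent space that serves as the "action dictionary" described above; how CLAP actually forms and encodes those pairs is detailed in the paper.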