Vision-language-action models (VLAs) learn control from images, but they don't understand physics. Video models do. mimic-video proposes Video-Action Models: use a pretrained video diffusion model to predict future trajectories, then decode actions from its latent plan.
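The two-stage idea (frozen video model produces a latent plan, a small head decodes actions from it) can be sketched as below. This is a minimal illustrative toy, not the paper's architecture: the dimensions, the random-projection stand-in for the video model, and names like `ActionDecoder` are all assumptions made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not from the paper).
LATENT_DIM = 64   # latent-plan features per predicted future frame
HORIZON = 8       # number of future frames the video model plans over
ACTION_DIM = 7    # e.g. a 7-DoF arm command per step

def video_model_latent_plan(observation: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen pretrained video diffusion model: map current
    observation features to a latent plan for HORIZON future frames.
    Here it is just a fixed random projection, purely for shape-plumbing."""
    W = rng.standard_normal((observation.shape[-1], HORIZON * LATENT_DIM)) * 0.01
    return (observation @ W).reshape(HORIZON, LATENT_DIM)

class ActionDecoder:
    """Small MLP head that reads the latent plan and emits one action per
    planned frame; in the Video-Action Model framing, a head like this is
    the part trained on robot data while the video model stays pretrained."""
    def __init__(self, hidden: int = 128):
        self.W1 = rng.standard_normal((LATENT_DIM, hidden)) * 0.01
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((hidden, ACTION_DIM)) * 0.01
        self.b2 = np.zeros(ACTION_DIM)

    def __call__(self, latents: np.ndarray) -> np.ndarray:
        h = np.maximum(latents @ self.W1 + self.b1, 0.0)  # ReLU hidden layer
        return h @ self.W2 + self.b2                      # one action per frame

obs = rng.standard_normal(512)          # placeholder image features
plan = video_model_latent_plan(obs)     # shape (HORIZON, LATENT_DIM)
actions = ActionDecoder()(plan)         # shape (HORIZON, ACTION_DIM)
print(actions.shape)
```

The point of the sketch is the division of labor: the video model carries the physics prior in its latent plan, and the action decoder only has to translate that plan into motor commands.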