I think RL with verifiable rewards will become increasingly important in pushing LLMs toward their own “AlphaZero moment.” It will likely begin with coding, then extend to math, physics, and other domains where models can self-explore, discover out-of-distribution solutions humans might never imagine, and verify them using an absolute reward signal (0/1).

This also reminds me of @elonmusk talking about a future where programs could be generated directly as binaries, without going through the traditional compilation process. That may actually be possible if LLMs can generate binary code and then execute it directly against a verifiable reward.
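To make the “absolute reward signal (0/1)” concrete for the coding case: the idea is that a candidate program either passes every test or it doesn't, with no partial credit and no learned judge in the loop. A minimal sketch of such a verifier, assuming a hypothetical `verifiable_reward` helper that runs candidate Python source against input/output test pairs (the function name and test format are illustrative, not from any specific RL framework):

```python
import subprocess
import sys

def verifiable_reward(candidate_source: str, test_cases: list[tuple[str, str]]) -> int:
    """Binary reward: 1 only if the candidate program passes every test case.

    Each test case is a (stdin, expected_stdout) pair. Any crash, timeout,
    or output mismatch yields 0 -- there is no partial credit, which is
    what makes the signal 'verifiable' rather than a learned approximation.
    """
    for stdin_data, expected in test_cases:
        try:
            result = subprocess.run(
                [sys.executable, "-c", candidate_source],
                input=stdin_data,
                capture_output=True,
                text=True,
                timeout=5,  # guard against non-terminating candidates
            )
        except subprocess.TimeoutExpired:
            return 0
        if result.returncode != 0 or result.stdout.strip() != expected:
            return 0
    return 1
```

For example, a candidate that doubles an integer read from stdin (`"print(int(input()) * 2)"`) scores 1 against test cases like `("3", "6")`, while any buggy variant scores 0. In an RL loop, that single bit is the entire training signal — the model is free to reach it by any means, including solutions no human would write.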