I played with Gemini 3 yesterday via early access. A few thoughts -

First, I usually urge caution with public benchmarks because imo they can be quite easy to game. It comes down to the discipline and self-restraint of the team (which is meanwhile strongly incentivized otherwise) not to overfit test sets via elaborate gymnastics over test-set-adjacent data in the document embedding space. Realistically, because everyone else is doing it, the pressure to do so is high. So go talk to the model. Talk to the other models too (Ride the LLM Cycle - use a different LLM every day).

I had a positive early impression yesterday across personality, writing, vibe coding, humor, etc. - very solid daily-driver potential, clearly a tier 1 LLM. Congrats to the team!

Over the next few days/weeks, I am most curious about, and on the lookout for, the ensemble of private evals, which a lot of people/orgs now seem to build for themselves and occasionally report on here.