The $1 million AI benchmark

Instead of asking “is it correct?”, this one asks: “would someone pay for it?”

Across $1 million worth of real expert tasks, top models complete only about 40–48%. The best one: Claude Opus-4.6.

The big gap isn’t knowledge, it’s execution. Models miss steps, constraints, and details.

AI is powerful. Just not reliable end-to-end yet. Curious to see this number grow 📈
Link to study: