We need new benchmarks that reward low-complexity solutions to coding problems. Each new feature is like a Jenga block in a tower, and current benchmarks only rank how well each individual block is made. We need evals that track how tall you can stack the blocks before the tower collapses.
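
One way to make "tower height" concrete is sketched below. It is only an illustration under assumptions of mine, not an existing benchmark: features are applied one at a time, every check accumulated so far is re-run after each addition, and the score is the number of features stacked before any earlier check breaks. The names `stack_height`, `apply_feature`, and the toy dict "codebase" are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: a "codebase" is a plain dict, a feature step
# returns a new codebase, and a check returns True if its feature still works.
Codebase = Dict[str, int]
FeatureStep = Callable[[Codebase], Codebase]
Check = Callable[[Codebase], bool]


def stack_height(initial: Codebase,
                 features: List[Tuple[FeatureStep, Check]]) -> int:
    """Count how many features can be stacked before any check fails."""
    codebase = initial
    accumulated_checks: List[Check] = []
    height = 0
    for apply_feature, check in features:
        codebase = apply_feature(codebase)
        accumulated_checks.append(check)
        # Re-run every check so far, not just the newest one: this is
        # what separates "tower height" from per-block scoring.
        if not all(c(codebase) for c in accumulated_checks):
            break
        height += 1
    return height


# Toy example: the third feature works on its own but breaks the first.
def add_a(cb: Codebase) -> Codebase:
    return {**cb, "a": 1}


def add_b(cb: Codebase) -> Codebase:
    return {**cb, "b": 2}


def add_c_breaking_a(cb: Codebase) -> Codebase:
    new = {**cb, "c": 3}
    new.pop("a", None)  # side effect that silently removes feature "a"
    return new


features = [
    (add_a, lambda cb: cb.get("a") == 1),
    (add_b, lambda cb: cb.get("b") == 2),
    (add_c_breaking_a, lambda cb: cb.get("c") == 3),
]

print(stack_height({}, features))  # → 2
```

A per-block benchmark would give the third feature full marks, since its own check passes. The stacked score stops at 2 because the tower fell when that block went on.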