This BullshitBench result goes a long way toward explaining the widespread intuition that Claude is the best daily driver, despite Google’s and OpenAI’s eye-popping benchmarks. Contrast BullshitBench with the problem-solving benchmarks: all of the latter presuppose that each problem has a correct solution. But in real life, problems are poorly defined, and it’s often unclear which questions are worth asking or even have answers. You need a model that can steer you off the wrong path, i.e., call bullshit.