This paper shocked me 🤯

Everyone on X keeps bragging about "LLM-as-a-judge" like it's some magical truth oracle. But this paper shows something insane: most LLM evaluations you've seen are biased by design, not because the models are bad, but because the judge itself quietly misrepresents the score.

Here's the wild part:
If a judge is slightly bad at catching wrong answers (low specificity), it inflates accuracy.
If it's slightly bad at recognizing correct answers (low sensitivity), it deflates accuracy.

Same model. Same outputs. But two different judges = two different "accuracies."

The authors show the math, the error curves, and the exact point where the judge starts lying to you without meaning to.

So they built a fix: a plug-in estimator that adjusts the judged score back to the real score using calibration data, plus a confidence interval that finally reflects uncertainty from both the eval set and the calibration set.

Here's what shocked me: they even show how to allocate calibration samples efficiently so you don't waste budget, something nobody in LLM eval talks about.

...
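For intuition, here's a minimal sketch of how this kind of correction works. This is my own toy code, not the paper's implementation: it uses the standard misclassification (Rogan-Gladen-style) plug-in, where sensitivity and specificity would in practice be estimated from a labeled calibration set. The specific numbers below are made up for illustration.

```python
def judged_accuracy(true_acc: float, sensitivity: float, specificity: float) -> float:
    """Accuracy the judge *reports*: true positives it catches,
    plus wrong answers it mistakenly passes (false positives)."""
    return sensitivity * true_acc + (1.0 - specificity) * (1.0 - true_acc)

def corrected_accuracy(judged_acc: float, sensitivity: float, specificity: float) -> float:
    """Plug-in correction: invert the bias to recover the true accuracy.
    Requires sensitivity + specificity > 1 (judge better than a coin flip)."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge is no better than random; correction is undefined")
    return (judged_acc + specificity - 1.0) / denom

# Hypothetical judge: 95% sensitivity, 80% specificity (slightly bad at
# catching wrong answers). True model accuracy: 70%.
judged = judged_accuracy(0.70, sensitivity=0.95, specificity=0.80)
# judged = 0.95*0.70 + 0.20*0.30 = 0.725  -> inflated by 2.5 points

recovered = corrected_accuracy(judged, sensitivity=0.95, specificity=0.80)
# (0.725 + 0.80 - 1) / (0.95 + 0.80 - 1) = 0.525 / 0.75 = 0.70  -> true score back
```

Note the inflation direction matches the post: with specificity below 1, every wrong answer the judge waves through pushes the reported score up, and the effect flips (deflation) when sensitivity drops instead. The paper's actual estimator additionally propagates the sampling noise in the calibration-set estimates of sensitivity and specificity into the confidence interval.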