Another banger post from Anthropic! It's all about improving your agents via evals. Here are my quick takeaways from the blog:

The capabilities that make agents useful (autonomy, intelligence, flexibility) are the same ones that make them hard to evaluate. You can't just run unit tests and expect your agentic app to work. This guide breaks down the practical framework Anthropic devs use for agent evals.

They mention three types of graders, each with trade-offs (rough sketch after the post):
- Code-based graders are fast, cheap, and reproducible, but brittle to valid variations.
- Model-based graders handle nuance and open-ended tasks, but are non-deterministic and require human calibration.
- Human graders are gold-standard quality, but expensive and slow.

They also talk about two categories of evals that serve different purposes:
1) Capability evals ask "what can this agent do well?" and start at low pass rates.
2) Regression evals ask "can it still handle previous tasks?" and should stay near 100%.
Tasks graduating from capability to regression represent real progress.

For non-determinism, two metrics matter: pass@k measures the probability of at least one success in k attempts, while pass^k measures the probability that all k trials succeed. These diverge dramatically: at k=10, pass@k can approach 100% while pass^k falls to near zero (see the sketch below).

A really good tip in the blog is to start with 20-50 simple tasks drawn from real failures rather than waiting for perfection. Convert manual checks you already perform into test cases. Grade outputs, not paths taken. Include partial credit for complex tasks.

Common pitfalls include rigid grading that penalizes equivalent but differently formatted answers, ambiguous task specifications, and stochastic tasks that are impossible to reproduce. ...
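
For the grader trade-offs above, here's a rough sketch (mine, not Anthropic's implementation) of what a code-based grader versus a model-based grader could look like. The function names and the `llm_judge` callable are assumptions for illustration:

```python
# A rough sketch (not from the blog) of the grader trade-off: a code-based
# grader that checks one exact condition, next to a model-based grader that
# delegates nuanced judging to an LLM you supply.

import re

def code_based_grader(output: str, expected_total: float) -> bool:
    """Fast, cheap, reproducible -- but brittle: fails on valid rephrasings
    that don't contain the exact number."""
    match = re.search(r"\$?(\d[\d,]*\.?\d*)", output)
    return match is not None and float(match.group(1).replace(",", "")) == expected_total

def model_based_grader(output: str, rubric: str, llm_judge) -> bool:
    """Handles equivalent-but-differently-worded answers, but is
    non-deterministic and the rubric needs human calibration."""
    verdict = llm_judge(f"Rubric: {rubric}\n\nAgent output: {output}\n\nReply PASS or FAIL.")
    return verdict.strip().upper().startswith("PASS")
```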
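
The pass@k vs. pass^k divergence falls straight out of the math if you assume independent trials with a fixed per-trial success rate p: pass@k = 1 - (1 - p)^k and pass^k = p^k. A quick sketch (the numbers are my own example, not from the blog):

```python
# Why pass@k and pass^k diverge, assuming independent trials with a fixed
# per-trial success rate p.

def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k attempts succeeds: 1 - (1 - p)^k."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k attempts succeed: p^k."""
    return p ** k

p, k = 0.5, 10
print(f"pass@{k} = {pass_at_k(p, k):.3f}")   # ~0.999 -- nearly always succeeds at least once
print(f"pass^{k} = {pass_hat_k(p, k):.3f}")  # ~0.001 -- almost never succeeds every time
```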
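
And for the "convert manual checks into test cases" tip, a minimal sketch of output-only grading with partial credit. The task schema and field names here are my assumptions, not the blog's format:

```python
# A minimal sketch (my assumptions, not the blog's schema) of turning a manual
# check into an eval task: grade the final output only, with partial credit.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    prompt: str                            # what the agent is asked to do
    graders: list[Callable[[str], float]]  # each returns a score in [0, 1] for partial credit
    kind: str = "capability"               # promote to "regression" once it passes reliably

def grade(task: EvalTask, agent_output: str) -> float:
    """Grade the output, not the path taken; average the graders for partial credit."""
    return sum(g(agent_output) for g in task.graders) / len(task.graders)

# Example: a check you previously did by eye, now a repeatable test case.
task = EvalTask(
    prompt="Summarize the Q3 report and list the top 3 risks.",
    graders=[
        lambda out: 1.0 if "risk" in out.lower() else 0.0,    # mentions risks at all
        lambda out: min(out.lower().count("risk"), 3) / 3.0,  # partial credit up to 3 mentions
    ],
)
print(grade(task, "Top risks: churn risk, FX risk, supply risk."))  # 1.0
```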