When you build AI agents, don't treat prompts like config strings. Treat them like executable business logic. Because that's what they really are.

@arshdilbagi's blog and this Stanford CS 224G lecture lay out one of the clearest mental models I have seen for LLM evaluation.

Stop treating evals like unit tests. That works for deterministic software. For LLM products, it creates false confidence because real-world usage changes over time.

Example: an insurance prompt passed 20 eval cases. The team shipped. In production, a new class of requests showed up and failed quietly. No crash, no alert, just wrong answers at scale.

The fix is not "write more eval cases," which is what many teams do. It is building evals as a living feedback loop. Start with a small set, ship, watch what breaks in production, add those failures back, and re-run on every prompt or model change.

What eval failure caught your team off guard?

Blog:
Stanford CS 224G lecture:
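
For anyone who wants to see the feedback loop in code, here is a minimal sketch. It is not taken from the blog or the lecture. Everything in it is a hypothetical stand-in: `EvalCase`, `EvalSuite`, the two `prompt_v*` functions (plain Python functions playing the role of an LLM call under two prompt versions), and the routing labels. It also assumes exact-match grading, which real LLM evals often replace with a rubric or a grader model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EvalCase:
    request: str
    expected: str
    source: str = "seed"  # "seed" for the initial set, "production" for added failures


@dataclass
class EvalSuite:
    cases: List[EvalCase] = field(default_factory=list)

    def run(self, model: Callable[[str], str]) -> List[EvalCase]:
        """Return the cases the model gets wrong (exact match is an assumption)."""
        return [c for c in self.cases if model(c.request) != c.expected]

    def add_production_failure(self, request: str, expected: str) -> None:
        """Fold a failure observed in production back into the suite."""
        self.cases.append(EvalCase(request, expected, source="production"))


def prompt_v1(request: str) -> str:
    # Hypothetical stand-in for an LLM call with the first prompt version.
    return "route_to_claims" if "claim" in request.lower() else "unknown"


def prompt_v2(request: str) -> str:
    # Hypothetical stand-in for the revised prompt that handles the new request class.
    text = request.lower()
    if "claim" in text:
        return "route_to_claims"
    if "cancel" in text:
        return "route_to_retention"
    return "unknown"


# 1. Start with a small seed set and ship.
suite = EvalSuite([EvalCase("I want to file a claim", "route_to_claims")])
failures_before = suite.run(prompt_v1)

# 2. A new class of requests fails quietly in production; add it back.
suite.add_production_failure("Please cancel my policy", "route_to_retention")

# 3. Re-run the whole suite on every prompt or model change.
failures_v1 = suite.run(prompt_v1)
failures_v2 = suite.run(prompt_v2)

print(len(failures_before), len(failures_v1), len(failures_v2))
```

The point of the sketch is the shape, not the grading: the suite grows from what production shows you, and every prompt or model change runs against the full, current set.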