I remember, ~2.5 years ago with @_lewtun, @edwardbeeching, and co at @huggingface, how it took months to get DPO working right. Today, coding agents can build an entire repository from scratch, referencing high-quality implementations and discussing trade-offs, and run a representative training job on your desk -- in this case a 1B model on thousands of samples. It really changes the accessibility of AI research and tinkering, along with what it means to work in AI.

I just merged the PR for this, which adds a bunch of direct alignment algorithms (DPO etc.) to the rlhfbook code repo, and it's remarkable how much easier this is today. I'm feeling even more confident about what the book is becoming -- a dense source of intuitions for what actually works with models, free of hallucinations and hype. Students can use it as a reference alongside code and experiments that AI models can spin up in an afternoon. At its best, the RLHF Book will become a central place for people to discuss, iterate, and build community around this learning material.
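
For anyone wondering what "direct alignment algorithms" boil down to in code, here's a rough sketch of the core DPO loss -- not the exact code in the repo, just the idea: push up the policy's log-probability margin on chosen vs. rejected responses, measured relative to a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log prob of chosen response under model being trained
    policy_rejected_logps: torch.Tensor,  # log prob of rejected response under model being trained
    ref_chosen_logps: torch.Tensor,       # same, under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # strength of the implicit KL constraint
) -> torch.Tensor:
    # DPO: -log sigmoid(beta * [(policy - ref) log-ratio on chosen minus on rejected])
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

No reward model, no rollouts -- just sequence log-probs from two models over a preference dataset, which is a big part of why an agent can get a representative run going on a 1B model in an afternoon.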