1/ Dense MLPs are a lie. The standard transformers we train are already doing sparse routing inside their feedforward layers. We just couldn't see it until now. 🧵
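A minimal sketch of what "sparse routing" means here (PyTorch; the toy layer, dims, and names are mine, not from the thread): after the MLP's nonlinearity, each token activates only a subset of hidden neurons, so the layer effectively routes each token through its own sub-network.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy feedforward block; dims are illustrative only.
d_model, d_ff, n_tokens = 512, 2048, 64
mlp_in = nn.Linear(d_model, d_ff)
act = nn.ReLU()

x = torch.randn(n_tokens, d_model)
h = act(mlp_in(x))  # hidden activations after the nonlinearity

# Fraction of hidden units that are exactly zero, averaged over tokens.
# At random init, ReLU zeroes roughly half; the thread's claim is that
# training pushes this fraction much higher.
sparsity = (h == 0).float().mean().item()
print(f"mean activation sparsity: {sparsity:.2%}")

# "Routing" view: each token only touches the neurons it activates,
# so the active set per token plays the role of a selected expert subset.
active_per_token = (h > 0).sum(dim=-1)
print(f"active neurons per token: {active_per_token.float().mean().item():.0f} / {d_ff}")
```

Run on a trained checkpoint instead of this random init, and the per-token active set is what the rest of the thread treats as the hidden "experts."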