"ViT-5: Vision Transformers for The Mid-2020s" This paper shows that the plain Vision Transformer still has plenty of low-hanging fruit, with many under-optimized components. By systematically swapping in modern transformer best practices (e.g. RMSNorm, 2D RoPE combined with absolute positions, QK-norm, register tokens, LayerScale), you get a simple drop-in ViT backbone that is notably stronger and much more stable, without changing the core attention+FFN recipe at all!
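To make two of the listed swaps concrete, here is a minimal NumPy sketch of RMSNorm and QK-norm as they are commonly defined; the function names, shapes, and single-head setup are illustrative assumptions, not code from the paper:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square over the feature axis.
    # Unlike LayerNorm, there is no mean subtraction (and here no learned
    # gain, for brevity).
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def qk_norm_attention(q, k, v):
    # QK-norm: RMS-normalize queries and keys before the dot product,
    # which bounds the attention logits and tends to stabilize training.
    # Single-head, unbatched attention for illustration: q, k, v are
    # (num_tokens, dim) arrays.
    q, k = rms_norm(q), rms_norm(k)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over keys.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

In a full ViT block these would replace LayerNorm and plain scaled dot-product attention respectively, leaving the rest of the attention+FFN structure untouched.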