The frontier exploration of LLM architectures has largely converged. I dug through the HuggingFace transformers code for @Zai_org's newly released GLM-5 (zai-org/GLM-5). Here's a detailed architectural breakdown, and what it tells us about where LLM design is heading.

TL;DR: Architecturally, GLM-5 closely follows DeepSeek-V3 with minor knob-tuning.

ATTENTION: MLA replaces GQA

The biggest change from GLM-4.7 to GLM-5 is attention. GLM-4.7 used standard Grouped Query Attention (GQA): 96 Q heads, 8 KV heads, separate q/k/v projections. GLM-5 scraps all of that and adopts DeepSeek's Multi-head Latent Attention (MLA).

In the MLA pipeline, queries go through a LoRA-style two-stage projection: hidden -> q_a_proj to rank 2048 -> RMSNorm -> q_b_proj to 64 heads * 256 dim. Keys and values are jointly compressed into a single low-rank bottleneck: hidden -> kv_a_proj to rank 512+64 -> split into a latent KV path and a RoPE path. The latent part gets expanded back via kv_b_proj into 64 heads of (192 nope + 256 value) dims.

This is the exact same MLA design as DeepSeek-V3; GLM-5 just tunes the dimensions: q_lora_rank 2048 vs 1536, v_head_dim 256 vs 128, qk_nope_head_dim 192 vs 128. The kv_lora_rank (512) and qk_rope_head_dim (64) are identical.

Also, no bias anywhere in attention (attention_bias defaults to False). Every projection (q_a_proj, q_b_proj, kv_a_proj, kv_b_proj, o_proj, and all DSA indexer projections) is bias-free. This is now standard practice; among major models released in 2025, only GPT-oss still uses attention bias.

...
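To make the projection pipeline concrete, here is a minimal numpy sketch of the MLA shapes described above. This is not GLM-5's actual code: the hidden size (4096) and the random weight matrices are assumptions for illustration, and only the variable names mirror the config fields quoted in the thread (q_lora_rank 2048, kv_lora_rank 512, qk_rope_head_dim 64, qk_nope_head_dim 192, v_head_dim 256, 64 heads).

```python
import numpy as np

# Assumed dims; hidden size is a placeholder, the rest match the quoted config.
hidden, n_heads = 4096, 64
q_rank, kv_rank = 2048, 512
d_rope, d_nope, d_v = 64, 192, 256

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8, hidden))            # (batch, seq, hidden)

def rmsnorm(t, eps=1e-6):
    return t / np.sqrt((t ** 2).mean(-1, keepdims=True) + eps)

# Query path: LoRA-style two-stage projection with an RMSNorm in between.
q_a = rng.standard_normal((hidden, q_rank))                       # q_a_proj
q_b = rng.standard_normal((q_rank, n_heads * (d_nope + d_rope)))  # q_b_proj
q = (rmsnorm(x @ q_a) @ q_b).reshape(1, 8, n_heads, d_nope + d_rope)

# KV path: one joint down-projection to rank 512+64, then split into the
# latent KV part and the shared RoPE key part.
kv_a = rng.standard_normal((hidden, kv_rank + d_rope))            # kv_a_proj
compressed = x @ kv_a
kv_latent, k_rope = compressed[..., :kv_rank], compressed[..., kv_rank:]

# kv_b_proj expands the 512-dim latent back into per-head keys and values;
# only the latent (plus the 64-dim RoPE key) ever needs to live in the cache.
kv_b = rng.standard_normal((kv_rank, n_heads * (d_nope + d_v)))   # kv_b_proj
kv = (rmsnorm(kv_latent) @ kv_b).reshape(1, 8, n_heads, d_nope + d_v)
k_nope, v = kv[..., :d_nope], kv[..., d_nope:]

print(q.shape, kv_latent.shape, k_rope.shape, k_nope.shape, v.shape)
```

The payoff is visible in the shapes: the cache holds a 512-dim latent plus a 64-dim RoPE key per token instead of 64 full key/value heads.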