Frontier LLM architectures have largely converged.
I dug through the HuggingFace transformers code for @Zai_org's newly released GLM-5 (zai-org/GLM-5).
Here's a detailed architectural breakdown, and what it tells us about where LLM design is heading.
TL;DR: Architecturally, GLM-5 closely follows DeepSeek-V3 with minor knob-tuning.
ATTENTION: MLA replaces GQA
The biggest change from GLM-4.7 to GLM-5 is attention.
GLM-4.7 used standard Grouped Query Attention (GQA): 96 Q heads, 8 KV heads, and separate q/k/v projections.
GLM-5 scraps all of that and adopts DeepSeek's Multi-head Latent Attention (MLA).
In the MLA pipeline, queries go through a LoRA-style two-stage projection:
hidden -> q_a_proj to rank 2048 -> RMSNorm -> q_b_proj to 64 heads * 256 dim.
Keys and values are jointly compressed into a single low-rank bottleneck:
hidden -> kv_a_proj to rank 512+64 -> split into a latent KV path and a RoPE path.
The latent part gets expanded back via kv_b_proj into 64 heads of (192 nope + 256 value) dims.
This is the exact same MLA design as DeepSeek-V3.
GLM-5 just tunes the dimensions relative to DeepSeek-V3: q_lora_rank 2048 (vs 1536), v_head_dim 256 (vs 128), qk_nope_head_dim 192 (vs 128).
The kv_lora_rank (512) and qk_rope_head_dim (64) are identical.
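The projection pipeline above can be sketched at the shape level in PyTorch. This is a minimal illustrative sketch, not the actual zai-org/GLM-5 code: the hidden size (6144) and module/class names are assumptions, RoPE application and the attention computation itself are omitted, and RMSNorm is hand-rolled to keep it self-contained.

```python
import torch
import torch.nn as nn

# Dimensions as described in the thread; HIDDEN is an assumed value.
HIDDEN = 6144
N_HEADS = 64
Q_LORA_RANK = 2048
KV_LORA_RANK = 512
QK_ROPE_DIM = 64
QK_NOPE_DIM = 192
V_HEAD_DIM = 256
Q_HEAD_DIM = QK_NOPE_DIM + QK_ROPE_DIM  # 256 per query head

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class MLAProjections(nn.Module):
    """Shape-level sketch of the MLA projections (illustrative names)."""
    def __init__(self):
        super().__init__()
        # Query path: LoRA-style two-stage projection, bias-free throughout.
        self.q_a_proj = nn.Linear(HIDDEN, Q_LORA_RANK, bias=False)
        self.q_a_norm = RMSNorm(Q_LORA_RANK)
        self.q_b_proj = nn.Linear(Q_LORA_RANK, N_HEADS * Q_HEAD_DIM, bias=False)
        # KV path: joint compression into latent (512) + shared RoPE key (64).
        self.kv_a_proj = nn.Linear(HIDDEN, KV_LORA_RANK + QK_ROPE_DIM, bias=False)
        self.kv_a_norm = RMSNorm(KV_LORA_RANK)
        # Expand latent back into per-head nope-keys (192) and values (256).
        self.kv_b_proj = nn.Linear(
            KV_LORA_RANK, N_HEADS * (QK_NOPE_DIM + V_HEAD_DIM), bias=False)

    def forward(self, h):  # h: (batch, seq, HIDDEN)
        b, s, _ = h.shape
        q = self.q_b_proj(self.q_a_norm(self.q_a_proj(h)))
        q = q.view(b, s, N_HEADS, Q_HEAD_DIM)
        kv = self.kv_a_proj(h)
        kv_latent, k_rope = kv.split([KV_LORA_RANK, QK_ROPE_DIM], dim=-1)
        kv = self.kv_b_proj(self.kv_a_norm(kv_latent))
        kv = kv.view(b, s, N_HEADS, QK_NOPE_DIM + V_HEAD_DIM)
        k_nope, v = kv.split([QK_NOPE_DIM, V_HEAD_DIM], dim=-1)
        return q, k_nope, k_rope, v

m = MLAProjections()
q, k_nope, k_rope, v = m(torch.randn(1, 4, HIDDEN))
print(q.shape, k_nope.shape, k_rope.shape, v.shape)
```

Note the inference payoff this sketch makes visible: only the 512-dim latent and the 64-dim shared RoPE key need caching per token, rather than 64 full K/V heads.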
Also, no bias anywhere in attention (attention_bias defaults to False).
Every projection (q_a_proj, q_b_proj, kv_a_proj, kv_b_proj, o_proj, and all DSA indexer projections) is bias-free.
This is now standard practice; among major models released in 2025, only GPT-oss still uses attention bias.
...