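Export controls have a huge impact, especially on MLA-based models. Take K2/2.5 as an example: it already reduced num_heads to 64, but the compute intensity with an FP8 KVCache is still ≈2×2×64=256 FLOP/Byte. The H20 only has 148 TFLOPS of BF16 compute, so the KVCache read bandwidth it can actually keep up with is merely 148 TFLOPS ÷ 256 FLOP/Byte ≈ 592 GB/s (taking 1 T = 1024 G; about 578 GB/s in decimal units), far below what its HBM can deliver.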
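
The arithmetic above can be checked with a short Python sketch. The reading of 2×2×64 as (2 FLOP per multiply-add) × (2 matmuls over the cached latent: scores and weighted sum) × (64 heads), at 1 byte per FP8 element, is an assumption about how the figure was derived. The function names are hypothetical.

```python
# Back-of-the-envelope roofline for MLA decode with an FP8 KV cache.
# Assumed reading of "2x2x64": every cached byte is reused by all heads,
# once for the attention scores and once for the weighted sum.

def mla_flop_per_byte(num_heads: int, bytes_per_elem: int = 1) -> float:
    flop_per_mac = 2   # one multiply plus one add
    matmuls = 2        # assumed: scores (QK^T) and output (PV)
    return flop_per_mac * matmuls * num_heads / bytes_per_elem

def compute_bound_bandwidth(peak_flops: float, intensity: float) -> float:
    # Bytes/s of KV cache that can be consumed before compute saturates.
    return peak_flops / intensity

intensity = mla_flop_per_byte(64)                  # FP8: 1 byte per element
bw = compute_bound_bandwidth(148e12, intensity)    # 148 TFLOPS BF16

print(f"{intensity:.0f} FLOP/Byte")
print(f"{bw / 1e9:.0f} GB/s in decimal units")
print(f"{148 * 1024 / intensity:.0f} GB/s if 1 T is counted as 1024 G")
```

Under these assumptions the decimal result is about 578 GB/s, and the 592 GB/s figure falls out when 1 T is taken as 1024 G.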