I don't think we understand the behavior of MoEs this large (particularly with advanced post-DSMoE architectures). But we do know scaling still looks good at 0.8% even at ≈28B total. And clever ways to exploit sparsity beyond "finer grain" become possible. I'd say 1% at 10T is *conservative*.
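
To put rough numbers on those two ratios, here is a quick back-of-envelope sketch. It assumes the percentages mean the fraction of total parameters activated per token, which is my reading and not something stated above. The helper name `active_params` is made up for illustration.

```python
def active_params(total_params: float, active_fraction: float) -> float:
    """Parameters used per token, ASSUMING the percentage is active / total."""
    return total_params * active_fraction


# The two cases from the comment above (interpretation is an assumption).
small = active_params(28e9, 0.008)  # 0.8% at ~28B total
large = active_params(10e12, 0.01)  # 1% at 10T total

print(f"{small / 1e9:.3f}B active out of 28B total")
print(f"{large / 1e9:.0f}B active out of 10T total")
```

Under that reading, 1% at 10T would still be on the order of 100B active parameters per token, versus roughly 0.2B in the ≈28B case.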