In collaboration with @AMD and @IBM, we @ZyphraAI are sharing ZAYA1-base: the first large-scale model trained on an integrated AMD hardware, software, and networking stack. ZAYA1 uses Zyphra’s novel MoE architecture with 760M active and 8.3B total params. Tech paper and more below👇
PR:
Technical Blog:
Technical Paper:
Hugging Face:
Architecturally, ZAYA1 follows our “MoE++” recipe:
- Compressed Convolutional Attention (CCA) []
- New ZAYA1 router
- Per-layer residual scaling with learned gates (sketch below)
These give better scaling curves (per FLOP and per parameter) than standard MoE.
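To make the last item concrete, here is a minimal PyTorch sketch of per-layer residual scaling with a learned gate. The scalar sigmoid gate and the class/parameter names are illustrative assumptions, not ZAYA1’s exact parameterization.

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Minimal sketch: a learned per-layer gate scales the sublayer's
    contribution to the residual stream. The parameterization here is an
    assumption, not taken from the ZAYA1 paper."""
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.gate = nn.Parameter(torch.zeros(1))  # one learned scalar per layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # scale the sublayer output before adding it back to the residual stream
        return x + torch.sigmoid(self.gate) * self.sublayer(x)
```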
The ZAYA1 router replaces traditional linear routers with:
- A downprojection of the residual stream
- Exponential Depth Averaging (EDA) to mix routing information across layers
- A 3-layer MLP per expert
- A control-theory-inspired balancing scheme that keeps experts both busy and specialized
A rough sketch of the routing path is below.
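A hedged PyTorch sketch of that path (downproject → EDA → MLP scoring). The dimensions, the EDA decay, and collapsing the per-expert MLPs into one shared MLP that emits all expert logits are simplifications for illustration; the balancing scheme is omitted.

```python
import torch
import torch.nn as nn

class ZAYA1StyleRouter(nn.Module):
    """Illustrative sketch, not the released implementation.
    The per-expert MLPs are simplified to a single shared MLP over all experts."""
    def __init__(self, d_model: int, d_router: int, n_experts: int, eda_decay: float = 0.9):
        super().__init__()
        self.down = nn.Linear(d_model, d_router)   # downproject the residual stream
        self.eda_decay = eda_decay
        self.mlp = nn.Sequential(                  # 3-layer MLP producing expert scores
            nn.Linear(d_router, d_router), nn.SiLU(),
            nn.Linear(d_router, d_router), nn.SiLU(),
            nn.Linear(d_router, n_experts),
        )

    def forward(self, x, eda_state=None):
        h = self.down(x)
        if eda_state is not None:                  # EDA: mix routing info from earlier layers
            h = self.eda_decay * eda_state + (1 - self.eda_decay) * h
        logits = self.mlp(h)                       # per-expert routing logits
        return logits, h                           # h is passed on as the next layer's EDA state
```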
Training recipe:
- 14T tokens total
- 3 phases: web-heavy pretrain → math/code/structured-heavy phase → long-context + reasoning mid-train
- Curriculum shifts towards dense STEM + reasoning data over time
- Context extension from 4k → 32k via context-parallel CCA
Our cluster, hosted by @IBMcloud, comprises 128 compute nodes, each containing:
- 8 MI300X GPUs interconnected with InfinityFabric
- 8 Pollara 400Gbps inter-node interconnects
- 2 Intel Xeon Platinum 8570 CPUs
Nodes are connected in a two-level rails-only topology.
We carried out co-design to reduce training time:
- Kernels for RMSNorm + Muon’s Newton-Schulz iteration (sketch below)
- Aegis, our automated fault-tolerance system to ensure high uptime
- Distributed checkpointing and reshaping
- Novel parallelism schemes for CP and distributed Muon
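For reference, a plain-PyTorch sketch of the Newton-Schulz orthogonalization step that Muon relies on; the fused kernel mentioned above accelerates this on-GPU. The coefficients follow the commonly used quintic variant, and the step count and names are illustrative.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a gradient matrix, as Muon does before
    applying its update. Plain-PyTorch sketch of the step the fused kernel
    speeds up; quintic coefficients are the widely used ones."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g.float()
    x = x / (x.norm() + 1e-7)             # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:                         # iterate on the smaller Gram matrix
        x = x.T
    for _ in range(steps):
        xxt = x @ x.T
        x = a * x + (b * xxt + c * xxt @ xxt) @ x
    return (x.T if transposed else x).to(g.dtype)
```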
ZAYA1-base performs strongly against models of comparable scale, giving us a solid foundation for subsequent post-training.
Despite activating only 760M parameters, ZAYA1-base outperforms dense models such as Llama-3-8B and is competitive with Qwen3-4B and Gemma3-12B on mathematics and coding benchmarks. At high pass@k, the base model approaches the performance of specialized reasoning models.