In collaboration with @AMD and @IBM, we @ZyphraAI are sharing ZAYA1-base! It is the first large-scale model trained on an integrated AMD hardware, software, and networking stack. ZAYA1 uses Zyphra’s novel MoE architecture with 760M active and 8.3B total params.
Tech paper and more below👇

PR:
Technical Blog:
Technical Paper:
Hugging Face:
Architecturally, ZAYA1 follows our “MoE++” recipe:
- Compressed Convolutional Attention (CCA)
- New ZAYA1 router
- Per-layer residual scaling with learned gates (sketched below)
These give better scaling curves (per FLOP and per parameter) than standard MoE.
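
For intuition, here is a minimal PyTorch sketch of per-layer residual scaling with a learned gate. The single-scalar gate below is an assumption for illustration; ZAYA1's exact parameterization is described in the technical paper.

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Residual block whose branch output is scaled by a learned per-layer gate.

    Illustrative sketch only: we assume one learned scalar gate per layer,
    initialized to 1.0; the real ZAYA1 gating is described in the paper.
    """
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer                  # e.g. attention or MoE MLP block
        self.gate = nn.Parameter(torch.ones(1))   # learned residual scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard residual connection, but the branch is rescaled by the learned gate.
        return x + self.gate * self.sublayer(x)
```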

The ZAYA1 router replaces the traditional linear router with a design that (rough sketch after this list):
- Down-projects the residual stream
- Applies Exponential Depth Averaging (EDA) to mix information across layers
- Applies a 3-layer MLP per expert
- Uses a control-theory-inspired balancing scheme to keep experts both busy and specialized
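
Here is one way those pieces could compose, as a hedged PyTorch sketch; the dimensions, EDA decay factor, and the way the balancing controller plugs in are all assumptions, not Zyphra's implementation.

```python
import torch
import torch.nn as nn
from typing import Optional, Tuple

class Zaya1StyleRouter(nn.Module):
    """Illustrative ZAYA1-style router sketch (not the actual implementation).

    Assumptions: a fixed scalar EDA decay, one small 3-layer MLP per expert that
    scores tokens on a down-projected hidden state, and load balancing handled
    elsewhere (e.g. by a feedback controller acting on expert usage statistics).
    """
    def __init__(self, d_model: int, d_router: int, n_experts: int, eda_decay: float = 0.9):
        super().__init__()
        self.down = nn.Linear(d_model, d_router)   # down-project the residual stream
        self.eda_decay = eda_decay                 # Exponential Depth Averaging factor
        self.expert_mlps = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_router, d_router), nn.SiLU(),
                nn.Linear(d_router, d_router), nn.SiLU(),
                nn.Linear(d_router, 1),
            )
            for _ in range(n_experts)
        ])

    def forward(
        self, h: torch.Tensor, eda_state: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        z = self.down(h)                                        # (..., d_router)
        # Mix routing features across depth with an exponential moving average.
        if eda_state is not None:
            z = self.eda_decay * eda_state + (1 - self.eda_decay) * z
        logits = torch.cat([mlp(z) for mlp in self.expert_mlps], dim=-1)  # (..., n_experts)
        return logits, z   # z is carried to the next layer's router as the EDA state
```
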
Training recipe:
- 14T tokens total
- 3 phases: web-heavy pretrain → math/code/structured-heavy phase → long-context + reasoning mid-train
- Curriculum shifts towards dense STEM + reasoning data over time
- Context extension from 4k → 32k via context-parallel CCA (illustrative config below)
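
To make the phase structure concrete, here is one way to express it as a config. The field names and the per-phase context lengths for the first two phases are assumptions, and per-phase token splits are omitted because they are not stated here.

```python
# Illustrative curriculum config for the three phases described above.
# Only the 14T total and the 4k -> 32k context extension are taken from the post;
# everything else (field names, which phases run at 4k) is an assumption.
CURRICULUM = {
    "total_tokens": 14_000_000_000_000,  # 14T tokens across all phases
    "phases": [
        {"name": "pretrain",  "data_emphasis": "web-heavy",                  "context_len": 4_096},
        {"name": "mid_train", "data_emphasis": "math/code/structured-heavy", "context_len": 4_096},
        {"name": "long_ctx",  "data_emphasis": "long-context + reasoning",   "context_len": 32_768},
    ],
}
```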

Our cluster, hosted by @IBMcloud, comprises 128 compute nodes, each containing:
- 8 MI300X GPUs interconnected with InfinityFabric
- 8 Pollara 400Gbps inter-node interconnects
- 2 Intel Xeon Platinum 8570 CPUs
Nodes are connected in a two-level rails-only topology.
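
For a sense of scale, the totals implied by those per-node figures (simple arithmetic, nothing assumed beyond the numbers above):

```python
nodes = 128
gpus_per_node = 8
nics_per_node = 8
nic_gbps = 400

total_gpus = nodes * gpus_per_node                  # 1024 MI300X GPUs in the cluster
per_node_internode_gbps = nics_per_node * nic_gbps  # 3200 Gbps (3.2 Tbps) inter-node bandwidth per node
```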

We co-designed the model and systems stack to reduce training time:
- Kernels for RMSNorm + Muon’s Newton-Schulz iteration (sketched after this list)
- Aegis, our automated fault-tolerance system to ensure high uptime
- Distributed checkpointing and reshaping
- Novel parallelism schemes for CP and distributed Muon
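
For reference, this is what the Muon-related kernel computes: the quintic Newton-Schulz iteration that Muon uses to approximately orthogonalize a 2D update matrix. The plain-PyTorch sketch below uses the coefficients from the public Muon reference implementation; Zyphra's fused kernel computes the same iteration far more efficiently on MI300X.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update matrix, as done inside Muon.

    Plain-PyTorch sketch with the commonly published quintic coefficients;
    not Zyphra's fused kernel.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    if G.size(0) > G.size(1):
        X = X.T
    X = X / (X.norm() + 1e-7)        # bound the spectral norm before iterating
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)
```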

ZAYA1-base performs strongly against models of comparable scale, making it a solid foundation for our subsequent post-training.

Despite only 760M active parameters, ZAYA1-base outperforms dense models such as Llama-3-8B and is competitive with Qwen3-4B and Gemma3-12B on mathematics and coding benchmarks. In high pass@k settings, the base model approaches the performance of specialized reasoning models.
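
For readers unfamiliar with pass@k: it is the probability that at least one of k sampled completions solves the task. The standard unbiased estimator from n samples with c correct (this is the general definition, not a Zyphra-specific metric) is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    (without replacement) from n generations, c of them correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 correct out of 16 samples -> pass@8 ≈ 0.96
print(pass_at_k(n=16, c=4, k=8))
```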
