Today, we release our largest LFM2 model: LFM2-24B-A2B 🐘 > 24B total parameters > 2.3B active per token > Built on our hybrid, hardware-aware LFM2 architecture It combines LFM2’s fast, memory-efficient design with a Mixture of Experts (MoE) setup, so only 2.3B parameters activate per token. The result: best-in-class efficiency, fast edge inference, and predictable log-linear scaling, all in a 32 GB, 2.3B-active MoE footprint. 🧵
With this release, the LFM2 family spans nearly two orders of magnitude: from LFM2-350M to LFM2-24B-A2B. Each step up in scale has brought consistent quality gains on standard benchmarks. We designed LFM2-24B-A2B to fit in 32 GB of RAM, making it runnable on consumer laptops and desktops with an integrated graphics processor (iGPU) or a dedicated neural processing unit (NPU). > LFM2-24B-A2B expands the LFM2 family from 350M → 24B parameters > Nearly two orders of magnitude of scale with consistent, log-linear quality improvements across benchmarks
Scaling recipe: Go deeper. Add experts. Keep the active path lean. We scaled LFM2-24B-A2B by going deeper (24→40 layers) and doubling the experts (32→64 per MoE block), while keeping hidden size (2048), top-4 routing, and a 1:3 attention-to-conv ratio fixed. > Total params grow ~3× (8.3B→24B) > Active params grow only ~1.5× (1.5B→2.3B) Inference cost tracks the active path (not total parameter count), keeping latency and energy aligned with real-world deployment constraints. Capacity scales. Per-token compute stays lean.
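The "only the active path runs" idea boils down to top-k gating. Here is a minimal sketch of a generic top-k MoE router in plain Python: the 64-expert / top-4 shape comes from this thread, but the router itself is illustrative, not LFM2's actual code.

```python
import math
import random

def topk_route(logits, k=4):
    """Pick the top-k experts for one token and softmax their gate scores.

    Generic top-k MoE gating sketch: only the k selected experts run,
    so per-token compute tracks the active path, not the total expert count.
    """
    # Indices of the k largest router logits (the experts that fire).
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over just the selected logits -> per-expert mixing weights.
    m = max(logits[i] for i in idx)
    exps = [math.exp(logits[i] - m) for i in idx]
    total = sum(exps)
    gates = [e / total for e in exps]
    return idx, gates

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(64)]  # router scores over 64 experts
idx, gates = topk_route(logits)
print(len(idx), round(sum(gates), 6))  # 4 experts fire; gate weights sum to 1
```

With 64 experts per block but only 4 selected per token, total parameters can grow much faster than active parameters, which is the asymmetry the tweet's 3× vs ~1.5× numbers reflect.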
We shipped this as a traditional instruct model (no reasoning traces) using lightweight post-training. Across: > GPQA Diamond > MMLU-Pro > IFEval > IFBench > GSM8K > MATH-500 Quality improves log-linearly from 350M → 24B. This nearly 100× parameter range confirms the predictable scaling behavior of the hybrid LFM2 architecture, with no small-model ceiling effect.
LFM2-24B-A2B ships with day-zero support in llama.cpp, vLLM, and SGLang, running on CPU or GPU out of the box, with GGUF quantizations (Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16). On CPU (AMD Ryzen AI Max+ 395, Q4_K_M), it sustains ~93 tok/s at 8K context, outperforming similarly sized MoE models while maintaining strong long-context scaling.
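Why a 32 GB budget works: a rough back-of-envelope for the listed GGUF quants. The bits-per-weight figures below are approximate averages for each quant type, not exact llama.cpp numbers; real files also carry metadata, and runtime needs KV-cache headroom on top.

```python
# Back-of-envelope GGUF size estimates for a 24B-parameter model.
# Bits-per-weight values are rough averages per quant type (assumed, not exact).
PARAMS = 24e9
BITS_PER_WEIGHT = {
    "Q4_0": 4.5, "Q4_K_M": 4.8, "Q5_K_M": 5.7,
    "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0,
}
for quant, bits in BITS_PER_WEIGHT.items():
    gb = PARAMS * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{quant:7s} ~{gb:5.1f} GB")
```

Under these assumptions Q4_K_M lands around ~14 GB of weights, leaving comfortable room inside 32 GB of RAM for KV cache and the rest of the system, while F16 at ~48 GB would not fit.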
On CPU (AMD Ryzen AI Max+ 395, Q4_K_M, llama.cpp), LFM2-24B-A2B sustains strong prefill throughput across 1K→8K contexts (~1,132 tok/s at 8K), remaining competitive with similarly sized MoE models. On GPU (H100 SXM5, SGLang/vLLM), it shows favorable output-throughput scaling under realistic high-concurrency serving, which is critical for cost-efficient deployment and RLVR workloads.
On GPU (H100 SXM5, vLLM), LFM2-24B-A2B scales to ~26.8K tok/s total throughput at 1024 concurrent requests (1024 max input tokens / 512 max output tokens), outperforming similarly sized MoE models under continuous batching. Measured with realistic interleaved prefill+decode, and built for production-scale serving and RL workloads.
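What the aggregate number means per user: dividing the headline throughput across the batch. This is an illustrative average using only the figures from this thread, not a benchmark script; real per-request rates vary as requests enter and leave the batch.

```python
# Serving math from the thread's numbers:
# ~26.8K tok/s aggregate at 1024 concurrent requests, 512 max output tokens.
total_tok_s = 26_800
concurrency = 1024
max_output = 512

per_request = total_tok_s / concurrency  # average decode rate per stream
batch_time = max_output / per_request    # seconds to drain a full 512-token batch
print(f"~{per_request:.1f} tok/s per request, ~{batch_time:.0f} s per full batch")
```

Even at full saturation, each of the 1024 streams still averages a usable interactive decode rate, which is the point of quoting throughput under continuous batching rather than single-stream latency.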