Today, we release our largest LFM2 model: LFM2-24B-A2B 🐘
> 24B total parameters
> 2.3B active per token
> Built on our hybrid, hardware-aware LFM2 architecture
It combines LFM2’s fast, memory-efficient design with a Mixture of Experts (MoE) setup, so only 2.3B parameters are active per token.
The result: best-in-class efficiency, fast edge inference, and predictable log-linear scaling, all in a 32 GB, 2.3B-active MoE footprint.
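The sparse-activation idea behind "only 2.3B active" can be sketched in a few lines: a router scores every expert for each token, but only the top-k experts actually run. A minimal NumPy sketch; the expert count and top-4 routing match the thread's figures, while the tiny dimensions and random weights are purely illustrative, not the real model:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 16, 64, 4    # top-4 of 64 experts, as in the thread
x = rng.standard_normal(d_model)          # one token's hidden state
W_router = rng.standard_normal((n_experts, d_model))

# The router scores every expert, but only the top-k are executed.
logits = W_router @ x
topk = np.argsort(logits)[-top_k:]                          # indices of the 4 winners
gates = np.exp(logits[topk]) / np.exp(logits[topk]).sum()   # renormalized softmax

# Each "expert" here is a toy linear layer; only 4 of the 64 ever touch the token.
experts = rng.standard_normal((n_experts, d_model, d_model))
y = sum(g * (experts[i] @ x) for g, i in zip(gates, topk))

print(f"experts run: {len(topk)} / {n_experts}")
```

The per-token cost is set by the 4 executed experts, not the 64 stored ones, which is exactly why active parameters, not total parameters, drive latency.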
🧵

With this release, the LFM2 family spans nearly two orders of magnitude: from LFM2-350M to LFM2-24B-A2B. Each step up in scale has brought consistent quality gains on standard benchmarks.
We designed LFM2-24B-A2B to fit in 32 GB of RAM, making it runnable on consumer laptops and desktops with an integrated graphics processor (iGPU) or a dedicated neural processing unit (NPU).
> LFM2-24B-A2B expands the LFM2 family from 350M → 24B parameters
> Nearly two orders of magnitude of scale with consistent, log-linear quality improvements across benchmarks
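Why 24B parameters can fit in 32 GB is back-of-envelope arithmetic: weight memory ≈ params × effective bits-per-weight / 8. The bits-per-weight values below are rough approximations for common llama.cpp quantization formats (my assumption, not official figures), and they exclude KV cache and runtime overhead:

```python
# Rough weight-memory estimate: params * effective bits-per-weight / 8.
# Bits-per-weight values are approximations, not exact format specs.
TOTAL_PARAMS = 24e9

bits_per_weight = {"F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q4_K_M": 4.8}

for name, bits in bits_per_weight.items():
    gb = TOTAL_PARAMS * bits / 8 / 1e9
    print(f"{name:7s} ~{gb:5.1f} GB of weights")

# F16 weights alone (~48 GB) overflow 32 GB of RAM; Q8_0 and below fit,
# leaving headroom for KV cache and activations.
```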
Scaling recipe: Go deeper. Add experts. Keep the active path lean.
We scaled LFM2-24B-A2B by going deeper (24→40 layers) and doubling experts (32→64 per MoE block), while keeping hidden size (2048), top-4 routing, and a 1:3 attention:conv ratio fixed.
> Total params grow 3× (8.3B→24B)
> Active params only grow ~1.5× (1.5B→2.3B)
Inference cost tracks the active path (not total parameter count), keeping latency and energy aligned with real-world deployment constraints.
Capacity scales. Per-token compute stays lean.
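The "capacity scales, per-token compute stays lean" claim is just the ratio of these growth factors, spelled out with the figures from the bullets above:

```python
# Growth factors from the thread's figures (previous LFM2 MoE -> LFM2-24B-A2B).
total_before, total_after = 8.3e9, 24e9     # total parameters
active_before, active_after = 1.5e9, 2.3e9  # active parameters per token

total_growth = total_after / total_before     # ~2.9x model capacity
active_growth = active_after / active_before  # ~1.5x per-token compute

print(f"total:  {total_growth:.1f}x")
print(f"active: {active_growth:.1f}x")
print(f"active share: {active_after / total_after:.1%} of weights touched per token")
```

Capacity roughly triples while the per-token compute path grows by only half, so under 10% of the weights are touched on any given token.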

We shipped this as a traditional instruct model (no reasoning traces) using lightweight post-training.
Across:
> GPQA Diamond
> MMLU-Pro
> IFEval
> IFBench
> GSM8K
> MATH-500
Quality improves log-linearly from 350M → 24B.
This nearly 100× parameter range confirms the predictable scaling behavior of the hybrid LFM2 architecture, with no small-model ceiling effect.
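"Log-linear" here means quality rises roughly linearly in log(parameters). A sketch of how to check that claim from benchmark averages; the intermediate model sizes are illustrative and the scores are made-up placeholders, not actual LFM2 results:

```python
import numpy as np

# Placeholder average benchmark scores per model size (NOT real LFM2 numbers).
params = np.array([350e6, 700e6, 1.2e9, 2.6e9, 8.3e9, 24e9])
scores = np.array([31.0, 36.0, 40.0, 45.0, 52.0, 58.0])

# Fit score = a * log10(params) + b; a high R^2 indicates log-linear scaling.
a, b = np.polyfit(np.log10(params), scores, 1)
pred = a * np.log10(params) + b
r2 = 1 - ((scores - pred) ** 2).sum() / ((scores - scores.mean()) ** 2).sum()

print(f"slope: {a:.1f} points per 10x params, R^2 = {r2:.3f}")
```

A near-1 R² on a log-x fit is what "predictable log-linear scaling" means operationally: each 10× in parameters buys a roughly constant number of benchmark points.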

LFM2-24B-A2B ships with day-zero support across llama.cpp, vLLM, and SGLang (CPU or GPU out of the box), with GGUF quantizations (Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16).
On CPU (AMD Ryzen AI Max+ 395, Q4_K_M), it sustains ~93 tok/s decode at an 8K context, outperforming similarly sized MoE models while maintaining strong long-context scaling.

On CPU (AMD Ryzen AI Max+ 395, Q4_K_M, llama.cpp), LFM2-24B-A2B sustains strong prefill throughput across 1K→8K contexts (~1,132 tok/s at 8K), remaining competitive with similarly sized MoE models.
On GPU (H100 SXM5, SGLang/vLLM), it demonstrates favorable output throughput scaling under realistic high-concurrency serving, critical for cost-efficient deployment and RLVR workloads.

On GPU (H100 SXM5, vLLM), LFM2-24B-A2B scales to ~26.8K total tok/s at 1024 concurrent requests (1024 max input tokens / 512 max output tokens), outperforming similarly sized MoE models under continuous batching.
Measured with realistic interleaved prefill+decode — built for production-scale serving and RL workloads.
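A quick sanity check on what that aggregate number means per user, using the thread's figures; the per-stream rate is a rough average over the interleaved prefill+decode mix, not a guaranteed decode speed:

```python
# Aggregate serving throughput from the H100/vLLM benchmark in the thread.
total_tok_s = 26_800   # ~26.8K total tok/s
concurrency = 1024     # concurrent requests

per_request = total_tok_s / concurrency
print(f"~{per_request:.1f} tok/s per concurrent stream")
```

Even at full 1024-way concurrency, each stream still averages on the order of 26 tok/s, which is why the aggregate figure matters for cost-efficient serving.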
