Qwen3-Coder-Next-8bit EXO Benchmark Analysis on M3 Ultra

1. Core Data: M3 Ultra (512GB RAM) Distributed Inference

Hardware configuration:
• Single node: Apple M3 Ultra, 512GB RAM (32 CPU cores, 80 GPU cores)
• Dual node: 2 × M3 Ultra (1024GB RAM aggregate)
• Model: Qwen3-Coder-Next-8bit (8B parameters, quantized)

Performance benchmark (tokens/s)
2. Key Information

1) Prompt processing scaling with node count
• 0.5K-8K context: a single node already peaks (60 t/s); dual nodes actually regress (-3%)
  • Reason: distributed communication overhead exceeds the compute speedup
  • Conclusion: small contexts do not need distribution
• 16K-64K context: dual nodes start to pay off (+2% to +6%)
  • Reason: the KV cache needs more memory, so a single node becomes the bottleneck
  • Conclusion: distributed inference is worthwhile at large context

2) Generation performance trends
• Small model (8B) + small context (<32K): generation is slow
• Large context (≥32K): performance begins to improve
• Reason: an 8B model exerts little compute pressure; the bottleneck is memory bandwidth and the KV cache

3) Importance of the /bench API
• Standard OpenAI-compatible endpoint: caching is enabled by default, which skews test results
• /bench API: non-streaming, returns server-side measurement stats (accurate)
• Key finding: distributed-inference tests must use /bench, otherwise the data is invalid
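The /bench workflow above can be sketched as follows. The endpoint path comes from the text, but everything else, the model id, the request payload, and the response field names (`generation_tokens`, `generation_seconds`), is an assumption for illustration, not EXO's documented schema:

```python
# Hedged sketch: benchmarking against a hypothetical EXO /bench endpoint
# instead of the streaming, cache-enabled OpenAI-compatible endpoint.
import json
from urllib import request


def bench_request(host: str, prompt: str, max_tokens: int = 256) -> dict:
    """POST one benchmark run. /bench is non-streaming and returns
    server-side measurements, avoiding client-side timing noise and
    the default prompt cache that skews the OpenAI-style endpoint."""
    payload = json.dumps({
        "model": "qwen3-coder-next-8bit",  # assumed model id
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()
    req = request.Request(f"{host}/bench", data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)


def tokens_per_second(stats: dict) -> float:
    """Derive generation throughput from assumed response fields."""
    return stats["generation_tokens"] / stats["generation_seconds"]
```

A run would then look like `tokens_per_second(bench_request("http://localhost:52415", long_prompt))`, with the throughput computed from the server's own counters rather than wall-clock streaming latency.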
3. Comparison with Qwen3.5-35B
4. Technical Conclusions

Value intervals for distributed inference:
• Small context (<8K): a single node is optimal; dual nodes regress (communication overhead)
• Large context (≥32K): dual nodes start to pay off, +6% at 64K
• 128K+ context: multiple nodes are required (the test hit a failure where 1115KB gossipsub messages were too large)

Qwen3-Coder-Next-8bit vs Qwen3.5-35B:
5. Bottlenecks of EXO
• 128K context test failed: gossipsub message too large (1115KB); the node had to be restarted
• Issue: the network layer limits the scalability of distributed inference
• Solution: optimize message fragmentation or switch to another communication protocol
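One possible shape for the fragmentation fix is sketched below. This is illustrative only, not EXO's code; the 1 MiB cap is an assumed gossipsub-style message limit, chosen because it is consistent with a 1115KB payload failing to send:

```python
# Illustrative sketch: split a payload that exceeds a pubsub message cap
# into numbered fragments, then reassemble on the receiving side.
MAX_MSG_BYTES = 1 << 20  # 1 MiB; assumed gossipsub-style limit
HEADER_BYTES = 64        # room reserved for fragment metadata


def fragment(payload: bytes, msg_id: str):
    """Yield fragments small enough to fit under the message cap."""
    chunk = MAX_MSG_BYTES - HEADER_BYTES
    total = (len(payload) + chunk - 1) // chunk  # ceiling division
    for seq in range(total):
        yield {"id": msg_id, "seq": seq, "total": total,
               "data": payload[seq * chunk:(seq + 1) * chunk]}


def reassemble(fragments) -> bytes:
    """Reorder fragments by sequence number and concatenate them."""
    parts = sorted(fragments, key=lambda f: f["seq"])
    if len(parts) != parts[0]["total"]:
        raise ValueError("missing fragment")
    return b"".join(f["data"] for f in parts)
```

Under this scheme the 1115KB message from the failed test would travel as two fragments instead of one oversized publish, so no node restart would be needed.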
6. Economic Comparison

Option A: M3 Ultra 512GB (single node)
• Cost: $2,000-3,000
• Performance: 60 t/s (<8K) → 48 t/s (64K)
• Best for: large context (≥32K) where a single node suffices

Option B: M3 Ultra × 2 (dual node)
• Cost: $4,000-6,000
• Performance: 59-51 t/s (+6% vs single node, at 64K context only)
• Best for: very large context (≥128K) where a single node runs out of memory

Option C: RTX 3090 (single card)
• Cost: $800-1,000 (used)
• Performance: 112 t/s (fixed, Qwen3.5-35B)
• Best for: small context (<64K); the most economical choice
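A quick sanity check of the three options' capital cost per sustained token/s, using midpoint prices and the throughput figures quoted above at each option's intended operating point:

```python
# Capital cost per token/s, from the figures in the comparison above.
def dollars_per_tps(cost_usd: float, tps: float) -> float:
    """Hardware cost divided by sustained tokens/s at the target context."""
    return cost_usd / tps


# Midpoint prices; t/s at 64K context for A/B, small context for C.
options = {
    "A: M3 Ultra x1 ($2500, 48 t/s @64K)": dollars_per_tps(2500, 48),
    "B: M3 Ultra x2 ($5000, 51 t/s @64K)": dollars_per_tps(5000, 51),
    "C: RTX 3090    ($900, 112 t/s)":      dollars_per_tps(900, 112),
}
for name, value in options.items():
    print(f"{name}: ${value:.1f} per t/s")
```

The arithmetic makes the trade-off explicit: the 3090 is by far the cheapest per token/s where it fits, and the dual-node setup nearly doubles the cost per token/s versus a single M3 Ultra, so it only makes sense when memory (not throughput) forces the second node.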
7. 📌 Core Conclusions

1) Qwen3-Coder-Next-8bit suits large-context (≥32K) distributed inference
• Advantage: scales to very large contexts via multi-node aggregate memory
• Disadvantage: small-context performance trails a single GPU, and the ROI cycle is long

2) Qwen3.5-35B (RTX 3090) suits small-context (<64K) budget inference
• Advantage: 112 t/s high throughput, ROI payback in about 6 months
• Disadvantage: single-card limit (24GB VRAM); cannot scale to 128K+

3) EXO's distributed inference still has bottlenecks
• Issue: gossipsub messages too large (1115KB), forcing node restarts
• Solution: optimize the network layer or switch to a different communication protocol
8. Investment Priorities

The Mac Studio M5 (with an M5 Ultra chip) is expected in March-June 2026. Relative to the M3 Ultra, the M5 Ultra's prompt processing (TTFT) is projected to be 2-4× faster, with generation speed (tokens/s) up roughly 20-30% (memory bandwidth rising above the current 800GB/s, combined with a Neural Accelerator per GPU core). For quantized models like the Qwen series, the M5 Ultra may support larger contexts (64K+ tokens) at higher benchmark throughput (e.g., 150+ tok/s on large MoE models). With hardware cost roughly unchanged (about $4,000 and up) but performance improved, the ROI payback is expected to shorten to 8-12 months, making it the stronger recommendation for high-intensity AI development.