Combining an NVIDIA DGX Spark with an Apple M3 Ultra Mac Studio for 4x faster LLM inference using EXO.

DGX Spark: 128GB @ 273GB/s, 100 TFLOPS (fp16)
M3 Ultra Mac Studio: 512GB @ 819GB/s, 26 TFLOPS (fp16)

The DGX Spark has ~4x the FLOPS of the M3 Ultra, but about a third of its memory bandwidth. By combining the two devices and carefully overlapping computation with network communication (over 10GbE), we got a 4x performance increase.

How? LLM inference consists of two stages: prefill and decode. Prefill is compute-bound and gets faster with more FLOPS. Decode is memory-bound and gets faster with more memory bandwidth. By running the compute-bound prefill on the DGX Spark and the memory-bound decode on the M3 Ultra, we achieved a 4x speedup on prefill compared to the M3 Ultra Mac Studio alone and a 3x speedup on generation compared to the DGX Spark alone.

More details in the blog post below.
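
To make the compute-bound vs memory-bound intuition concrete, here is a rough back-of-the-envelope sketch using the specs above. The model size, weight precision, and prompt length are illustrative assumptions (not numbers from the post), but the ratios fall out of the hardware specs: prefill time tracks FLOPS (~4x in the Spark's favor) and per-token decode time tracks memory bandwidth (~3x in the Mac Studio's favor).

```python
# Rough roofline-style estimate of prefill vs decode time on each device.
# MODEL_PARAMS, BYTES_PER_PARAM and PROMPT_TOKENS are illustrative assumptions.

MODEL_PARAMS = 70e9          # assumed 70B-parameter model
BYTES_PER_PARAM = 1          # assumed 8-bit weights
PROMPT_TOKENS = 4096         # assumed prompt length

DEVICES = {
    # name: (fp16 TFLOPS, memory bandwidth in GB/s)
    "DGX Spark": (100, 273),
    "M3 Ultra":  (26, 819),
}

for name, (tflops, bw_gbps) in DEVICES.items():
    # Prefill: ~2 FLOPs per parameter per prompt token.
    # Compute-bound, so time scales with 1 / FLOPS.
    prefill_flops = 2 * MODEL_PARAMS * PROMPT_TOKENS
    prefill_s = prefill_flops / (tflops * 1e12)

    # Decode: each generated token streams all weights from memory once.
    # Memory-bound, so time per token scales with 1 / bandwidth.
    decode_bytes = MODEL_PARAMS * BYTES_PER_PARAM
    decode_ms_per_token = decode_bytes / (bw_gbps * 1e9) * 1e3

    print(f"{name:10s}  prefill ~ {prefill_s:5.1f} s   "
          f"decode ~ {decode_ms_per_token:5.1f} ms/token")
```

Under these assumptions the Spark finishes prefill roughly 4x faster and the Mac Studio generates tokens roughly 3x faster, which is why splitting the two stages across the devices (and overlapping the KV-cache transfer with compute) pays off.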