GPUs are unreliable at scale. At @modal we've scaled to 20,000+ concurrent GPUs across AWS, GCP, Azure, and OCI, with 1M+ instances launched. Public-cloud GPUs fail in many ways, and we’ve seen most of them. Here’s how we handle GPU reliability 👇