binary searched a 512xh100 cluster (european cloud provider) to find the node killing our all-reduce. narrowed 64 nodes down to one bad node. bandwidth jumped from 50GB/s back to 157GB/s after the swap... time to automate this