Created by: stephenroller
Patch Description: The PyTorch TCP rendezvous appears to fail when there are minor network blips, which makes it hard to get past initialization at sufficient scale. This patch introduces a custom rendezvous method that copies much of the PyTorch code and wraps it in a simple retry loop.
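For reviewers, here is a minimal sketch of the retry idea, not the code in this patch. The helper name `retry_rendezvous`, its parameters, the exponential backoff, and the set of exception types caught are all assumptions made for illustration; `connect` stands in for whatever callable performs the copied TCP rendezvous.

```python
import time


def retry_rendezvous(connect, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds or max_retries attempts are used up.

    Hypothetical helper: `connect` is any zero-argument callable that performs
    the rendezvous and raises on a transient network failure.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            return connect()
        except (ConnectionError, TimeoutError, RuntimeError) as err:
            # Assumed set of transient errors; the real patch may catch others.
            last_error = err
            if attempt + 1 < max_retries:
                # Back off exponentially before the next attempt.
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        f"rendezvous failed after {max_retries} attempts"
    ) from last_error
```

With `max_retries=5` and `base_delay=1.0`, a rendezvous that fails twice and then succeeds would wait 1 second and then 2 seconds before returning the result of the third attempt. Passing `sleep` as a parameter is only there so the loop can be exercised in tests without real delays.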
Testing steps: Ongoing