Summary of Changes
This all-reduce call currently fails for multi-node jobs launched without Slurm because GPU devices are not set correctly when we initialize distributed groups:
```
Traceback (most recent call last):
  File "/home/binhtang/src/metaseq/metaseq/scripts/interactive.py", line 66, in <module>
    distributed_utils.call_main(cfg, main)
  File "/home/binhtang/src/metaseq/metaseq/distributed/utils.py", line 289, in call_main
    return distributed_main(
  File "/home/binhtang/src/metaseq/metaseq/distributed/utils.py", line 222, in distributed_main
    cfg.distributed_training.distributed_rank = distributed_init(cfg)
  File "/home/binhtang/src/metaseq/metaseq/distributed/utils.py", line 157, in distributed_init
    dist.all_reduce(torch.zeros(1).cuda())
  File "/home/binhtang/.conda/envs/metaseq/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1534, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1666642975993/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 5 and rank 0 both on CUDA device 101c0
```
To fix this, we set the device ID using the `LOCAL_RANK` environment variable (see this documentation).
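The idea, as a minimal sketch rather than the actual `distributed_init` change (the function name below is illustrative):

```python
import os

import torch
import torch.distributed as dist


def init_with_local_rank():
    """Pin each process to the GPU given by LOCAL_RANK before running any
    NCCL collective, so ranks on the same node use distinct devices."""
    # torchrun exports LOCAL_RANK for every process it spawns on a node.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # With the device pinned, NCCL no longer maps every local rank to cuda:0.
    dist.init_process_group(backend="nccl")

    # The sanity-check all-reduce from the traceback above now succeeds.
    dist.all_reduce(torch.zeros(1).cuda())


if __name__ == "__main__":
    init_with_local_rank()
```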
Test Plan
- Launch multi-node jobs successfully without Slurm on AWS:

```
NCCL_SOCKET_IFNAME=ens32 torchrun --nnodes 2 --node_rank 0 --nproc_per_node 8 --master_addr 172.31.25.180 --master_port 29600 metaseq/scripts/interactive.py --merges-filename /data/checkpoints/gpt2-merges.txt --vocab-filename /data/checkpoints/gpt2-vocab.json --hf-tokenizer /data/checkpoints/gpt2-unified.json --path /path/to/checkpoint/reshard.pt --model-parallel-size 16 --distributed-world-size 16 --ddp-backend fully_sharded --use-sharded-state --beam 1 --max-source-positions 4 --max-target-positions 128
```
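(On the second node, the same command is assumed to be launched with `--node_rank 1` and the same `--master_addr`/`--master_port`, so that all 16 ranks join the group.)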