Summary of Changes
This all-reduce call currently fails for multi-node jobs launched without Slurm because GPU devices are not set correctly when we initialize distributed groups:
```
Traceback (most recent call last):
  File "/home/binhtang/src/metaseq/metaseq/scripts/interactive.py", line 66, in <module>
    distributed_utils.call_main(cfg, main)
  File "/home/binhtang/src/metaseq/metaseq/distributed/utils.py", line 289, in call_main
    return distributed_main(
  File "/home/binhtang/src/metaseq/metaseq/distributed/utils.py", line 222, in distributed_main
    cfg.distributed_training.distributed_rank = distributed_init(cfg)
  File "/home/binhtang/src/metaseq/metaseq/distributed/utils.py", line 157, in distributed_init
    dist.all_reduce(torch.zeros(1).cuda())
  File "/home/binhtang/.conda/envs/metaseq/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1534, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1666642975993/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 5 and rank 0 both on CUDA device 101c0
```
To fix this, we set the device ID using the `LOCAL_RANK` environment variable (see this documentation).
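The idea, as a minimal sketch rather than the actual `distributed_init` change (the function name below is illustrative):

```python
import os

import torch
import torch.distributed as dist


def init_with_local_rank():
    """Pin each process to the GPU given by LOCAL_RANK before running any
    NCCL collective, so ranks on the same node use distinct devices."""
    # torchrun exports LOCAL_RANK for every process it spawns on a node.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # With the device pinned, NCCL no longer maps every local rank to cuda:0.
    dist.init_process_group(backend="nccl")

    # The sanity-check all-reduce from the traceback above now succeeds.
    dist.all_reduce(torch.zeros(1).cuda())


if __name__ == "__main__":
    init_with_local_rank()
```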
Test Plan
- Launch multi-node jobs successfully without Slurm on AWS:

```
NCCL_SOCKET_IFNAME=ens32 torchrun --nnodes 2 --node_rank 0 --nproc_per_node 8 --master_addr 172.31.25.180 --master_port 29600 metaseq/scripts/interactive.py --merges-filename /data/checkpoints/gpt2-merges.txt --vocab-filename /data/checkpoints/gpt2-vocab.json --hf-tokenizer /data/checkpoints/gpt2-unified.json --path /path/to/checkpoint/reshard.pt --model-parallel-size 16 --distributed-world-size 16 --ddp-backend fully_sharded --use-sharded-state --beam 1 --max-source-positions 4 --max-target-positions 128
```
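(On the second node, the same command is assumed to be launched with `--node_rank 1` and the same `--master_addr`/`--master_port`, so that all 16 ranks join the group.)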