Created by: tangbinh
Patch Description
Currently, rank 0 attempts to save a YAML config (i.e. config.yml) before the checkpoint directory (i.e. cfg.checkpoint.save_dir) is created in verify_checkpoint_directory. The failure is sporadic because other ranks can get ahead of rank 0 and create the directory before rank 0 tries to save the YAML. However, with a smaller world size and pdb statements, the race condition becomes a problem in practice.
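To illustrate the ordering issue, here is a minimal sketch (not the metaseq code itself) of writing a config only after the target directory is guaranteed to exist. The helper name save_config_safely and its parameters are hypothetical and chosen for illustration; the actual patch may instead reorder the existing calls.

```python
import os
import tempfile


def save_config_safely(save_dir: str, config_text: str, filename: str = "config.yml") -> str:
    """Hypothetical helper: create save_dir if needed, then write the config into it."""
    # exist_ok=True keeps this call safe when another rank has already
    # created the directory, so the result does not depend on rank ordering.
    os.makedirs(save_dir, exist_ok=True)
    path = os.path.join(save_dir, filename)
    with open(path, "w") as f:
        f.write(config_text)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        save_dir = os.path.join(root, "checkpoints", "run0")
        # Without the makedirs call above, open() would raise FileNotFoundError
        # because the nested directory does not exist yet.
        print(save_config_safely(save_dir, "seed: 1\n"))
```

Creating the directory inside the same code path that writes the file removes the dependency on which rank reaches the directory check first.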
Testing steps
- Verify that training proceeds as expected after we add from metaseq.pdb import set_trace; set_trace() to line 59 and step over it:
python -m metaseq_internal.projects.zucchini.sweep_baseline -g 8 -n 1 -t 1 --azure --model-size 125m --prefix local-125m --data /data/gpt-z/zucchini/consolidated/v1.0_textonly --local