Created by: Xirider
Currently we have two versions of sweep.py and slurm.py: one is used when an opt_baseline script is run from metaseq, and the other is used with sweep_baseline from metaseq-internal. Maintaining both versions adds unnecessary complexity to the code base and makes testing more difficult.
This PR brings most features of the sweep and slurm files from metaseq-internal to metaseq, in preparation for deleting these files in metaseq-internal.
Here are some notes:
- brought in the tombstone feature
- metaseq-internal had a wrapper around the training script ("train_wrapper.py") that logs some worker info; I moved that logic into the train.py file
- reworked the path logic for the train command in slurm.py so that it now works independently of the user's working directory
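To illustrate the working-directory point above, here is a minimal sketch of the general idea, not the actual slurm.py code: the train script path is resolved against a fixed package root instead of the current directory. The names `resolve_train_script` and `build_train_cmd`, and the default `train.py` location, are hypothetical and chosen only for this example.

```python
import sys
from pathlib import Path
from typing import List


def resolve_train_script(package_root: Path, relative_script: str = "train.py") -> Path:
    """Return an absolute path to the training script.

    The path is anchored at package_root (hypothetical parameter), so the
    result does not depend on the directory the user launched from.
    """
    return (package_root.resolve() / relative_script).resolve()


def build_train_cmd(package_root: Path, extra_args: List[str]) -> List[str]:
    """Build the train command as an argument list with an absolute script path."""
    script = resolve_train_script(package_root)
    return [sys.executable, str(script), *extra_args]
```

In a launcher module, `package_root` would typically be derived from the module's own location (for example `Path(__file__).resolve().parent`), which is what makes the command independent of the caller's working directory.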
Some things were not brought in (e.g. unused flags). I also included a number of other features that I'm not sure are currently used:
- post_cmds
- container_image and container_save
- array_length
- args.dep and args.sequential

Should these stay?
Issue: https://github.com/facebookresearch/metaseq/issues/472
Internal PR: https://github.com/fairinternal/metaseq-internal/pull/558
Testing:
python metaseq/launcher/opt_baselines.py --prefix train.8m --model-size 8m --checkpoints-dir ./test-checkpoint --tensorboard-logdir ./test-checkpoint --num-trials 1 --azure --num-gpus 4 --num-nodes 1 --seed 1 --circleci --local --disable-validation --max-epoch 100 --max-update 100