Unify code path for metaseq and metaseq-internal (!476)

Merged Administrator requested to merge github/fork/Xirider/unify_training_codepaths into main Nov 01, 2022

Created by: Xirider

Currently we have two versions of sweep.py and slurm.py: one is used when an opt_baseline script is run from metaseq, and the other is used with sweep_baseline from metaseq-internal. Maintaining both versions adds unnecessary complexity to the code base and makes testing more difficult.

This PR brings most features of the sweep and slurm files from metaseq-internal into metaseq, in preparation for deleting these files in metaseq-internal.

Here are some notes:

- Brought in the tombstone feature.
- metaseq-internal had a wrapper around the training script ("train_wrapper.py") that logs some worker info; I moved that logic into the train.py file.
- Reworked the path logic for the train command in slurm.py so that it now works independently of the user's working directory.
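
To illustrate the last note, here is a minimal sketch of how a train command can be made independent of the working directory. It is not the code from this PR: the function name build_train_cmd, its parameters, and the example paths are hypothetical. The idea is to anchor the script path at the launcher module's own location rather than at os.getcwd().

```python
import os
import sys


def build_train_cmd(launcher_file, relative_script, python_bin=sys.executable):
    """Return [python, absolute_script_path] for a training script.

    `launcher_file` would typically be `__file__` of the launcher module;
    `relative_script` is the script location relative to that module's
    directory. Both names are illustrative assumptions, not metaseq's API.
    """
    # Directory that contains the launcher module (symlinks resolved).
    launcher_dir = os.path.dirname(os.path.realpath(launcher_file))
    # Join and normalize so that ".." segments are collapsed.
    script_path = os.path.normpath(os.path.join(launcher_dir, relative_script))
    return [python_bin, script_path]


if __name__ == "__main__":
    # The result is the same regardless of where the user runs this from.
    print(build_train_cmd(__file__, os.path.join("..", "cli", "train.py")))
```

Because the script path is derived from the module location, launching from the repository root or from any other directory yields the same command.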

Some things were not brought in (i.e., unused flags). I also included a number of other features without being sure whether they are currently used:

- post_cmds
- container_image and container_save
- array_length
- args.dep and args.sequential

Should these stay?

Issue: https://github.com/facebookresearch/metaseq/issues/472
Internal PR: https://github.com/fairinternal/metaseq-internal/pull/558

Testing:

    python metaseq/launcher/opt_baselines.py --prefix train.8m --model-size 8m --checkpoints-dir ./test-checkpoint --tensorboard-logdir ./test-checkpoint --num-trials 1 --azure --num-gpus 4 --num-nodes 1 --seed 1 --circleci --local --disable-validation --max-epoch 100 --max-update 100