fixed issue 437 and removed retired profiler (!450)

Merged Administrator requested to merge new_profiler into main Oct 25, 2022

Created by: igormolybogFB

Patch Description

The --profile flag in metaseq's train.py was not accessible through opt_baseline.py or sweep_baseline.py, which we normally use. This is because the --profile flag in those scripts was mapped to the --new_profiler flag in train.py rather than to its --profile flag.

Here is how the profiler option gets into the slurm job:

[sweep_baseline.py] --profile is read from the CLI
--new-profile is added to the grid (--profile is not)
[slurm.py] config is produced from the grid
train_cmd is extended from config
srun_cmd is extended from train_cmd (and srun_cmd_str)
srun_cmd_str -> wrapped_cmd -> sbatch_cmd
run_batch(sbatch_cmd) is called
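
The flow above can be sketched roughly as follows. This is a simplified illustration, not the actual metaseq code: `build_grid`, `build_train_cmd`, and the `fixed` switch are hypothetical names, the flag spellings are taken from the list above, and the "after" behavior assumes the fix simply forwards --profile unchanged.

```python
import argparse
import shlex


def build_grid(args, fixed):
    """Hypothetical stand-in for the sweep script's grid construction."""
    grid = []
    if args.profile:
        # Before the fix, the CLI flag was translated into a different
        # train.py flag; after the fix (assumed here) it is passed as-is.
        grid.append("--profile" if fixed else "--new-profile")
    return grid


def build_train_cmd(grid):
    """Hypothetical stand-in for slurm.py extending train_cmd from the config."""
    return ["python", "train.py"] + grid


parser = argparse.ArgumentParser()
parser.add_argument("--profile", action="store_true")
args = parser.parse_args(["--profile"])

before = build_train_cmd(build_grid(args, fixed=False))
after = build_train_cmd(build_grid(args, fixed=True))

# srun_cmd is extended from train_cmd, then flattened to a string.
srun_cmd_str = shlex.join(["srun"] + after)

print(before)
print(after)
print(srun_cmd_str)
```

The point of the sketch is only that whatever the sweep script puts into the grid is what train.py eventually sees, so a renamed flag at the first step silently changes which profiler is enabled.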

Moreover, the --profile flag corresponded to the outdated profiler (torch.autograd.profiler) rather than the new one (torch.profiler). At @ngoyal2707's request, the outdated profiler is removed from our code (in metaseq only), and the behavior of --profile in metaseq-internal and metaseq is brought into line.
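
For readers unfamiliar with the newer API, here is a minimal sketch of torch.profiler on a toy workload. It is not how metaseq wires the profiler into training; the function name `profile_matmul` and the `toy_step` label are made up for this illustration.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function


def profile_matmul(size=64):
    """Profile a single CPU matrix multiply and return the profiler object."""
    a = torch.randn(size, size)
    b = torch.randn(size, size)
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        # record_function adds a named region to the trace.
        with record_function("toy_step"):
            torch.mm(a, b)
    return prof


prof = profile_matmul()
op_names = [evt.key for evt in prof.key_averages()]

# Print the most expensive recorded operations.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```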

In addition, issue 437 is fixed by implementing the suggestion.

Testing steps

Run:

python -m metaseq_internal.projects.zucchini.sweep_baseline -g 2 -n 1 --azure --model-size 125m --data /data/gpt-z/zucchini/consolidated/v1.0 --tokenizer noregex --partition zetta --prefix profile_run --profile