Created by: igormolybogFB
Patch Description
The --profile flag in metaseq train.py was not reachable through opt_baseline.py or sweep_baseline.py, the scripts we normally use. The --profile flag in those scripts was mapped to the --new_profiler flag in train.py rather than to its --profile flag.
Here is how the profiler option currently makes its way into the slurm job:
- [sweep_baseline.py] --profile is read from the CLI
- --new-profile is added to the grid (profile is not)
- [slurm.py] config is produced from the grid
- train_cmd is extended from config
- srun_cmd is extended from train_cmd (and srun_cmd_str)
- srun_cmd_str -> wrapped_cmd -> sbatch_cmd
- run_batch(sbatch_cmd) is called
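
For reviewers who want to see the flow end to end, here is a minimal, self-contained sketch of how a sweep-level flag ends up in the submitted command. It is not the actual sweep_baseline.py or slurm.py code: the helper names (parse_sweep_args, build_grid, build_sbatch_cmd), the grid representation, and the use of sbatch --wrap are assumptions made for illustration. The flag spellings are taken from this description.

```python
import argparse
import shlex


def parse_sweep_args(argv):
    # Hypothetical stand-in for the sweep script's CLI: it only knows --profile.
    parser = argparse.ArgumentParser()
    parser.add_argument("--profile", action="store_true")
    return parser.parse_args(argv)


def build_grid(args, train_flag):
    # The sweep script decides which train.py flag its own --profile maps to.
    # The grid is modelled here as a list of (flag, value) pairs (an assumption).
    grid = []
    if args.profile:
        grid.append((train_flag, None))
    return grid


def build_sbatch_cmd(grid):
    # grid -> train_cmd -> srun_cmd -> srun_cmd_str -> sbatch_cmd
    train_cmd = ["python", "train.py"]
    for flag, value in grid:
        train_cmd.append(flag)
        if value is not None:
            train_cmd.append(str(value))
    srun_cmd = ["srun"] + train_cmd
    srun_cmd_str = " ".join(shlex.quote(part) for part in srun_cmd)
    # Wrapping via sbatch --wrap is an illustrative choice, not the real wrapper.
    return ["sbatch", f"--wrap={srun_cmd_str}"]


args = parse_sweep_args(["--profile"])

# Before this patch: the sweep's --profile is forwarded as --new_profiler.
before = build_sbatch_cmd(build_grid(args, "--new_profiler"))
# After this patch: the sweep's --profile is forwarded as --profile.
after = build_sbatch_cmd(build_grid(args, "--profile"))

print(before[-1])
print(after[-1])
```

The point of the sketch is that train.py only ever sees whatever flag the grid carries, so the mapping in the sweep script alone decides which profiler is enabled.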
Moreover, the --profile flag in train.py corresponds to the outdated profiler (torch.autograd.profiler), not the new one (torch.profiler). Per @ngoyal2707's request, the outdated profiler is removed from our code (in metaseq only), and --profile is brought into line across metaseq-internal and metaseq.
In addition, issue 437 is fixed by implementing the suggestion.
Testing steps run
python -m metaseq_internal.projects.zucchini.sweep_baseline -g 2 -n 1 --azure --model-size 125m --data /data/gpt-z/zucchini/consolidated/v1.0 --tokenizer noregex --partition zetta --prefix profile_run --profile