Created by: igormolybogFB
Patch Description
Made tombstoning work for RSC:
Added dynamic configs:
enabling: pass --dynamic-config-path followed by the path to a JSON file. The file must already exist; it is picked up as the default dynamic config file.
using:
- To change the value of a dynamic configuration, edit it in the dynamic config JSON file and save.
- Allow some time for the change to propagate (the default timeout is 30 sec).
how it works: the dynamic configuration is a dict with a built-in timeout. Accessing a value before the timeout expires returns the value already in memory, with minimal overhead. Accessing a value after the timeout has expired reloads the entire dict from the file and resets the timer.
why it is useful: some computationally expensive logging/profiling is needed only when the training procedure behaves unexpectedly. Dynamic configs make it possible to trigger these operations on demand.
I have added the first dynamic config flag, "force_profile". It turns on profiling at any point during training, not only on step 5, even if cfg.common.profile = False.
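The timeout-and-reload behavior described under "how it works" can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: the class name ReloadingConfig, its get method, and the injectable clock parameter are all hypothetical, and only the 30 sec default and the "force_profile" key come from the description above.

```python
import json
import time


class ReloadingConfig:
    """Hypothetical sketch: values from a JSON file, reloaded after a timeout."""

    def __init__(self, path, timeout=30.0, clock=time.monotonic):
        self._path = path          # JSON file holding the dynamic config
        self._timeout = timeout    # seconds before the file is re-read
        self._clock = clock        # injectable so the timer can be faked in tests
        self._values = {}
        self._reload()

    def _reload(self):
        # Re-read the whole file and reset the timer.
        with open(self._path) as f:
            self._values = json.load(f)
        self._loaded_at = self._clock()

    def get(self, key, default=None):
        # Cheap path: before the timeout, answer from memory.
        # Slow path: after the timeout, reload the entire dict first.
        if self._clock() - self._loaded_at >= self._timeout:
            self._reload()
        return self._values.get(key, default)
```

Under these assumptions, a training loop could call something like cfg.get("force_profile", False) on every step: the file is only touched once per timeout window, so an edit to the JSON file becomes visible within at most one timeout.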
Testing steps
(using metaseq-internal) on Azure, run: python -m metaseq_internal.projects.zucchini.sweep_baseline -g 8 -n 1 --azure --model-size 125m --data /data/gpt-z/zucchini/consolidated/v1.0 --tokenizer noregex --partition zetta --profile --dynamic-config --prefix dcfg_test
Results are in /shared/home/igormolybog/checkpoints/dcfg_test22/ (including traces from mnt/)