Restore last checkpoint from the end of the training runs (!350)

Merged Administrator requested to merge lastchkptfix into main Sep 26, 2022

Created by: ruanslv

After https://github.com/facebookresearch/metaseq/commit/e3ea5070a8c1bae77703aef7fc0f5537bd437963 we stopped storing checkpoints at the end of the runs. Let's bring them back.

I'm repurposing the last_.* checkpoints so that they correspond only to the end of training. In practice, the previous code never stored them, because the "epoch" or "updates" naming would take precedence. Now, if it's the end of the run and we are not at the end of an epoch or at a saving interval, we store the checkpoint using the "last_" naming (assuming the cfg flag is enabled).
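
A minimal sketch of the naming precedence described above. This is not the actual metaseq code: the function name, the parameters (`save_interval_updates`, `save_last`) and the file name patterns are all hypothetical and only illustrate the order of the checks.

```python
from typing import Optional


def checkpoint_name(
    epoch: int,
    num_updates: int,
    *,
    end_of_epoch: bool,
    end_of_training: bool,
    save_interval_updates: int,  # hypothetical: save every N updates, 0 disables
    save_last: bool,             # hypothetical stand-in for the cfg flag
) -> Optional[str]:
    """Return an illustrative checkpoint file name, or None if nothing is saved."""
    # "epoch" naming takes precedence when the step closes an epoch.
    if end_of_epoch:
        return f"epoch_{epoch}.pt"
    # "updates" naming applies when the step lands on a saving interval.
    if save_interval_updates > 0 and num_updates % save_interval_updates == 0:
        return f"updates_{num_updates}.pt"
    # Otherwise, "last_" naming is used only at the very end of the run,
    # and only if the flag is enabled.
    if end_of_training and save_last:
        return f"last_{num_updates}.pt"
    return None


if __name__ == "__main__":
    # Run ends mid-epoch and off-interval: stored with the "last_" naming.
    print(checkpoint_name(2, 1234, end_of_epoch=False, end_of_training=True,
                          save_interval_updates=500, save_last=True))
```

With these assumptions, a run that ends exactly on an epoch boundary or a saving interval keeps its "epoch" or "updates" name, so the "last_" name appears only when neither applies.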

To test: trained a model and checked that the last checkpoint was stored in Azure.