Requeue with restore file and broken checkpoint upload... is broken
Created by: suchenzang
Right now, it seems that if we requeue a job that was started from a restore file, and there have been broken checkpoint uploads since it started, the run simply restarts from scratch while continuing to increment its iteration count.
Go through our checkpointing spaghetti and figure out how to clean this up: https://github.com/facebookresearch/metaseq/blob/ae825b2fa9010ab0406f20d6164ebb058a7e97cf/metaseq/checkpoint_utils.py#L257
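
For discussion, here is a rough sketch of the fallback order a requeued job could follow. It is not the current metaseq logic: `pick_checkpoint_to_load`, `_non_empty`, and the `.pt` suffix filter are hypothetical names and assumptions, and "valid" is approximated here as "a non-empty file", which a real check would likely need to strengthen.

```python
import os
from typing import Callable, Optional


def _non_empty(path: str) -> bool:
    # Assumption: a broken upload shows up as a missing or zero-byte file.
    return os.path.isfile(path) and os.path.getsize(path) > 0


def pick_checkpoint_to_load(
    save_dir: str,
    restore_file: Optional[str],
    is_valid: Callable[[str], bool] = _non_empty,
) -> Optional[str]:
    """Return the checkpoint a requeued job should load, or None to start fresh."""
    # 1. Prefer the newest valid checkpoint written by this run.
    candidates = []
    if os.path.isdir(save_dir):
        for name in os.listdir(save_dir):
            path = os.path.join(save_dir, name)
            if name.endswith(".pt") and is_valid(path):
                candidates.append(path)
    if candidates:
        return max(candidates, key=os.path.getmtime)

    # 2. Otherwise fall back to the original restore file instead of
    #    silently restarting from scratch.
    if restore_file is not None and is_valid(restore_file):
        return restore_file

    # 3. Nothing usable: the caller should treat this as a fresh run.
    return None
```

The point of the sketch is only the ordering: broken checkpoints are skipped, and the restore file stays a fallback after a requeue. It does not address the iteration count, which would need to be reset or restored to match whichever checkpoint is actually loaded.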