Created by: Xirider
Patch Description
Added a new flag max_loss_to_skip_batch that, if set to a maximum acceptable loss value, aborts the iteration before the optimizer step whenever the batch loss exceeds that value.
The loss value used for the comparison is the same one used in the logs; it may or may not differ from the one reported in TensorBoard.
The logic is similar to our skip_gradient_update_on_clip_norm flag, which also skips batches whenever the gradient norm is above the clip value, and to how we handle overflows.
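A minimal sketch of the intended behavior, assuming a standard PyTorch training loop; the toy model, optimizer, data, and the local max_loss_to_skip_batch variable are illustrative placeholders, not the actual implementation:

```python
import torch

# Hypothetical stand-in for the new flag's value (None means disabled).
max_loss_to_skip_batch = 10.0

# Toy model and optimizer, just to make the loop runnable.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step, (x, y) in enumerate([(torch.randn(8, 4), torch.randn(8, 1))] * 3):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()

    # Skip the optimizer step if the (logged) loss exceeds the threshold,
    # mirroring how skip_gradient_update_on_clip_norm and overflow handling
    # abort an iteration before the weights are updated.
    if max_loss_to_skip_batch is not None and loss.item() > max_loss_to_skip_batch:
        print(f"step {step}: loss {loss.item():.3f} above threshold, skipping batch")
        continue

    optimizer.step()
```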
Testing steps
Tested this with our small sweep script. I think our disks are full, so I couldn't test this with a longer run. For testing I increased the loss and checked that batches are skipped correctly.