Created by: sriniiyer
Currently, num_gpus is set to the local value of 8 rather than the actual global number of GPUs. This causes verify-shards to fail, so resuming from a checkpoint fails.
This PR fixes that.
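
To illustrate the mismatch, here is a minimal sketch, not the code from this PR. It assumes the global count is nodes times GPUs per node, and that a shard check expects one shard per global rank. The names `global_gpu_count`, `verify_shards`, `num_nodes`, and `gpus_per_node` are hypothetical and only stand in for whatever the real code uses.

```python
def global_gpu_count(num_nodes: int, gpus_per_node: int) -> int:
    """Total GPUs across the whole job (assumed: same count on every node)."""
    if num_nodes < 1 or gpus_per_node < 1:
        raise ValueError("num_nodes and gpus_per_node must be positive")
    return num_nodes * gpus_per_node


def verify_shards(shard_ranks: list[int], num_gpus: int) -> bool:
    """Hypothetical check: exactly one checkpoint shard per rank in 0..num_gpus-1."""
    return sorted(shard_ranks) == list(range(num_gpus))


# A 4-node job with 8 GPUs per node writes 32 shards.
shards = list(range(global_gpu_count(num_nodes=4, gpus_per_node=8)))

# Checking against the local count (8) rejects the checkpoint;
# checking against the global count (32) accepts it.
local_ok = verify_shards(shards, num_gpus=8)
global_ok = verify_shards(shards, num_gpus=global_gpu_count(4, 8))
```

Under these assumptions, any job with more than one node fails the check whenever num_gpus holds the per-node value, which matches the failure described above.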