Created by: davides
Patch Description
- Add trailing wildcard support to PathManager.ls() to support changes below
- Update checkpoint caching:
  - Remove the double call to get_local_path here which may have been causing a race condition. Passing force=True should get the intended effect. Add a utility to stress test file locking
  - Fix load_checkpoint_to_cpu() to support remote checkpoints when DP>1 (see the stacktrace I got here). I think this only worked before because get_local_path() is a no-op for local paths and the other shard files are already nearby. Updated to ensure we cache all shards locally before attempting to load, using the new wildcard support in PathManager.ls()
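To illustrate the idea of caching every shard before loading, here is a minimal sketch. It is not the PathManager code from this patch: a local directory stands in for remote storage, and ls_with_trailing_wildcard and cache_all_shards are hypothetical names chosen for the example. It assumes a trailing "*" means "every entry in the parent directory whose name starts with the given prefix".

```python
import os
import shutil
import tempfile
from typing import List


def ls_with_trailing_wildcard(path: str) -> List[str]:
    """List entry names; a trailing '*' is treated as a prefix match (assumed semantics)."""
    if not path.endswith("*"):
        return sorted(os.listdir(path))
    parent, pattern = os.path.split(path)
    prefix = pattern[:-1]  # drop the trailing '*'
    return sorted(n for n in os.listdir(parent or ".") if n.startswith(prefix))


def cache_all_shards(remote_prefix: str, cache_dir: str) -> List[str]:
    """Copy every file matching remote_prefix* into cache_dir and return the local paths."""
    parent = os.path.dirname(remote_prefix)
    os.makedirs(cache_dir, exist_ok=True)
    local_paths = []
    for name in ls_with_trailing_wildcard(remote_prefix + "*"):
        dst = os.path.join(cache_dir, name)
        if not os.path.exists(dst):  # skip shards that are already cached
            shutil.copyfile(os.path.join(parent, name), dst)
        local_paths.append(dst)
    return local_paths


if __name__ == "__main__":
    # Stand-in for a remote directory holding two shards and an unrelated file.
    with tempfile.TemporaryDirectory() as remote, tempfile.TemporaryDirectory() as cache:
        for name in ("ckpt-shard0.pt", "ckpt-shard1.pt", "other.txt"):
            with open(os.path.join(remote, name), "w") as f:
                f.write(name)
        print(cache_all_shards(os.path.join(remote, "ckpt"), cache))
```

The point of the sketch is the ordering: all matching shards are made local first, so a loader that opens sibling shard files by name does not depend on them already being nearby.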
Testing steps
Multiple eval runs on OPT 125M:
- remote path + consolidated
- remote path + DP>1
- local path + consolidated
- local path + DP>1
Running the stress test:
python -m tests.file_io.async_download_test
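For readers unfamiliar with this kind of test, the sketch below shows the general shape of a file-locking stress test. It is not the contents of tests.file_io.async_download_test: fetch_once and stress are hypothetical names, the "download" is simulated with a sleep and a local write, and the lock is a plain exclusive lock file created with os.open rather than whatever locking the real cache uses. It also uses threads in one process, whereas a real run would more likely involve several processes.

```python
import os
import tempfile
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_once(cache_path, lock_path, counter, counter_lock):
    """Simulated cached download guarded by an exclusive lock file."""
    while True:
        try:
            # O_CREAT | O_EXCL fails if the lock file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(0.001)  # another worker holds the lock; retry
    try:
        if not os.path.exists(cache_path):
            time.sleep(0.01)  # pretend the transfer is slow
            with open(cache_path, "w") as f:
                f.write("checkpoint-bytes")
            with counter_lock:
                counter[0] += 1  # count real "downloads"
    finally:
        os.close(fd)
        os.remove(lock_path)  # release the lock
    return cache_path


def stress(num_workers=16):
    """Run many concurrent fetches; return (downloads performed, fetches completed)."""
    with tempfile.TemporaryDirectory() as tmp:
        cache_path = os.path.join(tmp, "model.pt")
        lock_path = cache_path + ".lock"
        counter = [0]
        counter_lock = threading.Lock()
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            results = list(
                pool.map(
                    lambda _: fetch_once(cache_path, lock_path, counter, counter_lock),
                    range(num_workers),
                )
            )
        return counter[0], len(results)


if __name__ == "__main__":
    downloads, fetches = stress()
    print(f"{fetches} concurrent fetches, {downloads} download(s) performed")
```

If the locking is correct, every worker returns the cached path but only one of them performs the write. A count above one would point to the kind of race condition described in the patch notes.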