Created by: dgrnbrg-meta
When you try to run the API service on a Slurm node without inheriting the Slurm environment, you get a very strange error:
... snip ...
  File "/shared/home/dgrnbrg/metaseq/metaseq/checkpoint_utils.py", line 316, in _is_checkpoint_sharded
    size_ratio = max(sizes) / min(sizes)
ValueError: max() arg is an empty sequence
It turns out this is due to incorrectly inferred config in metaseq/distributed/utils.py.
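The failure mode in the traceback can be illustrated in isolation. This is a minimal sketch, not metaseq's actual code: `size_ratio` is a hypothetical helper, and the only thing taken from the traceback is the `max(sizes) / min(sizes)` expression. It shows why an empty list of sizes produces the confusing `ValueError`, and how an explicit check could give a clearer message.

```python
def size_ratio(sizes):
    """Return max/min of a list of sizes, failing clearly when the list is empty.

    Hypothetical helper for illustration only; not part of metaseq.
    """
    if not sizes:
        # Without this check, max([]) raises
        # "ValueError: max() arg is an empty sequence", which hides the real cause.
        raise ValueError("no sizes to compare (empty list)")
    return max(sizes) / min(sizes)


def describe_failure(sizes):
    """Return the error message raised for the given sizes, or None on success."""
    try:
        size_ratio(sizes)
    except ValueError as err:
        return str(err)
    return None
```

With a check like this, an empty input reports that there was nothing to compare, rather than surfacing an error from `max()`.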
Patch Description
This adds a warning and a sane default (use the entire node that the API server is being run on).
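The warn-and-fall-back behaviour described above could look roughly like the sketch below. It is not the code from this patch: `infer_node_config` and `ASSUMED_SLURM_VARS` are hypothetical names, and the variable list is an assumption (they are real Slurm variables, but this description does not say which ones the patch checks). The fallback treats the current machine as a single node and uses all of its devices.

```python
import os
import warnings

# Assumption: real Slurm variable names, but which ones the patch
# actually inspects is not stated in this description.
ASSUMED_SLURM_VARS = (
    "SLURM_JOB_NODELIST",
    "SLURM_NNODES",
    "SLURM_NTASKS_PER_NODE",
    "SLURM_PROCID",
)


def infer_node_config(local_device_count, env=None):
    """Return (num_nodes, procs_per_node, used_fallback).

    Hypothetical helper: if any assumed Slurm variable is missing, warn and
    default to one node that uses every local device.
    """
    env = os.environ if env is None else env
    missing = [name for name in ASSUMED_SLURM_VARS if name not in env]
    if missing:
        warnings.warn(
            "Slurm environment not inherited (missing: "
            + ", ".join(missing)
            + "); defaulting to the entire local node."
        )
        return 1, local_device_count, True
    return int(env["SLURM_NNODES"]), int(env["SLURM_NTASKS_PER_NODE"]), False
```

A caller would pass the number of local accelerators, for example `infer_node_config(8)` on an 8-GPU node, and could log the returned flag to make the fallback visible.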
Testing steps
I tried removing each of these env vars, and they are all necessary: removing any subset of them causes a crash, and the crash differs depending on which ones are missing.
I ensured I could run and query the API successfully, using the fully sharded 175B-parameter model.