Created by: dgrnbrg-meta
Patch Description
Allows the hosted model to act as a Celery worker.
Please see https://github.com/fairinternal/metaseq-internal/pull/510, and specifically the readme at https://github.com/fairinternal/metaseq-internal/blob/82326b7ae049d3244293b448b617687c46440a13/api-infra/README.md, for details.
Testing steps
Ran `pip install celery redis[hiredis]` in the conda environment, connected to Celery/Redis, and submitted a query.
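As a quick sanity check (not part of the patch), whether the optional dependency resolves in the active environment determines which mode the server takes:

```shell
# Report which mode interactive_hosted would take in this environment
python3 -c "import celery" 2>/dev/null && echo "celery mode" || echo "flask-only mode"
```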
Next, ran `pip uninstall celery redis[hiredis]` in the conda environment. In the logs, saw:
Not starting with celery
2022-10-20 16:13:40 | WARNING | metaseq.cli.interactive | Missing slurm configuration, defaulting to 'use entire node' for API
2022-10-20 16:13:46 | INFO | metaseq.distributed.utils | initialized host fairwus3-1-htc-108 as rank 0
got override
2022-10-20 16:13:46 | INFO | metaseq.distributed.utils | initialized host fairwus3-1-htc-108 as rank 2
got override
2022-10-20 16:13:46 | INFO | metaseq.distributed.utils | initialized host fairwus3-1-htc-108 as rank 7
got override
2022-10-20 16:13:46 | INFO | metaseq.distributed.utils | initialized host fairwus3-1-htc-108 as rank 3
got override
2022-10-20 16:13:46 | INFO | metaseq.distributed.utils | initialized host fairwus3-1-htc-108 as rank 4
got override
2022-10-20 16:13:46 | INFO | metaseq.distributed.utils | initialized host fairwus3-1-htc-108 as rank 6
got override
2022-10-20 16:13:46 | INFO | metaseq.distributed.utils | initialized host fairwus3-1-htc-108 as rank 1
got override
2022-10-20 16:13:46 | INFO | metaseq.distributed.utils | initialized host fairwus3-1-htc-108 as rank 5
2022-10-20 16:13:54 | INFO | metaseq.distributed.utils | SLURM nodelist: fairwus3-1-htc-108
> initializing tensor model parallel with size 8
> initializing pipeline model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2719 and data parallel seed: 1
2022-10-20 16:13:58 | INFO | metaseq.hub_utils | loading model(s) from /mnt/resource_nvme/dgrnbrg/reshard.pt
2022-10-20 16:14:16 | INFO | metaseq.checkpoint_utils | Done reading from disk
2022-10-20 16:14:21 | INFO | metaseq.checkpoint_utils | Done loading state dict
2022-10-20 16:14:21 | INFO | metaseq.cli.interactive | loaded model 0
2022-10-20 16:14:27 | INFO | metaseq.cli.interactive | Worker engaged! 10.37.64.8:6010
* Serving Flask app 'interactive_hosted' (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
2022-10-20 16:14:27 | INFO | werkzeug | * Running on all addresses (0.0.0.0)
WARNING: This is a development server. Do not use it in a production deployment.
* Running on http://127.0.0.1:6010
* Running on http://10.37.64.8:6010 (Press CTRL+C to quit)
This shows that the server can start without the Celery dependency installed.
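The "Not starting with celery" line presumably comes from an optional-import guard around the Celery code path. A minimal sketch of that pattern (the names `has_celery` and `serve` are illustrative, not the PR's actual code):

```python
# Optional-dependency guard: fall back to plain Flask serving when Celery is absent.
try:
    import celery  # noqa: F401
    has_celery = True
except ImportError:
    has_celery = False

def serve() -> str:
    """Pick the serving mode based on which dependencies are installed."""
    if has_celery:
        return "celery worker + flask"
    print("Not starting with celery")
    return "flask only"
```

This keeps `celery` a soft dependency: environments without it still serve the plain Flask API.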
Finally, reinstalled Celery as above. The logs included:
2022-10-20 16:15:47 | WARNING | metaseq.cli.interactive | Missing slurm configuration, defaulting to 'use entire node' for API
got override
2022-10-20 16:15:54 | INFO | metaseq.distributed.utils | initialized host fairwus3-1-htc-108 as rank 4
2022-10-20 16:15:54 | INFO | metaseq.distributed.utils | initialized host fairwus3-1-htc-108 as rank 0
got override
2022-10-20 16:15:54 | INFO | metaseq.distributed.utils | initialized host fairwus3-1-htc-108 as rank 7
got override
2022-10-20 16:15:54 | INFO | metaseq.distributed.utils | initialized host fairwus3-1-htc-108 as rank 2
got override
2022-10-20 16:15:54 | INFO | metaseq.distributed.utils | initialized host fairwus3-1-htc-108 as rank 5
got override
2022-10-20 16:15:54 | INFO | metaseq.distributed.utils | initialized host fairwus3-1-htc-108 as rank 6
got override
2022-10-20 16:15:54 | INFO | metaseq.distributed.utils | initialized host fairwus3-1-htc-108 as rank 1
got override
2022-10-20 16:15:54 | INFO | metaseq.distributed.utils | initialized host fairwus3-1-htc-108 as rank 3
-------------- celery@fairwus3-1-htc-108 v5.2.7 (dawn-chorus)
--- ***** -----
-- ******* ---- Linux-5.4.0-1083-azure-x86_64-with-glibc2.27 2022-10-20 16:15:54
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app: metaseq.cli.interactive_hosted:0x151485813df0
- ** ---------- .> transport: redis://localhost:6379//
- ** ---------- .> results: redis://localhost:6379/
- *** --- * --- .> concurrency: 10 (thread)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
-------------- [queues]
.> gptz exchange=gptz(direct) key=gptz
==> logs/gptz-backend-stderr---supervisor-nuqgv3m1.log <==
[2022-10-20 16:16:02,471: INFO/MainProcess] SLURM nodelist: fairwus3-1-htc-108
[2022-10-20 16:16:02,472: WARNING/MainProcess] > initializing tensor model parallel with size 8
[2022-10-20 16:16:02,472: WARNING/MainProcess] > initializing pipeline model parallel with size 1
[2022-10-20 16:16:05,423: WARNING/MainProcess] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2719 and data parallel seed: 1
[2022-10-20 16:16:05,603: INFO/MainProcess] loading model(s) from /mnt/resource_nvme/dgrnbrg/reshard.pt
[2022-10-20 16:16:25,409: INFO/MainProcess] Done reading from disk
[2022-10-20 16:16:31,378: INFO/MainProcess] Done loading state dict
[2022-10-20 16:16:31,512: INFO/MainProcess] loaded model 0
[2022-10-20 16:16:35,654: INFO/MainProcess] Worker engaged! 10.37.64.8:6010
[2022-10-20 16:16:35,661: WARNING/MainProcess] * Serving Flask app 'interactive_hosted' (lazy loading)
[2022-10-20 16:16:35,661: WARNING/MainProcess] * Environment: production
[2022-10-20 16:16:35,662: WARNING/MainProcess] WARNING: This is a development server. Do not use it in a production deployment.
[2022-10-20 16:16:35,662: WARNING/MainProcess] Use a production WSGI server instead.
[2022-10-20 16:16:35,662: WARNING/MainProcess] * Debug mode: off
[2022-10-20 16:16:35,675: INFO/MainProcess] * Running on all addresses (0.0.0.0)
WARNING: This is a development server. Do not use it in a production deployment.
* Running on http://127.0.0.1:6010
* Running on http://10.37.64.8:6010 (Press CTRL+C to quit)
This shows that Celery and Flask both started successfully. I made successful queries to each:
(api-infra) dgrnbrg@fairwus3-1-htc-108:~/metaseq-internal/api-infra$ python
Python 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tasks import app
>>> phrase = 'A boy runs up to a dog. He says,'
>>> args = {'max_tokens': 32, 'min_tokens': 4, 'top_p': 0.9, 'n': 1, 'echo': True, 'stop': '\n'}
>>> t=app.signature('metaseq.cli.interactive_hosted.compute_gptz')
>>> t.apply_async((phrase, args), queue='gptz').get(timeout=10)
[{'text': 'A boy runs up to a dog. He says, "I\'ll give you $10 if you bite that man over there."\n', 'tokens': ['</s>', 'A', ' boy', ' runs', ' up', ' to', ' a', ' dog', '.', ' He', ' says', ',', ' "', 'I', "'ll", ' give', ' you', ' $', '10', ' if', ' you', ' bite', ' that', ' man', ' over', ' there', '."', '\n'], 'text_offset': [0, 1, 5, 10, 13, 16, 18, 22, 23, 26, 31, 32, 34, 35, 38, 43, 47, 49, 51, 54, 58, 63, 68, 72, 77, 83, 85], 'token_scores': [None, -4.333739280700684, -7.7863240242004395, -6.155770301818848, -3.3266587257385254, -0.18844783306121826, -3.0652873516082764, -5.2763872146606445, -2.325634479522705, -1.7917954921722412, -2.5778729915618896, -1.1685595512390137, -0.31558117270469666, -1.7844150066375732, -3.181551933288574, -0.3298424780368805, -0.009746207855641842, -2.2504146099090576, -1.7780544757843018, -0.6374585032463074, -0.009275766089558601, -1.9499202966690063, -1.686514139175415, -0.38561180233955383, -1.5499969720840454, -0.0037220758385956287, -0.19743964076042175, -2.5736234188079834], 'top_logprobs': None}]
>>>
The curl request's output is even longer, so I'll omit it for readability.
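For reference, the curl query went against the Flask server with a JSON body mirroring the Python session above. A sketch of building that request (the `/completions` path is an assumption, not confirmed by this PR; the payload fields are taken from the session):

```python
import json

# Payload mirroring the args submitted via Celery above
payload = {
    "prompt": "A boy runs up to a dog. He says,",
    "max_tokens": 32,
    "min_tokens": 4,
    "top_p": 0.9,
    "n": 1,
    "echo": True,
    "stop": "\n",
}

# Roughly equivalent curl invocation (endpoint path assumed):
#   curl -s -X POST http://10.37.64.8:6010/completions \
#        -H 'Content-Type: application/json' -d "$(python3 -c 'print(json.dumps(payload))')"
print(json.dumps(payload))
```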