#460 and #479 require `fairseq_v3` branch of Megatron-LM
Created by: EIFY

## 🐛 Bug

#460 and #479 require the `fairseq_v3` branch of Megatron-LM.
## To Reproduce

- Set up as instructed, with the `fairseq_v2` branch of Megatron-LM.
- Edit `metaseq/service/constants.py` as necessary. In my case, follow https://github.com/facebookresearch/metaseq/issues/407#issuecomment-1293015551 (a sketch of the kind of edits involved is shown after the error log below).
- Run `metaseq-api-local`.
- See error:
```
$ metaseq-api-local
2022-11-17 23:51:51 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 0
2022-11-17 23:51:51 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 2
2022-11-17 23:51:51 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 3
2022-11-17 23:51:51 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 6
2022-11-17 23:51:51 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 7
2022-11-17 23:51:51 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 5
2022-11-17 23:51:51 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 1
2022-11-17 23:51:51 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 4
Traceback (most recent call last):
  File "/home/jason_chou/metaseq/metaseq/distributed/utils.py", line 176, in distributed_init
    from megatron.global_vars import (
ImportError: cannot import name '_GLOBAL_MEMORY_BUFFER' from 'megatron.global_vars' (/home/jason_chou/Megatron-LM/megatron/global_vars.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jason_chou/.conda/envs/user/bin/metaseq-api-local", line 8, in <module>
    sys.exit(cli_main())
  File "/home/jason_chou/metaseq/metaseq/cli/interactive_hosted.py", line 380, in cli_main
    distributed_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/home/jason_chou/metaseq/metaseq/distributed/utils.py", line 283, in call_main
    return _spawn_helper(main, cfg, kwargs)
  File "/home/jason_chou/metaseq/metaseq/distributed/utils.py", line 261, in _spawn_helper
    retval = distributed_main(-1, main, cfg, kwargs)
  File "/home/jason_chou/metaseq/metaseq/distributed/utils.py", line 218, in distributed_main
    cfg.distributed_training.distributed_rank = distributed_init(cfg)
  File "/home/jason_chou/metaseq/metaseq/distributed/utils.py", line 181, in distributed_init
    raise ImportError(
ImportError:
Please install megatron using the setup instructions!
```
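For context, the `constants.py` edits referred to above just point the service at a local checkpoint. A minimal sketch of what mine amount to, with paths and sizes taken from the logs in this report rather than copied verbatim from the linked comment:

```python
# metaseq/service/constants.py (sketch; paths/values are illustrative)
import os

MODEL_PARALLEL = 8    # matches "initializing tensor model parallel with size 8"
TOTAL_WORLD_SIZE = 8  # 8 x V100 in this setup
CHECKPOINT_FOLDER = "/home/jason_chou/redspot_home/66b"
CHECKPOINT_LOCAL = os.path.join(CHECKPOINT_FOLDER, "reshard.pt")
```

The first traceback shows `distributed_init` failing on an import from `megatron.global_vars`. The incompatibility can be checked in isolation; a minimal probe, assuming nothing beyond the name the traceback itself references:

```python
# Probe which Megatron-LM branch is installed: fairseq_v2 lacks
# _GLOBAL_MEMORY_BUFFER, which metaseq's distributed_init imports.
try:
    from megatron.global_vars import _GLOBAL_MEMORY_BUFFER  # noqa: F401
    print("Megatron-LM exposes _GLOBAL_MEMORY_BUFFER (fairseq_v3-compatible)")
except ImportError:
    print("Megatron-LM lacks _GLOBAL_MEMORY_BUFFER (e.g. fairseq_v2)")
```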
- If I roll back to the commit right before #479 (364d7315), `metaseq-api-local` runs, but actually requesting a completion then fails:
```
~/metaseq$ git checkout 364d7315dfe91046fb1b58450edeac67e7d83a10
M	metaseq/service/constants.py
Note: switching to '364d7315dfe91046fb1b58450edeac67e7d83a10'.
(...)
```
```
$ metaseq-api-local
2022-11-18 00:00:43 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 0
2022-11-18 00:00:43 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 7
2022-11-18 00:00:43 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 5
2022-11-18 00:00:43 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 6
2022-11-18 00:00:43 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 2
2022-11-18 00:00:43 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 3
2022-11-18 00:00:43 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 4
2022-11-18 00:00:43 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 1
> initializing tensor model parallel with size 8
> initializing pipeline model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2719 and data parallel seed: 1
2022-11-18 00:00:47 | INFO | metaseq.hub_utils | loading model(s) from /home/jason_chou/redspot_home/66b/reshard.pt
2022-11-18 00:01:07 | INFO | metaseq.checkpoint_utils | Done reading from disk
2022-11-18 00:01:13 | INFO | metaseq.checkpoint_utils | Done loading state dict
2022-11-18 00:01:14 | INFO | metaseq.cli.interactive | loaded model 0
2022-11-18 00:01:14 | INFO | metaseq.cli.interactive | Worker engaged! 172.21.45.228:6010
 * Serving Flask app 'metaseq.cli.interactive_hosted' (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
2022-11-18 00:01:14 | INFO | werkzeug | WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:6010
 * Running on http://172.21.45.228:6010
2022-11-18 00:01:14 | INFO | werkzeug | Press CTRL+C to quit
```

```
$ curl -k http://localhost:6010/completions -H "Content-Type: application/json" -H "Authorization: authentic" -d '{
    "prompt": "Description: A chair, two beds\nItems mentioned: 1 chair, 2 beds.\nDescription: A carpet, four beds\nItems mentioned: 1 carpet, 4 beds.\nDescription: Outside chair leg broken unrepairable/trash left around entire home\nItems mentioned: 1 chair.\nDescription: 3 rugs - 2 kitchen rugs and the living room rug\nItems mentioned: 3 rugs.",
    "temperature": 1.0,
    "max_tokens": 32, "min_tokens": 4,
    "top_p": 0.9, "n": 1,
    "echo": false, "stop": "\n"
}'| jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2557  100  2093  100   464   2450    543 --:--:-- --:--:-- --:--:--  2990
{
  "error": {
    "code": null,
    "message": "module 'megatron.mpu' has no attribute 'LinearWithGradAccumulationAndAsyncCommunication'",
    "param": null,
    "stacktrace": [
      "  File \"/home/default_user/.conda/envs/user/lib/python3.10/site-packages/flask/app.py\", line 1523, in full_dispatch_request\n    rv = self.dispatch_request()\n",
      "  File \"/home/default_user/.conda/envs/user/lib/python3.10/site-packages/flask/app.py\", line 1509, in dispatch_request\n    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)\n",
      "  File \"/home/jason_chou/metaseq/metaseq/cli/interactive_hosted.py\", line 348, in completions\n    raise generations\n",
      "  File \"/home/jason_chou/metaseq/metaseq/cli/interactive_hosted.py\", line 153, in batching_loop\n    generations = generator.generate(**request_object)\n",
      "  File \"/home/jason_chou/metaseq/metaseq/hub_utils.py\", line 277, in generate\n    translations = self.task.inference_step(generator, self.models, batch)\n",
      "  File \"/home/jason_chou/metaseq/metaseq/tasks/base_task.py\", line 426, in inference_step\n    return generator.generate(models, sample, prefix_tokens=prefix_tokens)\n",
      "  File \"/home/default_user/.conda/envs/user/lib/python3.10/site-packages/torch/autograd/grad_mode.py\", line 27, in decorate_context\n    return func(*args, **kwargs)\n",
      "  File \"/home/jason_chou/metaseq/metaseq/sequence_generator.py\", line 88, in generate\n    return self._generate(sample, **kwargs)\n",
      "  File \"/home/jason_chou/metaseq/metaseq/sequence_generator.py\", line 169, in _generate\n    model_out = self.model.decoder(\n",
      "  File \"/home/default_user/.conda/envs/user/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1130, in _call_impl\n    return forward_call(*input, **kwargs)\n",
      "  File \"/home/jason_chou/metaseq/metaseq/models/transformer_decoder.py\", line 379, in forward\n    x = self.output_layer(x)\n",
      "  File \"/home/jason_chou/metaseq/metaseq/model_parallel/models/transformer.py\", line 65, in output_layer\n    x = mpu.LinearWithGradAccumulationAndAsyncCommunication.apply(\n"
    ],
    "type": "invalid_request_error"
  }
}
```
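Here the failing call is `metaseq/model_parallel/models/transformer.py` invoking `mpu.LinearWithGradAccumulationAndAsyncCommunication.apply(...)`, an autograd function the `fairseq_v2` branch evidently doesn't define. A one-line probe (a sketch, assuming nothing beyond the attribute named in the error):

```python
# Check whether the installed Megatron-LM defines the autograd function
# that metaseq's model-parallel output_layer relies on.
from megatron import mpu

print(hasattr(mpu, "LinearWithGradAccumulationAndAsyncCommunication"))
# False here on fairseq_v2; expected True on fairseq_v3, given the run below.
```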
- If I roll all the way back to the commit right before #460 (ce294a11), then things work:
```
~/metaseq$ git checkout ce294a115cecf02efb8bae2f26305728d7c05500
M	metaseq/service/constants.py
Previous HEAD position was 364d731 Launch separate sbatch job to copy checkpoints over from scratch to NFS (#494)
HEAD is now at ce294a1 Remove different dimension args (#462)
```
```
$ metaseq-api-local
2022-11-18 00:07:18 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 0
2022-11-18 00:07:18 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 5
2022-11-18 00:07:18 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 3
2022-11-18 00:07:18 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 1
2022-11-18 00:07:18 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 4
2022-11-18 00:07:18 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 6
2022-11-18 00:07:18 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 2
2022-11-18 00:07:18 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 7
> initializing tensor model parallel with size 8
> initializing pipeline model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2719 and data parallel seed: 1
2022-11-18 00:07:22 | INFO | metaseq.hub_utils | loading model(s) from /home/jason_chou/redspot_home/66b/reshard.pt
2022-11-18 00:07:42 | INFO | metaseq.checkpoint_utils | Done reading from disk
2022-11-18 00:07:46 | INFO | metaseq.checkpoint_utils | Done loading state dict
2022-11-18 00:07:47 | INFO | metaseq.cli.interactive | loaded model 0
2022-11-18 00:07:48 | INFO | metaseq.cli.interactive | Worker engaged! 172.21.45.228:6010
 * Serving Flask app 'metaseq.cli.interactive_hosted' (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
2022-11-18 00:07:48 | INFO | werkzeug | WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:6010
 * Running on http://172.21.45.228:6010
2022-11-18 00:07:48 | INFO | werkzeug | Press CTRL+C to quit
```

```
$ curl -k http://localhost:6010/completions -H "Content-Type: application/json" -H "Authorization: authentic" -d '{
    "prompt": "Description: A chair, two beds\nItems mentioned: 1 chair, 2 beds.\nDescription: A carpet, four beds\nItems mentioned: 1 carpet, 4 beds.\nDescription: Outside chair leg broken unrepairable/trash left around entire home\nItems mentioned: 1 chair.\nDescription: 3 rugs - 2 kitchen rugs and the living room rug\nItems mentioned: 3 rugs.",
    "temperature": 1.0,
    "max_tokens": 32, "min_tokens": 4,
    "top_p": 0.9, "n": 1,
    "echo": false, "stop": "\n"
}'| jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1811  100  1347  100   464    382    131  0:00:03  0:00:03 --:--:--   514
{
  "choices": [
    {
      "logprobs": {
        "finish_reason": "length",
        "text_offset": [
          0,
          2,
          4,
          11,
          15,
          21,
          23,
          26,
          27,
          29,
          33,
          33,
          34,
          35,
          40,
          43,
          45,
          48,
          59,
          63,
          66,
          69,
          77,
          80,
          89,
          91,
          95,
          103,
          106,
          108,
          112,
          114
        ],
        "token_logprobs": [
          -2.8142499923706055,
          -7.563986301422119,
          -5.285521984100342,
          -2.192822217941284,
          -4.636736869812012,
          -1.7162295579910278,
          -0.0005318744806572795,
          -2.6370177268981934,
          -2.1895456314086914,
          -1.9937776327133179,
          -1.9050769805908203,
          -0.002048682188615203,
          -0.00029025712865404785,
          -1.0687462091445923,
          -1.5542668104171753,
          -3.009408950805664,
          -3.9404053688049316,
          -5.934344291687012,
          -1.175774335861206,
          -1.0492457151412964,
          -1.059978723526001,
          -4.073267459869385,
          -2.1524734497070312,
          -3.039748430252075,
          -2.659883499145508,
          -5.401388168334961,
          -2.7808990478515625,
          -0.6962568163871765,
          -0.23176079988479614,
          -0.10268733650445938,
          -2.26662015914917,
          -0.04845426231622696
        ],
        "tokens": [
          " (",
          "My",
          " family",
          " has",
          " three",
          " r",
          "ugs",
          ".",
          " I",
          " don",
          "�",
          "�",
          "t",
          " know",
          " if",
          " I",
          " am",
          " forgetting",
          " one",
          " or",
          " if",
          " someone",
          " is",
          " counting",
          " a",
          " dog",
          " blanket",
          " as",
          " a",
          " rug",
          ").",
          "\n"
        ],
        "top_logprobs": null
      },
      "text": " (My family has three rugs. I don’t know if I am forgetting one or if someone is counting a dog blanket as a rug).\n"
    }
  ],
  "created": 1668730119,
  "id": "0e60c361-808a-4028-a064-3b625a66d36e",
  "model": "/home/jason_chou/redspot_home/66b/",
  "object": "text_completion"
}
```
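For completeness, the same request can be issued from Python; a minimal sketch using the third-party `requests` package (an assumption; any HTTP client works), with the endpoint, headers, and body exactly as in the curl calls above:

```python
# Sketch: the curl request above, issued from Python via `requests`
# (assumed installed; `json=` sets the Content-Type header automatically).
import requests

resp = requests.post(
    "http://localhost:6010/completions",
    headers={"Authorization": "authentic"},  # same token as the curl call
    json={
        "prompt": "Description: A chair, two beds\nItems mentioned: 1 chair, 2 beds.\nDescription: A carpet, four beds\nItems mentioned: 1 carpet, 4 beds.\nDescription: Outside chair leg broken unrepairable/trash left around entire home\nItems mentioned: 1 chair.\nDescription: 3 rugs - 2 kitchen rugs and the living room rug\nItems mentioned: 3 rugs.",
        "temperature": 1.0,
        "max_tokens": 32,
        "min_tokens": 4,
        "top_p": 0.9,
        "n": 1,
        "echo": False,
        "stop": "\n",
    },
)
# Mirrors the jq output: either a "choices" list or an "error" object.
print(resp.json())
```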
- Alternatively, the current `main` also works with the `fairseq_v3` branch of Megatron-LM:
```
~/metaseq$ git checkout main
M	metaseq/service/constants.py
Previous HEAD position was ce294a1 Remove different dimension args (#462)
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
$ cd ../Megatron-LM/
~/Megatron-LM$ git checkout fairseq_v3
Switched to branch 'fairseq_v3'
Your branch is up to date with 'origin/fairseq_v3'.
```
```
$ metaseq-api-local
2022-11-18 00:14:15 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 0
2022-11-18 00:14:15 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 7
2022-11-18 00:14:15 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 2
2022-11-18 00:14:15 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 5
2022-11-18 00:14:15 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 4
2022-11-18 00:14:15 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 3
2022-11-18 00:14:15 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 1
2022-11-18 00:14:15 | INFO | metaseq.distributed.utils | initialized host i-0bf8e5569aa4999be as rank 6
> initializing tensor model parallel with size 8
> initializing pipeline model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2719 and data parallel seed: 1
2022-11-18 00:14:19 | INFO | metaseq.hub_utils | loading model(s) from /home/jason_chou/redspot_home/66b/reshard.pt
2022-11-18 00:14:40 | INFO | metaseq.checkpoint_utils | Done reading from disk
2022-11-18 00:14:45 | INFO | metaseq.checkpoint_utils | Done loading state dict
2022-11-18 00:14:46 | INFO | metaseq.cli.interactive | loaded model 0
2022-11-18 00:14:47 | INFO | metaseq.cli.interactive | Worker engaged! 172.21.45.228:6010
 * Serving Flask app 'metaseq.cli.interactive_hosted' (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
2022-11-18 00:14:47 | INFO | werkzeug | WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:6010
 * Running on http://172.21.45.228:6010
2022-11-18 00:14:47 | INFO | werkzeug | Press CTRL+C to quit
```

```
$ curl -k http://localhost:6010/completions -H "Content-Type: application/json" -H "Authorization: authentic" -d '{
    "prompt": "Description: A chair, two beds\nItems mentioned: 1 chair, 2 beds.\nDescription: A carpet, four beds\nItems mentioned: 1 carpet, 4 beds.\nDescription: Outside chair leg broken unrepairable/trash left around entire home\nItems mentioned: 1 chair.\nDescription: 3 rugs - 2 kitchen rugs and the living room rug\nItems mentioned: 3 rugs.",
    "temperature": 1.0,
    "max_tokens": 32, "min_tokens": 4,
    "top_p": 0.9, "n": 1,
    "echo": false, "stop": "\n"
}'| jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   935  100   471  100   464    331    326  0:00:01  0:00:01 --:--:--   658
{
  "choices": [
    {
      "logprobs": {
        "finish_reason": "length",
        "text_offset": [
          0,
          3,
          9,
          15,
          22,
          23
        ],
        "token_logprobs": [
          -4.556225776672363,
          -3.6027157306671143,
          -1.8605602979660034,
          -2.4117279052734375,
          -0.19675344228744507,
          -0.05910465121269226
        ],
        "tokens": [
          " No",
          " other",
          " items",
          " listed",
          ".",
          "\n"
        ],
        "top_logprobs": null
      },
      "text": " No other items listed.\n"
    }
  ],
  "created": 1668730514,
  "id": "e856b3cc-d46a-491c-a5ab-c440e6bac510",
  "model": "/home/jason_chou/redspot_home/66b/",
  "object": "text_completion"
}
```
It seems that we should either instruct people to just use the `fairseq_v3` branch of Megatron-LM, or roll back #460 and #479.
## Expected behavior

`metaseq-api-local` just works.
## Environment

- metaseq Version: 4e1592ae (current `main`)
- PyTorch Version: 1.12.1+cu113
- OS: Ubuntu 18.04.6 LTS
- How you installed metaseq: pip
- Build command you used (if compiling from source): N/A
- Python version: 3.10
- CUDA/cuDNN version: CUDA 11.8
- GPU models and configuration: 8 x V100 SXM2 32 GB