[Community] Add integration for serving OPT-175B using the Alpa backend
Created by: zhisbug
Motivation
Thanks for open-sourcing the OPT model weights!
With much help from this thread, we have recently integrated OPT serving with the Alpa system (developed at UC Berkeley).
Alpa offers unique advantages for training or serving large OPT models on more flexible/heterogeneous cluster setups and GPU specs -- for example, in-house clusters with lower-end GPUs (e.g., 40GB A100, V100, or even Titan X, 2080Ti) -- as long as the total GPU memory is sufficient.
Alpa also provides various parallelism strategies beyond FSDP and tensor model parallelism, which can be more advantageous in certain cluster setups, e.g., multi-node clusters or clusters with limited communication bandwidth.
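To make this concrete, here is a minimal sketch of how Alpa exposes its parallelization, following the names in Alpa's public quickstart (`alpa.init`, `alpa.parallelize`); the toy linear model and training step below are placeholders for illustration, not OPT itself.

```python
# Minimal sketch based on Alpa's quickstart API; the toy model is a
# placeholder, not OPT.
import alpa
import jax
import jax.numpy as jnp

alpa.init(cluster="ray")  # connect to a (possibly multi-node) Ray cluster

# Alpa searches over data, operator (tensor), and pipeline parallelism to
# pick an execution plan for the given cluster, instead of fixing one
# strategy such as FSDP or tensor model parallelism up front.
@alpa.parallelize
def train_step(params, batch):
    def loss_fn(p):
        preds = batch["x"] @ p["w"]  # toy linear model
        return jnp.mean((preds - batch["y"]) ** 2)
    grads = jax.grad(loss_fn)(params)
    return jax.tree_util.tree_map(lambda p, g: p - 0.01 * g, params, grads)

params = {"w": jnp.ones((1024, 256))}
batch = {"x": jnp.ones((32, 1024)), "y": jnp.zeros((32, 256))}
params = train_step(params, batch)  # executes distributed across the cluster
```

Because the strategy is searched rather than hand-picked, the same script can run on clusters with very different GPU counts, memory sizes, and interconnects.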
Pitch
Like the HuggingFace integration with OPT, we want to contribute a guide on how to set up OPT-175B serving (and later, inference or training if there is enough interest), so people who do not have 8x 80GB A100s can still benefit from these weights by using Alpa.
We have written a setup guide on our end. For the OPT-175B model, we have tested it on various cluster setups, such as 32x 16GB V100 (AWS p3.16x) or 12x 40GB A100.
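To give a rough picture of what the guide covers, below is a hedged sketch of serving OPT through Alpa, modeled on Alpa's public serving example; the `llm_serving` module path, the `alpa/opt-175b` model name, and the weight path are assumptions for illustration, and the PR would pin down the exact commands.

```python
# Hypothetical usage sketch modeled on Alpa's serving example; the
# llm_serving module path, model name, and weight path are assumptions.
from transformers import AutoTokenizer
from llm_serving.model.wrapper import get_model  # assumed Alpa example module

# OPT checkpoints share the GPT-2 BPE tokenizer, so a smaller OPT
# tokenizer also works for the 175B model.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", use_fast=False)

# Alpa shards the weights across whatever GPUs the cluster exposes
# (e.g., 32x 16GB V100), rather than requiring 8x 80GB A100.
model = get_model(model_name="alpa/opt-175b", path="/path/to/opt_weights")

input_ids = tokenizer("Paris is the capital city of",
                      return_tensors="pt").input_ids
outputs = model.generate(input_ids=input_ids, max_length=64, do_sample=True)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```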
Plan
I can draft a PR to Metaseq describing how to use this integration, and it might need your review. What do you think? @stephenroller