Created by: lilisierrayu
Patch Description: Fix model initialization for some of the OPT models.
Issue: Models up to 13B load successfully but then fail with a half/float dtype mismatch error, while the 30B model runs fine (see discussion: https://fb.workplace.com/groups/gogogptzusers/permalink/762243555001669/).
Debugging:
After loading the model from {azure_dir}/1.3B/consolidated_mp_2/consolidated.pt and printing [p.dtype for p in model.parameters()], the output shows a mix of torch.float16 and torch.float32.
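The dtype check above can be reproduced on any module. A minimal sketch (using a toy nn.Sequential stand-in rather than the actual OPT model) of how mixed precision shows up when only part of a model has been converted to fp16:

```python
import torch
import torch.nn as nn

# Toy stand-in for a partially-initialized model: one layer converted
# to fp16, the other left in the default fp32.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
model[0].half()

# Same check as in the debugging notes: collect parameter dtypes.
dtypes = {p.dtype for p in model.parameters()}
print(dtypes)  # a correctly initialized fp16 model would show only {torch.float16}
```

A healthy checkpoint load should yield a single dtype here; seeing both torch.float16 and torch.float32 is the symptom described above.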
Found that cfg.model.tensor_parallel_init_model_on_gpu = False in the 1.3B model config (it is True for the 30B model), so the model is not properly initialized in fp16. The mismatch fails silently because model.load_state_dict auto-casts checkpoint tensors to the destination parameter dtype instead of raising an error.
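The silent failure mode can be demonstrated in isolation: load_state_dict copies each checkpoint tensor into the existing parameter, casting to the destination's dtype rather than erroring on a mismatch. A sketch with toy nn.Linear modules (not the actual OPT loading path):

```python
import torch
import torch.nn as nn

# fp16 "checkpoint" weights, as saved from a half-precision model.
src = nn.Linear(4, 4).half()

# Destination model left in fp32, simulating the missing fp16 init.
dst = nn.Linear(4, 4)

# No error is raised: copy_ inside load_state_dict casts the fp16
# checkpoint tensors up to the destination's fp32 dtype.
dst.load_state_dict(src.state_dict())
print(next(dst.parameters()).dtype)  # torch.float32
```

So the bad dtype only surfaces later, at runtime, as a half/float mismatch between correctly- and incorrectly-initialized pieces of the model.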
Testing steps
With the fix, able to launch interactive_cli.py and interactive_hosted.py with the following model paths:
- f"--path {azure_dir}/2.7B/consolidated_mp_1/consolidated.pt"
- f"--path {azure_dir}/30B/consolidated_mp_4/reshard.pt"
- f"--path {azure_dir}/30B/consolidated_mp_2/consolidated.pt"
- f"--path {azure_dir}/1.3B/consolidated_mp_2/consolidated.pt"