Created by: bashnick
Patch Description
This PR removes unnecessary inheritance layers and flattens the class structure for better interpretability and transparency. Short summary (a before/after sketch follows the list):
- removed TransformerDecoder -> ModelParallelTransformerDecoder
- removed LanguageModel -> BaseModel
- removed TransformerDecoderLayer -> ModelParallelTransformerDecoderLayer
- removed MultiheadAttention -> ModelParallelMultiheadAttention
- removed arch transformer_lm -> transformer_lm_megatron
- updated the gpu_tests/test_hf_compatibility.py test to work with model_parallel
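To make the flattening concrete, here is a minimal before/after sketch using the attention class as an example; the class body, dimensions, and plain nn.Linear projections are illustrative only (the real code uses model-parallel projections), and the same pattern applies to the decoder, decoder layer, and base model classes listed above:

```python
import torch
import torch.nn as nn

# Before this PR (two layers of inheritance):
#
#   class MultiheadAttention(nn.Module): ...                        # generic base
#   class ModelParallelMultiheadAttention(MultiheadAttention): ...  # thin override layer
#
# After this PR the intermediate base is gone and the model-parallel class
# inherits from nn.Module directly, with the attention logic inlined.
# NOTE: illustrative sketch only; the actual class uses column/row-parallel
# projections rather than plain nn.Linear.

class ModelParallelMultiheadAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        bsz, seq_len, _ = x.shape
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, head_dim)
            return t.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(bsz, seq_len, self.embed_dim)
        return self.out_proj(out)
```

With the intermediate bases removed, the class one sees in stack traces, module printouts, and checkpoint state_dicts is the model-parallel implementation itself, which is the interpretability/transparency gain this PR is after.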
Testing steps
- tested model run: python -m PROJECT_NAME.projects.MODEL_NAME.sweep_baseline -g 4 -n 1 --rsc --model-size 8m --tokenizer rsc --prefix NB000 --local --data /checkpoint/TEAM_NAME/datasets/consolidated/v4.0
- tested evaluations: FSD=/checkpoint/TEAM_NAME/datasets/few_shot_data python PROJECT_NAME/scripts/eval/schedule_jobs_few_shot_opt_evaluation.py -t copa cb flan_cb --model-name punitkoura_125m --model-path /checkpoint/TEAM_NAME/checkpoints/punitkoura/small_test_run/1000/checkpoint_1000.pt --model-template gptz_sharded_config --nshot 0 -o ~/MODEL_NAME/fix_scoring_001 --slurm-partition learn --combine-tasks --max-ingestible-tokens 4000
- tested a continued run, resuming training from an existing checkpoint