Created by: ngoyal2707
This branch is what I am using to train the latest model. It contains the changes listed below.
I will not merge this branch; instead I will split it into separate PRs, one for each of the following:
- Seq_parallel
- ability to only checkpoint MHA
- disable bias
- disable LN affine
- activation cpu_offload (experimental)
- merging of GELU with FC2, to save the activation memory of the GELU output
- a bunch of cleanup
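
For readers unfamiliar with some of these items, here is a minimal PyTorch sketch of four of them: checkpointing only MHA, disabling bias, disabling LN affine, and merging GELU with FC2. It is not the code in this branch. The names `GeluFC2`, `Block`, and `checkpoint_mha` are hypothetical and exist only for illustration, and `nn.MultiheadAttention` stands in for whatever attention module the model actually uses. Seq_parallel and cpu_offload are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint


class GeluFC2(torch.autograd.Function):
    """Hypothetical fused GELU + FC2 (no bias).

    Only the GELU *input* is saved for backward; the GELU output is
    recomputed there, so it never has to be kept in activation memory.
    """

    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        return F.gelu(x) @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        # Recompute GELU with grad enabled so we can differentiate through it.
        with torch.enable_grad():
            x_ = x.detach().requires_grad_(True)
            act = F.gelu(x_)
        grad_act = grad_out @ weight
        (grad_x,) = torch.autograd.grad(act, x_, grad_act)
        # Weight gradient of y = act @ W^T, flattened over leading dims.
        grad_w = (
            grad_out.reshape(-1, grad_out.shape[-1]).t()
            @ act.detach().reshape(-1, act.shape[-1])
        )
        return grad_x, grad_w


class Block(nn.Module):
    """Hypothetical transformer block illustrating the options above."""

    def __init__(self, dim=64, heads=4, checkpoint_mha=True):
        super().__init__()
        # "disable LN affine": no learnable scale/shift in LayerNorm.
        self.ln1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.ln2 = nn.LayerNorm(dim, elementwise_affine=False)
        # "disable bias": no bias terms in attention or FFN projections.
        self.attn = nn.MultiheadAttention(dim, heads, bias=False, batch_first=True)
        self.fc1 = nn.Linear(dim, 4 * dim, bias=False)
        self.fc2 = nn.Linear(4 * dim, dim, bias=False)
        self.checkpoint_mha = checkpoint_mha

    def _attn(self, h):
        return self.attn(h, h, h, need_weights=False)[0]

    def forward(self, x):
        h = self.ln1(x)
        if self.checkpoint_mha and self.training:
            # "only checkpoint MHA": recompute attention in backward,
            # while the FFN activations are stored as usual.
            a = checkpoint(self._attn, h, use_reentrant=False)
        else:
            a = self._attn(h)
        x = x + a
        return x + GeluFC2.apply(self.fc1(self.ln2(x)), self.fc2.weight)
```

In this sketch the fused function trades one extra GELU evaluation in backward for not storing a tensor of size `4 * dim` per token, which is the saving the GELU/FC2 item refers to.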