Created by: ngoyal2707
Gives a 2-3% speedup with almost no perplexity (ppl) loss. Also, it is very hard to get correct bias gradients with sequence parallel + FSDP for now.
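
For illustration only: the description does not spell out the change, so the sketch below assumes it amounts to dropping the bias term from linear layers (inferred from the mention of bias gradients). `linear_param_count` and `linear_forward` are hypothetical helpers written in plain Python, not code from this PR or from any library.

```python
from typing import List, Optional


def linear_param_count(in_features: int, out_features: int, bias: bool) -> int:
    """Number of trainable parameters in a single linear layer."""
    weight_params = in_features * out_features
    bias_params = out_features if bias else 0
    return weight_params + bias_params


def linear_forward(
    x: List[float],
    weight: List[List[float]],
    bias: Optional[List[float]] = None,
) -> List[float]:
    """Compute y = W x (+ b) for one input vector.

    `weight` has one row per output feature. With bias=None the layer
    has no bias parameter, so no bias gradient exists to synchronize.
    """
    out = [sum(w * v for w, v in zip(row, x)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out
```

Under this assumption, a bias-free layer has no separate bias gradient to keep consistent across sequence-parallel ranks and FSDP shards, which is the difficulty the description refers to.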