Created by: ruanslv
Patch Description: v2 of https://github.com/facebookresearch/metaseq/pull/343, with an attempt to decouple the gate logic from the decoder.
Also consolidated our GeLU implementations into a new version of gelu_accurate that explicitly defines the multiplying constants and relies on JIT for better performance.
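
For reviewers unfamiliar with the formula, here is a minimal plain-Python sketch of the tanh-based GeLU approximation with its constants written out explicitly. It assumes gelu_accurate refers to that approximation. The names `gelu_tanh`, `gelu_exact`, `SQRT_2_OVER_PI`, and `CUBIC_COEFF` are illustrative and not taken from this patch, and the sketch works on scalars with no JIT, whereas the patched code operates on tensors.

```python
import math

# Illustrative constants (names are assumptions, not from the patch).
SQRT_2_OVER_PI = 0.7978845608028654  # sqrt(2 / pi), precomputed
CUBIC_COEFF = 0.044715               # coefficient of the cubic term


def gelu_tanh(x: float) -> float:
    """Tanh-based approximation of GeLU for a single scalar."""
    inner = SQRT_2_OVER_PI * (x + CUBIC_COEFF * x * x * x)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_exact(x: float) -> float:
    """Reference GeLU using the Gaussian CDF via erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    # Print both values side by side to eyeball the approximation error.
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(x, gelu_tanh(x), gelu_exact(x))
```

Writing the constants as literals avoids recomputing `sqrt(2 / pi)` on every call, which is presumably part of what makes the expression friendly to JIT compilation.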
Testing steps: Running ablations to compare performance against previous runs and making sure PPL matches.