_future_mask is created on CPU
Created by: KUNAL1612
🐛 Bug
Using the profiler to look for CPU-intensive operations led me to observe that `_future_mask` in Line 725 of transformer.py is being instantiated on CPU.
While the mask is eventually moved to GPU in L747, the operations between L728 and L745 run on CPU, since the tensor itself still lives on CPU at that point.
However, that code path is only taken during document attention, when the doc separator is not -1. So creating `self._future_mask` on GPU in L725 would only offer marginal improvements in a few specific scenarios.
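For reference, here is a minimal, hypothetical sketch of the proposed change: a standalone version of a buffered future-mask helper (not fairseq's actual function) that builds the causal mask directly on the input tensor's device, so the fill/`triu` work never happens on CPU when the input is on GPU:

```python
import torch


def buffered_future_mask_on_device(tensor, cached_mask=None):
    """Build (or reuse) a causal future mask on `tensor`'s device.

    Hypothetical simplified stand-in for fairseq's buffered mask:
    allocating the mask on the target device up front avoids doing
    the fill/triu ops on CPU and then copying the result over.
    """
    dim = tensor.size(0)
    if (
        cached_mask is None
        or cached_mask.device != tensor.device
        or cached_mask.size(0) < dim
    ):
        # diagonal=1 keeps the diagonal and masks strictly-future positions
        cached_mask = torch.triu(
            torch.full((dim, dim), float("-inf"), device=tensor.device),
            diagonal=1,
        )
    return cached_mask[:dim, :dim]


x = torch.zeros(4, 2)  # mask device follows the input's device
mask = buffered_future_mask_on_device(x)
```

On a second call with a tensor on the same device and equal or smaller length, the cached mask is sliced and reused rather than rebuilt.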
Need advice on whether this change is worth making.