Created by: KUNAL1612
Patch Description
Moved the future mask to CUDA so that all operations for document attention also take place on CUDA. See issue #285 for context. The speedup from this change is marginal per call, but it can accumulate over many runs.
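A minimal sketch of the idea, assuming a `buffered_future_mask`-style helper (the name and signature here are illustrative, not the exact code in this patch): the causal/future mask is allocated directly on the input tensor's device instead of on CPU, so the later document-attention masking never leaves the GPU.

```python
import torch

def buffered_future_mask(tensor: torch.Tensor) -> torch.Tensor:
    """Build a causal (future) mask on the same device as `tensor`.

    Hypothetical sketch: allocating the mask with `device=tensor.device`
    avoids a CPU-side allocation followed by a host-to-device copy.
    """
    dim = tensor.size(1)
    # Upper-triangular -inf mask; positions j > i are blocked.
    future_mask = torch.triu(
        torch.full((dim, dim), float("-inf"), device=tensor.device),
        diagonal=1,
    )
    return future_mask
```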
To test, I set the attn doc separator to a random number to force execution of this branch, and timed it using the methods from #220 (closed). Over 436 calls to the function, it takes 0.42 seconds on CPU vs 0.09 seconds on GPU.
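For reference, a hedged sketch of how such a measurement can be taken (the harness below is illustrative and not the exact script used in #220): synchronizing around the timed loop ensures queued GPU kernels are counted in the elapsed time.

```python
import time
import torch

def time_mask_fn(fn, tensor: torch.Tensor, n_calls: int = 436) -> float:
    """Time `fn(tensor)` over `n_calls` iterations, syncing CUDA if needed."""
    if tensor.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_calls):
        fn(tensor)
    if tensor.is_cuda:
        torch.cuda.synchronize()
    return time.perf_counter() - start
```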