Created by: ngoyal2707
Working version of CUDA graphs with incremental decoding. Currently, at n_best=1, even with a single CUDA graph of max_seq_len (2048), I see the following improvement for 175B on 8x Azure A100:
per-token latency, baseline: ~95 ms
per-token latency, this change: ~66 ms
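For readers unfamiliar with the mechanics, here is a minimal sketch of the standard PyTorch capture/replay pattern applied to a single incremental decoding step. It is illustrative only, not the code in this change: `decoder_step`, `DIM`, and `VOCAB` are placeholders standing in for the real decoder forward, which additionally keeps its incremental (k/v) state in fixed, pre-allocated buffers so the same captured graph can be replayed at every position up to max_seq_len.

```python
# Sketch only: toy stand-in for one incremental decoding step, captured once
# and replayed per generated token. Placeholder names, not the PR's actual code.
import torch

DIM, VOCAB = 1024, 50272  # hypothetical model dimensions

# Stand-in for the per-token decoder forward: last-token embedding -> logits.
decoder_step = torch.nn.Sequential(
    torch.nn.Linear(DIM, DIM), torch.nn.GELU(), torch.nn.Linear(DIM, VOCAB)
).half().cuda().eval()

# Static input buffer: graph replay reuses fixed memory addresses, so new
# inputs must be copied into this tensor rather than passed as fresh tensors.
static_inp = torch.zeros(1, DIM, dtype=torch.half, device="cuda")

# Warm up on a side stream so capture doesn't record one-time allocator work.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        decoder_step(static_inp)
torch.cuda.current_stream().wait_stream(s)

# Capture one decode step into a CUDA graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_out = decoder_step(static_inp)

def decode_one_token(token_embedding: torch.Tensor) -> torch.Tensor:
    """Copy the new input into the static buffer, replay the graph, read logits."""
    static_inp.copy_(token_embedding)
    graph.replay()  # the whole step launches as a single graph launch
    return static_out.clone()
```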
My guess is the savings will be larger for 16x MP serving and for any of our smaller models.
If I create a bunch more CUDA graphs and keep all of them in memory, we can get it down further, as follows: