Created by: ngoyal2707
Working version of CUDA graphs with incremental decoding. Currently, at n_best=1, even with a single CUDA graph of max_seq_len (2048), I see the following improvement for 175B on 8x Azure A100:
per-token latency, baseline: ~95 ms
per-token latency, this change: ~66 ms
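For readers unfamiliar with the mechanics, here is a minimal sketch of the standard PyTorch capture/replay pattern applied to a single incremental decoding step. It is illustrative only, not the code in this change: `decoder_step`, `DIM`, and `VOCAB` are placeholders standing in for the real decoder forward, which additionally keeps its incremental (k/v) state in fixed, pre-allocated buffers so the same captured graph can be replayed at every position up to max_seq_len.

```python
# Sketch only: toy stand-in for one incremental decoding step, captured once
# and replayed per generated token. Placeholder names, not the PR's actual code.
import torch

DIM, VOCAB = 1024, 50272  # hypothetical model dimensions

# Stand-in for the per-token decoder forward: last-token embedding -> logits.
decoder_step = torch.nn.Sequential(
    torch.nn.Linear(DIM, DIM), torch.nn.GELU(), torch.nn.Linear(DIM, VOCAB)
).half().cuda().eval()

# Static input buffer: graph replay reuses fixed memory addresses, so new
# inputs must be copied into this tensor rather than passed as fresh tensors.
static_inp = torch.zeros(1, DIM, dtype=torch.half, device="cuda")

# Warm up on a side stream so capture doesn't record one-time allocator work.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        decoder_step(static_inp)
torch.cuda.current_stream().wait_stream(s)

# Capture one decode step into a CUDA graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_out = decoder_step(static_inp)

def decode_one_token(token_embedding: torch.Tensor) -> torch.Tensor:
    """Copy the new input into the static buffer, replay the graph, read logits."""
    static_inp.copy_(token_embedding)
    graph.replay()  # the whole step launches as a single graph launch
    return static_out.clone()
```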
My guess is the savings will be larger for 16x MP serving and for any of our smaller models.
If I create a bunch more CUDA graphs and keep all of them in memory, we can get it down further, as follows: