Created by: KUNAL1612
Patch Description: While evaluating some models, I observed that the torch.cat operation at L732 used 11.98 GB of CPU memory. A debugger confirmed that this was because all tensors in the shards_to_load function were on the CPU. Because the concatenation ran on the CPU, model loading through this function was slower than it needed to be. This patch speeds it up.
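For illustration only, here is a minimal sketch of the general idea, assuming the speedup comes from moving each shard to the target device before concatenating, so that torch.cat does not allocate the merged tensor in CPU memory. The helper name `concat_shards_on_device` and the toy shard shapes are hypothetical and are not taken from the patched code.

```python
import torch


def concat_shards_on_device(shards, device, dim=0):
    """Hypothetical helper: move each shard to `device`, then concatenate there.

    Concatenating after the move means the merged tensor is allocated on
    `device` rather than in CPU memory.
    """
    moved = [shard.to(device) for shard in shards]
    return torch.cat(moved, dim=dim)


# Fall back to CPU so the sketch still runs on machines without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-ins for the tensors that arrive on the CPU.
shards = [torch.ones(2, 3), torch.zeros(4, 3)]
merged = concat_shards_on_device(shards, device)
```

When a GPU is available, only the individual shards ever live in CPU memory here; the concatenated result is created on the device.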
Testing steps: Inspected memory traces generated with the profiler. After the change, model loading time through this function dropped by 10-15% under the same baseline conditions.