Created by: zdevito
Patch Description: Introduce machinery to skip ahead in the dataset without having to re-tokenize or re-read the files in the dataset.
Measuring with a separate benchmark script in the internal repo indicates this reduces the 'fast-forward' stage of the data loader from ~20 minutes to 14 seconds (around a 70x speedup).
This works by storing a cache from document idx -> number of tokens, which can be saved in a snapshot as an array of numbers. When the token count is known, the DeferredDataset creates DeferredTensor objects that know their size, how to compute their values if needed, and how to generate new tensors via slicing and concatenation. Another object, SkipDeferredDataset, skips the first to_skip elements without ever computing the DeferredTensors' values, bypassing tokenization while keeping the state of the data loader (e.g. the shuffle buffer) exactly the same as if it had actually been running.