Created by: zdevito
Patch Description: Introduce machinery to skip ahead in the dataset without having to re-tokenize or re-read the files in the dataset.
Measuring with a separate benchmark script in the internal repo indicates this reduces the 'fast-forward' stage of the data loader from ~20 minutes to 14 seconds (around a 70x speedup).
This works by storing a cache from document idx -> number of tokens, which can be saved in a snapshot as an array of numbers. When the token count is known, the DeferredDataset creates DeferredTensor objects that know their size, how to compute their values if needed, and how to generate new tensors via slicing and concatenation. Another object, SkipDeferredDataset, skips the first to_skip elements without ever computing the DeferredTensors' values, bypassing tokenization while keeping the state of the data loader (e.g. the shuffle buffer) exactly the same as if it had actually been running.