Created by: zdevito
The fast-forwarding process added with DeferredTensor avoids re-tokenizing documents that have already been read, but the code that simulates the shuffle buffer and sequence construction can still take ~7 minutes to execute on full-size shards.
This patch optimizes the hot path of the skipping process, making it roughly 5-9x faster than the current state and roughly 500-900x faster than the original OPT method, which should bring the fast-forward time to under a minute.
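For context, here is a minimal sketch of what that simulation looks like; all names and details here are assumptions for illustration, not the patch's actual API. The point is that the skip loop replays the shuffle-buffer and sequence-packing decisions using only precomputed per-document token counts, never the tokens themselves:

```python
import numpy as np

def fast_forward(doc_lengths, seq_len, seed, buffer_size, num_to_skip):
    """Replay shuffle + packing decisions using only document lengths."""
    rng = np.random.RandomState(seed)
    docs = iter(doc_lengths)
    # The shuffle buffer holds token counts, not tokenized documents.
    buffer = [next(docs) for _ in range(buffer_size)]
    skipped = 0
    tokens_in_seq = 0
    while skipped < num_to_skip:
        slot = rng.randint(buffer_size)  # one RNG call per document: a hot spot
        length = buffer[slot]
        buffer[slot] = next(docs, 0)     # 0 marks an exhausted stream (handling elided)
        tokens_in_seq += length
        # Every seq_len accumulated tokens corresponds to one emitted sequence.
        while tokens_in_seq >= seq_len:
            tokens_in_seq -= seq_len
            skipped += 1
    return buffer, tokens_in_seq  # state needed to resume real iteration
```

The per-document `rng.randint` call in this loop is exactly the kind of per-call overhead the blocked generation described below removes.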
- Merges the shuffling of documents and the construction of sequences from those documents into one object (DocumentToSequenceDataset). This allows all the deferred-execution logic to live in one file rather than having to expose a DeferredTensor object whose construction is slow.
- Uses coarse-grained per-worker locking rather than atomics, so we no longer need C extensions (see the locking sketch after this list).
- Performs blocked generation of random numbers from numpy, since the per-call overhead of requesting a single number is high (see the blocked-RNG sketch after this list).
- The merge of the datasets and the change to random-number generation are done in such a way that the random behavior of the merged object matches that of the previous code, so a checkpoint created by an older version of the code can safely be loaded by this one.
- I also manually checked a requeue after checkpointing to verify that docsperex and loss match before and after the checkpoint load.
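Here is a hedged sketch of the coarse-grained per-worker locking; the class, fields, and call sites are assumptions for illustration, not the patch's API. The idea is that each dataloader worker takes one ordinary threading.Lock around its whole state update, instead of relying on lock-free atomic increments that required a C extension:

```python
import threading

class WorkerProgress:
    """Per-worker progress counters guarded by one coarse lock each."""

    def __init__(self, num_workers: int):
        self._locks = [threading.Lock() for _ in range(num_workers)]
        self._tokens_seen = [0] * num_workers

    def advance(self, worker_id: int, n_tokens: int) -> None:
        # One lock acquisition covers the whole update; updates are rare
        # relative to the work between them, so contention is negligible.
        with self._locks[worker_id]:
            self._tokens_seen[worker_id] += n_tokens

    def snapshot(self) -> list:
        # Acquire all locks to read a consistent view (e.g. for checkpointing).
        for lock in self._locks:
            lock.acquire()
        try:
            return list(self._tokens_seen)
        finally:
            for lock in self._locks:
                lock.release()
```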
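And a hedged sketch of the blocked random-number generation; the block size and names are assumptions. It relies on numpy's legacy RandomState consuming its underlying stream the same way for `randint(n, size=k)` as for k single `randint(n)` calls, which is the kind of property needed to keep the random behavior, and therefore old checkpoints, compatible:

```python
import numpy as np

class BlockedRandint:
    """Draw bounded random ints in blocks to amortize numpy call overhead."""

    def __init__(self, seed: int, bound: int, block_size: int = 1024):
        self._rng = np.random.RandomState(seed)
        self._bound = bound
        self._block_size = block_size
        self._buf = self._rng.randint(bound, size=block_size)
        self._pos = 0

    def __call__(self) -> int:
        # Refill the buffer only once every block_size draws.
        if self._pos == len(self._buf):
            self._buf = self._rng.randint(self._bound, size=self._block_size)
            self._pos = 0
        value = int(self._buf[self._pos])
        self._pos += 1
        return value
```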