Add support for source target style datasets to DocumentToSequence (!398) · Merge requests · Administrator / metaseq

Merged Administrator requested to merge github/fork/zdevito/dataloader4 into main Oct 07, 2022

Created by: zdevito

Training runs using StreamingSrcTgtDataset were failing because that dataset did not perform the same token-length caching as DocumentToSequence.

StreamingSrcTgtDataset is really just another instance of StreamingTokenBlockDataset where the blocks are split into a tuple (src, target). To avoid duplication, this PR adds support for this case directly to DocumentToSequence, along with a test to verify that it replicates the old behavior.
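
To illustrate the idea of treating the source/target case as a mode of the block packer rather than a separate dataset, here is a minimal standalone sketch. It is not the metaseq implementation: the function name `documents_to_blocks`, the `source_target` flag, and the choice of making the target the source shifted by one token are all assumptions made for illustration, and the token-length caching mentioned above is left out.

```python
from typing import Iterable, Iterator, List, Tuple, Union

Block = List[int]
SrcTgt = Tuple[List[int], List[int]]


def documents_to_blocks(
    docs: Iterable[List[int]],
    block_size: int,
    source_target: bool = False,
) -> Iterator[Union[Block, SrcTgt]]:
    """Pack tokenized documents into fixed-size blocks (hypothetical helper).

    With source_target=True each block is split into a (src, target) pair.
    Assumption: target is the block shifted left by one token.
    """
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        # Emit every full block currently available in the buffer.
        while len(buffer) >= block_size:
            block, buffer = buffer[:block_size], buffer[block_size:]
            if source_target:
                yield block[:-1], block[1:]
            else:
                yield block
    # Any trailing partial block is dropped in this sketch.


docs = [[1, 2, 3], [4, 5, 6, 7]]
plain = list(documents_to_blocks(docs, block_size=4))
pairs = list(documents_to_blocks(docs, block_size=4, source_target=True))
```

Because both modes share the same packing loop, a regression test can check that the pairs are derived from exactly the same blocks as the plain mode, which is the kind of equivalence check the PR's test is described as performing.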