Created by: janEbert
Patch Description
This adds a preprocessing script. It is mostly a modified version of the BigScience/Megatron-DeepSpeed one, using metaseq
classes where possible. The exception is the tokenizer, which I couldn't figure out how to replace.
Testing steps
The preprocessed data works with https://github.com/chelseajohn/metaseq and https://github.com/chelseajohn/OPT-Code. Fixes #186 (closed).