Created by: sriniiyer
PR to control data proportions
PER BENCHMARK PROPORTIONS
--data-sampling-prob '{"flan": 0.35, "ni_v2": 0.24, "exmix": 0.03, "crossfit": 0.03, "pretrain": 0.0, "cot": 0.02, "t5": 0.03, "promptsource": 0.28, "unified_skg": 0.02}'
- You don't need to specify probs for all benchmarks. The code uniformly distributes the remaining prob mass among the benchmarks you left out. So, you can specify just --data-sampling-prob '{"pretrain": 0.0}'
- The logs print the proportions before and after, so you can double-check them.
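As a rough illustration of the "remaining prob mass" rule above, here is a minimal sketch. The function name fill_sampling_probs and the error handling are assumptions made for this example and are not taken from the PR's code.

```python
def fill_sampling_probs(specified, all_benchmarks):
    """Spread the unassigned probability mass uniformly over unspecified benchmarks.

    Hypothetical helper: the name and the validation are illustrative only.
    """
    unknown = set(specified) - set(all_benchmarks)
    if unknown:
        raise ValueError(f"unknown benchmarks: {sorted(unknown)}")

    remaining = 1.0 - sum(specified.values())
    if remaining < -1e-9:
        raise ValueError("specified probabilities sum to more than 1.0")

    # Benchmarks the user did not mention share the leftover mass equally.
    rest = [b for b in all_benchmarks if b not in specified]
    probs = dict(specified)
    for b in rest:
        probs[b] = max(remaining, 0.0) / len(rest)
    return probs


benchmarks = ["flan", "ni_v2", "exmix", "crossfit", "pretrain",
              "cot", "t5", "promptsource", "unified_skg"]
probs = fill_sampling_probs({"pretrain": 0.0}, benchmarks)
for name in benchmarks:
    print(f"{name}: {probs[name]:.3f}")
```

With only "pretrain" pinned to 0.0, the other eight benchmarks split the full mass equally in this sketch.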
EQUALIZE CLUSTER PROPORTIONS AFTER APPLYING BENCHMARK PROPS
--equalize-cluster-probs
This will equalize the prob of each cluster within a benchmark. Note that this needs cluster names to be included in the dataset names.
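One way to picture the equalization, as a sketch only: each benchmark's prob is split evenly across its clusters. The function name, the cluster names, and the explicit benchmark-to-cluster mapping are all hypothetical; how the real code extracts cluster names from dataset names is not shown here.

```python
def equalize_cluster_probs(benchmark_probs, clusters_by_benchmark):
    """Split each benchmark's probability evenly across its clusters.

    Hypothetical helper: returns a dict keyed by (benchmark, cluster).
    """
    cluster_probs = {}
    for benchmark, prob in benchmark_probs.items():
        clusters = clusters_by_benchmark.get(benchmark, [])
        if not clusters:
            # No cluster information: keep the benchmark's mass in one bucket.
            cluster_probs[(benchmark, None)] = prob
            continue
        for cluster in clusters:
            cluster_probs[(benchmark, cluster)] = prob / len(clusters)
    return cluster_probs


# Made-up benchmark props and cluster names, for illustration only.
benchmark_probs = {"flan": 0.6, "cot": 0.4}
clusters_by_benchmark = {
    "flan": ["qa", "nli", "summarization"],
    "cot": ["math", "commonsense"],
}
cluster_probs = equalize_cluster_probs(benchmark_probs, clusters_by_benchmark)
for key, value in cluster_probs.items():
    print(key, round(value, 3))
```

The benchmark-level totals are unchanged in this sketch; only the split inside each benchmark becomes uniform over clusters.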
PROVIDE PER BENCHMARK EPS
--caps '{"flan": 30000, "ni_v2": 5000, "exmix": 20000, "crossfit": 20000, "cot": 10000, "t5": 20000, "promptsource": 20000, "unified_skg": 20000}'
Instead of applying a uniform eps, this lets you provide a per-benchmark eps. If it's not provided for a benchmark, that benchmark falls back to the default eps, and if no default is specified, it falls back to the dataset length.
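The fallback order described above can be sketched as follows. The function resolve_cap and its parameter names are assumptions for illustration, not the PR's actual code.

```python
def resolve_cap(benchmark, dataset_len, caps, default_eps=None):
    """Pick the cap for a dataset: per-benchmark cap, then default eps, then dataset length.

    Hypothetical helper: names and signature are illustrative only.
    """
    if benchmark in caps:
        return caps[benchmark]
    if default_eps is not None:
        return default_eps
    return dataset_len


caps = {"flan": 30000, "ni_v2": 5000}

print(resolve_cap("flan", 100000, caps))                      # per-benchmark cap
print(resolve_cap("exmix", 50000, caps, default_eps=10000))   # default eps
print(resolve_cap("exmix", 50000, caps))                      # dataset length
```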