Created by: Xirider
Currently there is quite a long gap between the start of the first train_step and the first train_inner log update. This is due to the time it takes for the per-GPU processes to start, and for each of those processes to fill its streaming buffer of training examples. That makes it difficult to debug when something hangs or when there are issues with the data loaders.
To address this, I added logging at the following points (a minimal sketch of these log points follows the list):
- when each process starts filling its streaming buffer (this, together with starting the dataloader workers, takes the most time)
- the first batch on the first rank, printed before it enters the model (first point in https://github.com/facebookresearch/metaseq/issues/277)
- the start and finish of the first forward and backward pass
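
For reference, here is a minimal standalone sketch of roughly where these messages are emitted. The function names and the toy model are illustrative only, not the actual metaseq call sites (which, per the log below, live in document_to_sequence, the trainer, and the base task):

```python
import logging

import torch
import torch.distributed as dist
import torch.nn as nn

logger = logging.getLogger(__name__)


def log_buffer_fill_start() -> None:
    # Emitted once per data-loading worker when it begins filling its
    # streaming buffer (starting the workers plus this fill take the most time).
    logger.info("Begin filling streaming dataset buffer for each worker")


def first_step_with_logging(model: nn.Module, sample: dict) -> None:
    """Run one forward/backward pass with the new first-step log points."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank == 0:
        # Dump the first batch on the first rank before it enters the model.
        logger.info("First batch on first rank:\n%s", sample)

    logger.info("Starting first forward pass and waiting for dataloader in other ranks")
    loss = model(sample["net_input"]["src_tokens"].float()).sum()

    logger.info("Starting backward pass")
    loss.backward()
    logger.info("Finished first backward pass")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_buffer_fill_start()
    # Toy stand-in for the real model and streaming batch.
    toy_model = nn.Linear(4, 1)
    toy_sample = {"net_input": {"src_tokens": torch.randint(0, 10, (2, 4))}}
    first_step_with_logging(toy_model, toy_sample)
```
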
This is an example log:
2022-10-24 19:19:23 | INFO | metaseq.trainer | nvidia-smi stats: {'gpu_0_mem_used_gb': 16.06640625, 'gpu_1_mem_used_gb': 16.64453125, 'gpu_2_mem_used_gb': 16.62890625, 'gpu_3_mem_used_gb': 3.9892578125}
2022-10-24 19:19:23 | INFO | metaseq.utils | ***********************CUDA enviroments for all 4 workers***********************
2022-10-24 19:19:23 | INFO | metaseq.utils | rank 0: capabilities = 8.0 ; total memory = 39.409 GB ; name = NVIDIA A100-SXM4-40GB
2022-10-24 19:19:23 | INFO | metaseq.utils | rank 1: capabilities = 8.0 ; total memory = 39.409 GB ; name = NVIDIA A100-SXM4-40GB
2022-10-24 19:19:23 | INFO | metaseq.utils | rank 2: capabilities = 8.0 ; total memory = 39.409 GB ; name = NVIDIA A100-SXM4-40GB
2022-10-24 19:19:23 | INFO | metaseq.utils | rank 3: capabilities = 8.0 ; total memory = 39.409 GB ; name = NVIDIA A100-SXM4-40GB
2022-10-24 19:19:23 | INFO | metaseq.utils | ***********************CUDA enviroments for all 4 workers***********************
2022-10-24 19:19:23 | INFO | metaseq.cli.train | training on 4 devices (GPUs/TPUs)
2022-10-24 19:19:23 | INFO | metaseq.cli.train | max tokens per GPU = None and batch size per GPU = 32
2022-10-24 19:19:23 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2022-10-24 19:19:23 | INFO | metaseq.cli.train | nvidia-smi stats: {'gpu_0_mem_used_gb': 16.06640625, 'gpu_1_mem_used_gb': 16.64453125, 'gpu_2_mem_used_gb': 16.62890625, 'gpu_3_mem_used_gb': 3.9892578125}
2022-10-24 19:19:23 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2022-10-24 19:19:23 | INFO | metaseq.checkpoint_utils | attempting to load checkpoint from: ./test-checkpoint/train.8m.bm_none.gpt2.me_fp16.fsdp.zero2.relu.transformer_lm_megatron.nlay4.emb128.lrnpos.0emb_scale.tps2048.adam.b2_0.95.eps1e-08.cl1.0.lr0.001.endlr0.0001.wu2861.dr0.1.atdr0.1.0emb_dr.wd0.1.ms32.uf1.mu1000.s1.me100.ngpu4/checkpoint_last-model_part-0-shard0.pt
2022-10-24 19:19:23 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2022-10-24 19:19:23 | WARNING | metaseq.checkpoint_utils | Proceeding without metaseq-internal installed! Please check if you need this!
2022-10-24 19:19:23 | INFO | metaseq.trainer | No existing checkpoint found ./test-checkpoint/train.8m.bm_none.gpt2.me_fp16.fsdp.zero2.relu.transformer_lm_megatron.nlay4.emb128.lrnpos.0emb_scale.tps2048.adam.b2_0.95.eps1e-08.cl1.0.lr0.001.endlr0.0001.wu2861.dr0.1.atdr0.1.0emb_dr.wd0.1.ms32.uf1.mu1000.s1.me100.ngpu4/checkpoint_last-model_part-0-shard0.pt
2022-10-24 19:19:23 | INFO | metaseq.trainer | loading train data for epoch 1
2022-10-24 19:19:23 | INFO | metaseq.data.jsonl_dataset | Loading up cache: /data/home/peteralbert/repos/sor/metaseq/gpu_tests/circleci/corpus_dedup_10_10_1_0.05_exp29/train/00/BookCorpusFair.jsonl.fairseq.idx.npy
2022-10-24 19:19:23 | INFO | metaseq.tasks.streaming_language_modeling | setting shuffle buffer size to 640
2022-10-24 19:19:23 | INFO | metaseq.trainer | finished creating batch iterator
2022-10-24 19:19:23 | INFO | metaseq.optim.adam | using FusedAdam
2022-10-24 19:19:23 | INFO | metaseq.optim.dynamic_loss_scaler | *** SCALE_WINDOW: 256, loss scale: 4 ***
/fsx/users/peteralbert/conda/envs/sor/lib/python3.8/site-packages/torch/utils/data/dataloader.py:487: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 1, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
/fsx/users/peteralbert/conda/envs/sor/lib/python3.8/site-packages/torch/utils/data/dataloader.py:487: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 1, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
/fsx/users/peteralbert/conda/envs/sor/lib/python3.8/site-packages/torch/utils/data/dataloader.py:487: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 1, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
/fsx/users/peteralbert/conda/envs/sor/lib/python3.8/site-packages/torch/utils/data/dataloader.py:487: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 1, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
2022-10-24 19:19:24 | INFO | metaseq.data.document_to_sequence | Begin filling streaming dataset buffer for each worker
2022-10-24 19:19:24 | INFO | metaseq.trainer | begin training epoch 1
2022-10-24 19:19:24 | INFO | metaseq.cli.train | Start iterating over samples
2022-10-24 19:19:36 | INFO | metaseq.trainer | First batch on first rank:
{'id': tensor([192, 400, 417, 370, 468, 483, 259, 407, 405, 369, 463, 239, 434, 253,
241, 449, 342, 132, 488, 17, 401, 63, 225, 37]), 'net_input': {'src_tokens': tensor([[ 212, 282, 449, ..., 6316, 37653, 1343],
[ 181, 506, 35905, ..., 10668, 2348, 856],
[40577, 52, 642, ..., 605, 939, 219],
...,
[ 992, 571, 4122, ..., 9169, 295, 11483],
[ 267, 2231, 29, ..., 784, 338, 329],
[ 821, 46106, 462, ..., 992, 428, 4791]])}, 'target': tensor([[ 282, 449, 8628, ..., 37653, 1343, 605],
[ 506, 35905, 784, ..., 2348, 856, 438],
[ 52, 642, 1343, ..., 939, 219, 3662],
...,
[ 571, 4122, 910, ..., 295, 11483, 417],
[ 2231, 29, 329, ..., 338, 329, 417],
[46106, 462, 3988, ..., 428, 4791, 30461]]), 'nsentences': 32, 'ntokens': tensor(65536)}
2022-10-24 19:19:36 | INFO | metaseq.tasks.base_task | Starting first forward pass and waiting for dataloader in other ranks
2022-10-24 19:20:32 | INFO | metaseq.data.document_to_sequence | Begin filling streaming dataset buffer for each worker
2022-10-24 19:20:32 | INFO | metaseq.data.document_to_sequence | Begin filling streaming dataset buffer for each worker
2022-10-24 19:20:32 | INFO | metaseq.data.document_to_sequence | Begin filling streaming dataset buffer for each worker
2022-10-24 19:21:07 | INFO | metaseq.tasks.base_task | Starting backward pass
2022-10-24 19:21:07 | INFO | metaseq.tasks.base_task | Finished first backward pass
2022-10-24 19:21:08 | INFO | metaseq.logging.progress_bar.base_progress_bar | Setting tensorboard directory to ./test-checkpoint/train.8m.bm_none.gpt2.me_fp16.fsdp.zero2.relu.transformer_lm_megatron.nlay4.emb128.lrnpos.0emb_scale.tps2048.adam.b2_0.95.eps1e-08.cl1.0.lr0.001.endlr0.0001.wu2861.dr0.1.atdr0.1.0emb_dr.wd0.1.ms32.uf1.mu1000.s1.me100.ngpu4/train_inner
2022-10-24 19:21:08 | INFO | train_inner | {"epoch": 1, "actv_norm": "0.109", "pos_norm": "0.068", "tok_norm": "0.096", "emb_norm": "0", "docsperex": "1.78", "loss": "15.623", "ppl": "50462.5", "wps": "0", "ups": "0", "wpb": "131072", "bsz": "64", "num_updates": "1", "lr": "3.49528e-07", "gnorm": "2.969", "clip": "100", "loss_scale": "4", "scale_window": "256", "train_wall": "92", "cuda_gb_allocated": "9.3", "cuda_gb_reserved": "10.6", "cuda_gb_free": "30.1", "wall": "105"}
2022-10-24 19:21:10 | INFO | train_inner | {"epoch": 1, "actv_norm": "0.109", "pos_norm": "0.068", "tok_norm": "0.096", "emb_norm": "0", "docsperex": "1.78", "loss": "15.623", "ppl": "50464.2", "wps": "91888.3", "ups": "0.69", "wpb": "131072", "bsz": "64", "num_updates": "2", "lr": "6.99056e-07", "gnorm": "2.968", "clip": "100", "loss_scale": "4", "scale_window": "256", "train_wall": "1", "cuda_gb_allocated": "9.3", "cuda_gb_reserved": "10.6", "cuda_gb_free": "30.1", "wall": "107"}
2022-10-24 19:21:12 | INFO | train_inner | {"epoch": 1, "actv_norm": "0.109", "pos_norm": "0.068", "tok_norm": "0.096", "emb_norm": "0", "docsperex": "1.78", "loss": "15.623", "ppl": "50462.4", "wps": "94660.1", "ups": "0.72", "wpb": "131072", "bsz": "64", "num_updates": "3", "lr": "1.04858e-06", "gnorm": "2.969", "clip": "100", "loss_scale": "4", "scale_window": "256", "train_wall": "1", "cuda_gb_allocated": "9.3", "cuda_gb_reserved": "10.6", "cuda_gb_free": "30.1", "wall": "109"}
2022-10-24 19:21:13 | INFO | train_inner | {"epoch": 1, "actv_norm": "0.109", "pos_norm": "0.068", "tok_norm": "0.096", "emb_norm": "0", "docsperex": "1.78", "loss": "15.623", "ppl": "50448.4", "wps": "94391.2", "ups": "0.72", "wpb": "131072", "bsz": "64", "num_updates": "4", "lr": "1.39811e-06", "gnorm": "2.968", "clip": "100", "loss_scale": "4", "scale_window": "256", "train_wall": "1", "cuda_gb_allocated": "9.3", "cuda_gb_reserved": "10.6", "cuda_gb_free": "30.1", "wall": "110"}
2022-10-24 19:21:15 | INFO | train_inner | {"epoch": 1, "actv_norm": "0.109", "pos_norm": "0.068", "tok_norm": "0.096", "emb_norm": "0", "docsperex": "1.77", "loss": "15.622", "ppl": "50418.5", "wps": "95332.7", "ups": "0.73", "wpb": "131072", "bsz": "64", "num_updates": "5", "lr": "1.74764e-06", "gnorm": "2.98", "clip": "100", "loss_scale": "4", "scale_window": "256", "train_wall": "1", "cuda_gb_allocated": "9.3", "cuda_gb_reserved": "10.6", "cuda_gb_free": "30.1", "wall": "112"}
2022-10-24 19:21:17 | INFO | train_inner | {"epoch": 1, "actv_norm": "0.109", "pos_norm": "0.068", "tok_norm": "0.096", "emb_norm": "0", "docsperex": "1.77", "loss": "15.621", "ppl": "50387.8", "wps": "96668.7", "ups": "0.74", "wpb": "131072", "bsz": "64", "num_updates": "6", "lr": "2.09717e-06", "gnorm": "2.98", "clip": "100", "loss_scale": "4", "scale_window": "256", "train_wall": "1", "cuda_gb_allocated": "9.3", "cuda_gb_reserved": "10.6", "cuda_gb_free": "30.1", "wall": "114"}