|
Model parameters: d_model 768 ffw_size 3072 kv_size 64 n_heads 12 n_layers 15 |
|
Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 15 --hidden-size 768 --num-attention-heads 12 --kv-channels 64 --ffn-hidden-size 3072 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 4 --global-batch-size 256 --train-samples 1 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-146m1b5100mdedupval --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 1 --lr-warmup-samples 0 --clip-grad 1.0 --weight-decay 1e-1 --no-load-optim --reset-progress --override-lr-scheduler --log-interval 10 --save-interval 1000 --eval-interval 1 --eval-iters 100 --eval-only true --tensorboard-dir tensorboard_146m1b5100mdedupval --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_146m1b5100mdedup --load checkpoints_146m1b5100mdedup --train-weighted-split-paths-path train14b.txt --valid-weighted-split-paths-path val.txt --data-impl mmap --deepspeed --deepspeed_config ds_configs/3406547.json --zero-stage 0 |
|
START 3406547: Mon 24 Apr 2023 12:05:08 PM EEST |
|
0: |
|
0: |
|
0: ======================= ROCm System Management Interface ======================= |
|
0: ================================= Concise Info ================================= |
|
0: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% |
|
0: 0 48.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
0: 1 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
0: 2 36.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
0: 3 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
0: 4 48.0c 85.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
0: 5 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
0: 6 41.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
0: 7 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
0: ================================================================================ |
|
0: ============================= End of ROCm SMI Log ============================== |
|
7: |
|
7: |
|
7: ======================= ROCm System Management Interface ======================= |
|
7: ================================= Concise Info ================================= |
|
7: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% |
|
7: 0 49.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
7: 1 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
7: 2 43.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
7: 3 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
7: 4 45.0c 92.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
7: 5 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
7: 6 41.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
7: 7 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
7: ================================================================================ |
|
7: ============================= End of ROCm SMI Log ============================== |
|
3: |
|
3: |
|
3: ======================= ROCm System Management Interface ======================= |
|
3: ================================= Concise Info ================================= |
|
3: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% |
|
3: 0 47.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
3: 1 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
3: 2 42.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
3: 3 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
3: 4 42.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
3: 5 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
3: 6 43.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
3: 7 41.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
3: ================================================================================ |
|
3: ============================= End of ROCm SMI Log ============================== |
|
4: |
|
4: |
|
4: ======================= ROCm System Management Interface ======================= |
|
4: ================================= Concise Info ================================= |
|
4: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% |
|
4: 0 47.0c 92.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
4: 1 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
4: 2 40.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
4: 3 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
4: 4 42.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
4: 5 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
4: 6 41.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
4: 7 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
4: ================================================================================ |
|
4: ============================= End of ROCm SMI Log ============================== |
|
6: |
|
6: |
|
6: ======================= ROCm System Management Interface ======================= |
|
6: ================================= Concise Info ================================= |
|
6: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% |
|
6: 0 45.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
6: 1 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
6: 2 40.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
6: 3 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
6: 4 42.0c 93.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
6: 5 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
6: 6 37.0c 95.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
6: 7 41.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
6: ================================================================================ |
|
6: ============================= End of ROCm SMI Log ============================== |
|
1: |
|
1: |
|
1: ======================= ROCm System Management Interface ======================= |
|
1: ================================= Concise Info ================================= |
|
1: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% |
|
1: 0 40.0c 93.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
1: 1 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
1: 2 38.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
1: 3 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
1: 4 45.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
1: 5 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
1: 6 43.0c 82.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
1: 7 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
1: ================================================================================ |
|
1: ============================= End of ROCm SMI Log ============================== |
|
5: |
|
5: |
|
5: ======================= ROCm System Management Interface ======================= |
|
5: ================================= Concise Info ================================= |
|
5: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% |
|
5: 0 47.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
5: 1 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
5: 2 38.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
5: 3 50.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
5: 4 44.0c 85.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
5: 5 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
5: 6 41.0c 84.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
5: 7 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
5: ================================================================================ |
|
5: ============================= End of ROCm SMI Log ============================== |
|
2: |
|
2: |
|
2: ======================= ROCm System Management Interface ======================= |
|
2: ================================= Concise Info ================================= |
|
2: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% |
|
2: 0 44.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
2: 1 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
2: 2 43.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
2: 3 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
2: 4 45.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
2: 5 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
2: 6 43.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% |
|
2: 7 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% |
|
2: ================================================================================ |
|
2: ============================= End of ROCm SMI Log ============================== |
|
0: Launching on nid006908 (0/8), master nid006908 port 9999, GPUs 8, CUDA: True |
|
1: Launching on nid006909 (1/8), master nid006908 port 9999, GPUs 8, CUDA: True |
|
2: Launching on nid006910 (2/8), master nid006908 port 9999, GPUs 8, CUDA: True |
|
7: Launching on nid006915 (7/8), master nid006908 port 9999, GPUs 8, CUDA: True |
|
6: Launching on nid006914 (6/8), master nid006908 port 9999, GPUs 8, CUDA: True |
|
3: Launching on nid006911 (3/8), master nid006908 port 9999, GPUs 8, CUDA: True |
|
4: Launching on nid006912 (4/8), master nid006908 port 9999, GPUs 8, CUDA: True |
|
5: Launching on nid006913 (5/8), master nid006908 port 9999, GPUs 8, CUDA: True |
|
|