Model parameters: d_model 768 ffw_size 3072 kv_size 64 n_heads 12 n_layers 15
Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 15 --hidden-size 768 --num-attention-heads 12 --kv-channels 64 --ffn-hidden-size 3072 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 4 --global-batch-size 256 --train-samples 1 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-146m1b5100mdedupval --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 1 --lr-warmup-samples 0 --clip-grad 1.0 --weight-decay 1e-1 --no-load-optim --reset-progress --override-lr-scheduler --log-interval 10 --save-interval 1000 --eval-interval 1 --eval-iters 100 --eval-only true --tensorboard-dir tensorboard_146m1b5100mdedupval --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_146m1b5100mdedup --load checkpoints_146m1b5100mdedup --train-weighted-split-paths-path train14b.txt --valid-weighted-split-paths-path val.txt --data-impl mmap --deepspeed --deepspeed_config ds_configs/3406547.json --zero-stage 0
START 3406547: Mon 24 Apr 2023 12:05:08 PM EEST
0:
0:
0: ======================= ROCm System Management Interface =======================
0: ================================= Concise Info =================================
0: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0: 0 48.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
0: 1 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
0: 2 36.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
0: 3 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
0: 4 48.0c 85.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
0: 5 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
0: 6 41.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
0: 7 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
0: ================================================================================
0: ============================= End of ROCm SMI Log ==============================
7:
7:
7: ======================= ROCm System Management Interface =======================
7: ================================= Concise Info =================================
7: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
7: 0 49.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
7: 1 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
7: 2 43.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
7: 3 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
7: 4 45.0c 92.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
7: 5 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
7: 6 41.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
7: 7 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
7: ================================================================================
7: ============================= End of ROCm SMI Log ==============================
3:
3:
3: ======================= ROCm System Management Interface =======================
3: ================================= Concise Info =================================
3: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
3: 0 47.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
3: 1 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
3: 2 42.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
3: 3 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
3: 4 42.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
3: 5 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
3: 6 43.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
3: 7 41.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
3: ================================================================================
3: ============================= End of ROCm SMI Log ==============================
4:
4:
4: ======================= ROCm System Management Interface =======================
4: ================================= Concise Info =================================
4: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
4: 0 47.0c 92.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
4: 1 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
4: 2 40.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
4: 3 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
4: 4 42.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
4: 5 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
4: 6 41.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
4: 7 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
4: ================================================================================
4: ============================= End of ROCm SMI Log ==============================
6:
6:
6: ======================= ROCm System Management Interface =======================
6: ================================= Concise Info =================================
6: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
6: 0 45.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
6: 1 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
6: 2 40.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
6: 3 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
6: 4 42.0c 93.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
6: 5 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
6: 6 37.0c 95.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
6: 7 41.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
6: ================================================================================
6: ============================= End of ROCm SMI Log ==============================
1:
1:
1: ======================= ROCm System Management Interface =======================
1: ================================= Concise Info =================================
1: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
1: 0 40.0c 93.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
1: 1 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
1: 2 38.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
1: 3 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
1: 4 45.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
1: 5 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
1: 6 43.0c 82.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
1: 7 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
1: ================================================================================
1: ============================= End of ROCm SMI Log ==============================
5:
5:
5: ======================= ROCm System Management Interface =======================
5: ================================= Concise Info =================================
5: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
5: 0 47.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
5: 1 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
5: 2 38.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
5: 3 50.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
5: 4 44.0c 85.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
5: 5 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
5: 6 41.0c 84.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
5: 7 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
5: ================================================================================
5: ============================= End of ROCm SMI Log ==============================
2:
2:
2: ======================= ROCm System Management Interface =======================
2: ================================= Concise Info =================================
2: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
2: 0 44.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
2: 1 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
2: 2 43.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
2: 3 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
2: 4 45.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
2: 5 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
2: 6 43.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
2: 7 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
2: ================================================================================
2: ============================= End of ROCm SMI Log ==============================
0: Launching on nid006908 (0/8), master nid006908 port 9999, GPUs 8, CUDA: True
1: Launching on nid006909 (1/8), master nid006908 port 9999, GPUs 8, CUDA: True
2: Launching on nid006910 (2/8), master nid006908 port 9999, GPUs 8, CUDA: True
7: Launching on nid006915 (7/8), master nid006908 port 9999, GPUs 8, CUDA: True
6: Launching on nid006914 (6/8), master nid006908 port 9999, GPUs 8, CUDA: True
3: Launching on nid006911 (3/8), master nid006908 port 9999, GPUs 8, CUDA: True
4: Launching on nid006912 (4/8), master nid006908 port 9999, GPUs 8, CUDA: True
5: Launching on nid006913 (5/8), master nid006908 port 9999, GPUs 8, CUDA: True