Model parameters: d_model 768 ffw_size 3072 kv_size 64 n_heads 12 n_layers 15
Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 15 --hidden-size 768 --num-attention-heads 12 --kv-channels 64 --ffn-hidden-size 3072 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 4 --global-batch-size 256 --train-samples 1 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-146m1b5100mdedupval --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 1 --lr-warmup-samples 0 --clip-grad 1.0 --weight-decay 1e-1 --no-load-optim --reset-progress --override-lr-scheduler --log-interval 10 --save-interval 1000 --eval-interval 1 --eval-iters 100 --eval-only true --tensorboard-dir tensorboard_146m1b5100mdedupval --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_146m1b5100mdedup --load checkpoints_146m1b5100mdedup --train-weighted-split-paths-path train14b.txt --valid-weighted-split-paths-path val.txt --data-impl mmap --deepspeed --deepspeed_config ds_configs/3406547.json --zero-stage 0
START 3406547: Mon 24 Apr 2023 12:05:08 PM EEST
0: 
0: 
0: ======================= ROCm System Management Interface =======================
0: ================================= Concise Info =================================
0: GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
0: 0    48.0c  89.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
0: 1    47.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
0: 2    36.0c  87.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
0: 3    46.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
0: 4    48.0c  85.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
0: 5    48.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
0: 6    41.0c  89.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
0: 7    44.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
0: ================================================================================
0: ============================= End of ROCm SMI Log ==============================
7: 
7: 
7: ======================= ROCm System Management Interface =======================
7: ================================= Concise Info =================================
7: GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
7: 0    49.0c  89.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
7: 1    49.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
7: 2    43.0c  86.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
7: 3    45.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
7: 4    45.0c  92.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
7: 5    47.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
7: 6    41.0c  87.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
7: 7    43.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
7: ================================================================================
7: ============================= End of ROCm SMI Log ==============================
3: 
3: 
3: ======================= ROCm System Management Interface =======================
3: ================================= Concise Info =================================
3: GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
3: 0    47.0c  86.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
3: 1    47.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
3: 2    42.0c  91.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
3: 3    45.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
3: 4    42.0c  89.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
3: 5    49.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
3: 6    43.0c  88.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
3: 7    41.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
3: ================================================================================
3: ============================= End of ROCm SMI Log ==============================
4: 
4: 
4: ======================= ROCm System Management Interface =======================
4: ================================= Concise Info =================================
4: GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
4: 0    47.0c  92.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
4: 1    48.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
4: 2    40.0c  87.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
4: 3    44.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
4: 4    42.0c  88.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
4: 5    42.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
4: 6    41.0c  89.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
4: 7    43.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
4: ================================================================================
4: ============================= End of ROCm SMI Log ==============================
6: 
6: 
6: ======================= ROCm System Management Interface =======================
6: ================================= Concise Info =================================
6: GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
6: 0    45.0c  91.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
6: 1    46.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
6: 2    40.0c  88.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
6: 3    49.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
6: 4    42.0c  93.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
6: 5    44.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
6: 6    37.0c  95.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
6: 7    41.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
6: ================================================================================
6: ============================= End of ROCm SMI Log ==============================
1: 
1: 
1: ======================= ROCm System Management Interface =======================
1: ================================= Concise Info =================================
1: GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
1: 0    40.0c  93.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
1: 1    48.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
1: 2    38.0c  89.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
1: 3    47.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
1: 4    45.0c  86.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
1: 5    47.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
1: 6    43.0c  82.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
1: 7    47.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
1: ================================================================================
1: ============================= End of ROCm SMI Log ==============================
5: 
5: 
5: ======================= ROCm System Management Interface =======================
5: ================================= Concise Info =================================
5: GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
5: 0    47.0c  90.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
5: 1    48.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
5: 2    38.0c  88.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
5: 3    50.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
5: 4    44.0c  85.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
5: 5    45.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
5: 6    41.0c  84.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
5: 7    46.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
5: ================================================================================
5: ============================= End of ROCm SMI Log ==============================
2: 
2: 
2: ======================= ROCm System Management Interface =======================
2: ================================= Concise Info =================================
2: GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
2: 0    44.0c  86.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
2: 1    48.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
2: 2    43.0c  91.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
2: 3    49.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
2: 4    45.0c  86.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
2: 5    42.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
2: 6    43.0c  90.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
2: 7    45.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
2: ================================================================================
2: ============================= End of ROCm SMI Log ==============================
0: Launching on nid006908 (0/8), master nid006908 port 9999, GPUs 8, CUDA: True
1: Launching on nid006909 (1/8), master nid006908 port 9999, GPUs 8, CUDA: True
2: Launching on nid006910 (2/8), master nid006908 port 9999, GPUs 8, CUDA: True
7: Launching on nid006915 (7/8), master nid006908 port 9999, GPUs 8, CUDA: True
6: Launching on nid006914 (6/8), master nid006908 port 9999, GPUs 8, CUDA: True
3: Launching on nid006911 (3/8), master nid006908 port 9999, GPUs 8, CUDA: True
4: Launching on nid006912 (4/8), master nid006908 port 9999, GPUs 8, CUDA: True
5: Launching on nid006913 (5/8), master nid006908 port 9999, GPUs 8, CUDA: True