Training in progress, step 2000
pytorch_model.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:ac8d9fd9f9d28b93847de52a3619b51013265161572e174b1835ef9602818730
 size 483536061
run.log
CHANGED
@@ -675,3 +675,252 @@ Rank: 0 partition count [1] and sizes[(241734912, False)]
|
|
675 |
[2022-12-18 13:48:08,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-1000/global_step1000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
|
676 |
[2022-12-18 13:48:08,208] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-1000/global_step1000/zero_pp_rank_0_mp_rank_00_optim_states.pt
|
677 |
[2022-12-18 13:48:08,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now!
|
678 |
+
[2022-12-18 13:51:32,738] [INFO] [logging.py:68:log_dist] [Rank 0] step=1010, skipped=4, lr=[8.877777777777779e-06], mom=[[0.9, 0.999]]
|
679 |
+
[2022-12-18 13:51:32,740] [INFO] [timer.py:196:stop] epoch=0/micro_step=1010/global_step=1010, RunningAvgSamplesPerSec=17.570497186974222, CurrSamplesPerSec=17.771050915545032, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
680 |
+
[2022-12-18 13:54:24,427] [INFO] [logging.py:68:log_dist] [Rank 0] step=1020, skipped=4, lr=[8.855555555555556e-06], mom=[[0.9, 0.999]]
|
681 |
+
[2022-12-18 13:54:24,429] [INFO] [timer.py:196:stop] epoch=0/micro_step=1020/global_step=1020, RunningAvgSamplesPerSec=17.572388134802974, CurrSamplesPerSec=17.892674397538443, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
682 |
+
{'loss': 0.0037, 'learning_rate': 8.844444444444445e-06, 'epoch': 8.02}
|
683 |
+
[2022-12-18 13:57:13,310] [INFO] [logging.py:68:log_dist] [Rank 0] step=1030, skipped=4, lr=[8.833333333333334e-06], mom=[[0.9, 0.999]]
|
684 |
+
[2022-12-18 13:57:13,312] [INFO] [timer.py:196:stop] epoch=0/micro_step=1030/global_step=1030, RunningAvgSamplesPerSec=17.574865744089255, CurrSamplesPerSec=17.830042096987665, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
685 |
+
[2022-12-18 13:59:11,204] [INFO] [logging.py:68:log_dist] [Rank 0] step=1040, skipped=4, lr=[8.811111111111112e-06], mom=[[0.9, 0.999]]
|
686 |
+
[2022-12-18 13:59:11,205] [INFO] [timer.py:196:stop] epoch=0/micro_step=1040/global_step=1040, RunningAvgSamplesPerSec=17.578049897136907, CurrSamplesPerSec=17.867588966876575, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
687 |
+
[2022-12-18 14:02:49,022] [INFO] [logging.py:68:log_dist] [Rank 0] step=1050, skipped=4, lr=[8.788888888888891e-06], mom=[[0.9, 0.999]]
|
688 |
+
[2022-12-18 14:02:49,024] [INFO] [timer.py:196:stop] epoch=0/micro_step=1050/global_step=1050, RunningAvgSamplesPerSec=17.580728468924875, CurrSamplesPerSec=16.936598922699496, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
689 |
+
{'loss': 0.0035, 'learning_rate': 8.788888888888891e-06, 'epoch': 9.0}
|
690 |
+
[2022-12-18 14:05:33,620] [INFO] [logging.py:68:log_dist] [Rank 0] step=1060, skipped=4, lr=[8.766666666666669e-06], mom=[[0.9, 0.999]]
|
691 |
+
[2022-12-18 14:05:33,622] [INFO] [timer.py:196:stop] epoch=0/micro_step=1060/global_step=1060, RunningAvgSamplesPerSec=17.58019104202516, CurrSamplesPerSec=17.78021928381013, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
692 |
+
[2022-12-18 14:08:19,806] [INFO] [logging.py:68:log_dist] [Rank 0] step=1070, skipped=4, lr=[8.744444444444446e-06], mom=[[0.9, 0.999]]
|
693 |
+
[2022-12-18 14:08:19,808] [INFO] [timer.py:196:stop] epoch=0/micro_step=1070/global_step=1070, RunningAvgSamplesPerSec=17.58085065511869, CurrSamplesPerSec=17.769715705552866, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
694 |
+
{'loss': 0.0038, 'learning_rate': 8.733333333333333e-06, 'epoch': 9.01}
|
695 |
+
[2022-12-18 14:11:06,335] [INFO] [logging.py:68:log_dist] [Rank 0] step=1080, skipped=4, lr=[8.722222222222224e-06], mom=[[0.9, 0.999]]
|
696 |
+
[2022-12-18 14:11:06,337] [INFO] [timer.py:196:stop] epoch=0/micro_step=1080/global_step=1080, RunningAvgSamplesPerSec=17.579904119591816, CurrSamplesPerSec=17.725968554296802, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
697 |
+
[2022-12-18 14:13:51,062] [INFO] [logging.py:68:log_dist] [Rank 0] step=1090, skipped=4, lr=[8.700000000000001e-06], mom=[[0.9, 0.999]]
|
698 |
+
[2022-12-18 14:13:51,064] [INFO] [timer.py:196:stop] epoch=0/micro_step=1090/global_step=1090, RunningAvgSamplesPerSec=17.57849146162872, CurrSamplesPerSec=17.506514413151734, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
699 |
+
[2022-12-18 14:16:41,006] [INFO] [logging.py:68:log_dist] [Rank 0] step=1100, skipped=4, lr=[8.677777777777779e-06], mom=[[0.9, 0.999]]
|
700 |
+
[2022-12-18 14:16:41,008] [INFO] [timer.py:196:stop] epoch=0/micro_step=1100/global_step=1100, RunningAvgSamplesPerSec=17.577930456746696, CurrSamplesPerSec=17.35344135867515, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
701 |
+
{'loss': 0.003, 'learning_rate': 8.677777777777779e-06, 'epoch': 9.01}
|
702 |
+
[2022-12-18 14:19:28,216] [INFO] [logging.py:68:log_dist] [Rank 0] step=1110, skipped=4, lr=[8.655555555555557e-06], mom=[[0.9, 0.999]]
|
703 |
+
[2022-12-18 14:19:28,217] [INFO] [timer.py:196:stop] epoch=0/micro_step=1110/global_step=1110, RunningAvgSamplesPerSec=17.57769973626041, CurrSamplesPerSec=17.8288981283458, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
704 |
+
[2022-12-18 14:22:17,888] [INFO] [logging.py:68:log_dist] [Rank 0] step=1120, skipped=4, lr=[8.633333333333334e-06], mom=[[0.9, 0.999]]
|
705 |
+
[2022-12-18 14:22:17,890] [INFO] [timer.py:196:stop] epoch=0/micro_step=1120/global_step=1120, RunningAvgSamplesPerSec=17.578493365622567, CurrSamplesPerSec=17.76609576686019, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
706 |
+
{'loss': 0.0037, 'learning_rate': 8.622222222222223e-06, 'epoch': 9.02}
|
707 |
+
[2022-12-18 14:25:16,389] [INFO] [logging.py:68:log_dist] [Rank 0] step=1130, skipped=4, lr=[8.611111111111112e-06], mom=[[0.9, 0.999]]
|
708 |
+
[2022-12-18 14:25:16,391] [INFO] [timer.py:196:stop] epoch=0/micro_step=1130/global_step=1130, RunningAvgSamplesPerSec=17.57871703334657, CurrSamplesPerSec=17.785122219131072, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
709 |
+
[2022-12-18 14:28:10,876] [INFO] [logging.py:68:log_dist] [Rank 0] step=1140, skipped=4, lr=[8.58888888888889e-06], mom=[[0.9, 0.999]]
|
710 |
+
[2022-12-18 14:28:10,877] [INFO] [timer.py:196:stop] epoch=0/micro_step=1140/global_step=1140, RunningAvgSamplesPerSec=17.579459740076313, CurrSamplesPerSec=17.701639889108993, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
711 |
+
[2022-12-18 14:31:05,041] [INFO] [logging.py:68:log_dist] [Rank 0] step=1150, skipped=4, lr=[8.566666666666667e-06], mom=[[0.9, 0.999]]
|
712 |
+
[2022-12-18 14:31:05,043] [INFO] [timer.py:196:stop] epoch=0/micro_step=1150/global_step=1150, RunningAvgSamplesPerSec=17.580003764189616, CurrSamplesPerSec=17.80629521962796, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
713 |
+
{'loss': 0.0036, 'learning_rate': 8.566666666666667e-06, 'epoch': 9.02}
|
714 |
+
[2022-12-18 14:32:07,420] [INFO] [logging.py:68:log_dist] [Rank 0] step=1160, skipped=4, lr=[8.544444444444445e-06], mom=[[0.9, 0.999]]
|
715 |
+
[2022-12-18 14:32:07,422] [INFO] [timer.py:196:stop] epoch=0/micro_step=1160/global_step=1160, RunningAvgSamplesPerSec=17.585442663890536, CurrSamplesPerSec=23.566298323625077, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
716 |
+
[2022-12-18 14:36:43,234] [INFO] [logging.py:68:log_dist] [Rank 0] step=1170, skipped=4, lr=[8.522222222222222e-06], mom=[[0.9, 0.999]]
|
717 |
+
[2022-12-18 14:36:43,236] [INFO] [timer.py:196:stop] epoch=0/micro_step=1170/global_step=1170, RunningAvgSamplesPerSec=17.586206519395216, CurrSamplesPerSec=17.676859898043542, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
718 |
+
{'loss': 0.0034, 'learning_rate': 8.511111111111113e-06, 'epoch': 10.0}
|
719 |
+
[2022-12-18 14:39:41,206] [INFO] [logging.py:68:log_dist] [Rank 0] step=1180, skipped=4, lr=[8.5e-06], mom=[[0.9, 0.999]]
|
720 |
+
[2022-12-18 14:39:41,207] [INFO] [timer.py:196:stop] epoch=0/micro_step=1180/global_step=1180, RunningAvgSamplesPerSec=17.58715483416817, CurrSamplesPerSec=17.49722916739182, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
721 |
+
[2022-12-18 14:42:39,474] [INFO] [logging.py:68:log_dist] [Rank 0] step=1190, skipped=4, lr=[8.477777777777778e-06], mom=[[0.9, 0.999]]
|
722 |
+
[2022-12-18 14:42:39,476] [INFO] [timer.py:196:stop] epoch=0/micro_step=1190/global_step=1190, RunningAvgSamplesPerSec=17.587668127524644, CurrSamplesPerSec=17.847793588674662, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
723 |
+
[2022-12-18 14:45:31,663] [INFO] [logging.py:68:log_dist] [Rank 0] step=1200, skipped=4, lr=[8.455555555555555e-06], mom=[[0.9, 0.999]]
|
724 |
+
[2022-12-18 14:45:31,664] [INFO] [timer.py:196:stop] epoch=0/micro_step=1200/global_step=1200, RunningAvgSamplesPerSec=17.588774465446917, CurrSamplesPerSec=17.72776783125711, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
725 |
+
{'loss': 0.0032, 'learning_rate': 8.455555555555555e-06, 'epoch': 10.01}
|
726 |
+
[2022-12-18 14:48:31,335] [INFO] [logging.py:68:log_dist] [Rank 0] step=1210, skipped=4, lr=[8.433333333333334e-06], mom=[[0.9, 0.999]]
|
727 |
+
[2022-12-18 14:48:31,336] [INFO] [timer.py:196:stop] epoch=0/micro_step=1210/global_step=1210, RunningAvgSamplesPerSec=17.59026563335073, CurrSamplesPerSec=17.859623009470603, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
728 |
+
[2022-12-18 14:51:22,847] [INFO] [logging.py:68:log_dist] [Rank 0] step=1220, skipped=4, lr=[8.411111111111112e-06], mom=[[0.9, 0.999]]
|
729 |
+
[2022-12-18 14:51:22,848] [INFO] [timer.py:196:stop] epoch=0/micro_step=1220/global_step=1220, RunningAvgSamplesPerSec=17.591299866434216, CurrSamplesPerSec=17.783575181860076, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
730 |
+
{'loss': 0.0028, 'learning_rate': 8.400000000000001e-06, 'epoch': 10.01}
|
731 |
+
[2022-12-18 14:54:19,800] [INFO] [logging.py:68:log_dist] [Rank 0] step=1230, skipped=4, lr=[8.38888888888889e-06], mom=[[0.9, 0.999]]
|
732 |
+
[2022-12-18 14:54:19,802] [INFO] [timer.py:196:stop] epoch=0/micro_step=1230/global_step=1230, RunningAvgSamplesPerSec=17.592045413292322, CurrSamplesPerSec=17.752746037986785, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
733 |
+
[2022-12-18 14:57:16,096] [INFO] [logging.py:68:log_dist] [Rank 0] step=1240, skipped=4, lr=[8.366666666666667e-06], mom=[[0.9, 0.999]]
|
734 |
+
[2022-12-18 14:57:16,098] [INFO] [timer.py:196:stop] epoch=0/micro_step=1240/global_step=1240, RunningAvgSamplesPerSec=17.592574591618792, CurrSamplesPerSec=17.73023145463939, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
735 |
+
[2022-12-18 15:00:08,985] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=4, lr=[8.344444444444445e-06], mom=[[0.9, 0.999]]
|
736 |
+
[2022-12-18 15:00:08,987] [INFO] [timer.py:196:stop] epoch=0/micro_step=1250/global_step=1250, RunningAvgSamplesPerSec=17.592765978430887, CurrSamplesPerSec=17.581207935360524, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
737 |
+
{'loss': 0.0026, 'learning_rate': 8.344444444444445e-06, 'epoch': 10.02}
|
738 |
+
[2022-12-18 15:02:58,611] [INFO] [logging.py:68:log_dist] [Rank 0] step=1260, skipped=4, lr=[8.322222222222223e-06], mom=[[0.9, 0.999]]
|
739 |
+
[2022-12-18 15:02:58,612] [INFO] [timer.py:196:stop] epoch=0/micro_step=1260/global_step=1260, RunningAvgSamplesPerSec=17.591578207832033, CurrSamplesPerSec=17.7094327944215, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
740 |
+
[2022-12-18 15:05:18,634] [INFO] [logging.py:68:log_dist] [Rank 0] step=1270, skipped=4, lr=[8.3e-06], mom=[[0.9, 0.999]]
|
741 |
+
[2022-12-18 15:05:18,636] [INFO] [timer.py:196:stop] epoch=0/micro_step=1270/global_step=1270, RunningAvgSamplesPerSec=17.59139456418053, CurrSamplesPerSec=17.86228150125289, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
742 |
+
{'loss': 0.0022, 'learning_rate': 8.288888888888889e-06, 'epoch': 10.02}
|
743 |
+
[2022-12-18 15:08:24,605] [INFO] [logging.py:68:log_dist] [Rank 0] step=1280, skipped=4, lr=[8.277777777777778e-06], mom=[[0.9, 0.999]]
|
744 |
+
[2022-12-18 15:08:24,607] [INFO] [timer.py:196:stop] epoch=0/micro_step=1280/global_step=1280, RunningAvgSamplesPerSec=17.595103033188895, CurrSamplesPerSec=17.542366899024405, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
745 |
+
[2022-12-18 15:11:11,263] [INFO] [logging.py:68:log_dist] [Rank 0] step=1290, skipped=4, lr=[8.255555555555557e-06], mom=[[0.9, 0.999]]
|
746 |
+
[2022-12-18 15:11:11,265] [INFO] [timer.py:196:stop] epoch=0/micro_step=1290/global_step=1290, RunningAvgSamplesPerSec=17.5945939842215, CurrSamplesPerSec=17.651455059768324, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
747 |
+
[2022-12-18 15:13:59,569] [INFO] [logging.py:68:log_dist] [Rank 0] step=1300, skipped=4, lr=[8.233333333333335e-06], mom=[[0.9, 0.999]]
|
748 |
+
[2022-12-18 15:13:59,571] [INFO] [timer.py:196:stop] epoch=0/micro_step=1300/global_step=1300, RunningAvgSamplesPerSec=17.593519986136247, CurrSamplesPerSec=17.543569555967117, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
749 |
+
{'loss': 0.002, 'learning_rate': 8.233333333333335e-06, 'epoch': 11.0}
|
750 |
+
[2022-12-18 15:16:46,768] [INFO] [logging.py:68:log_dist] [Rank 0] step=1310, skipped=4, lr=[8.211111111111112e-06], mom=[[0.9, 0.999]]
|
751 |
+
[2022-12-18 15:16:46,770] [INFO] [timer.py:196:stop] epoch=0/micro_step=1310/global_step=1310, RunningAvgSamplesPerSec=17.593159287186108, CurrSamplesPerSec=17.657518333371485, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
752 |
+
[2022-12-18 15:19:35,218] [INFO] [logging.py:68:log_dist] [Rank 0] step=1320, skipped=4, lr=[8.18888888888889e-06], mom=[[0.9, 0.999]]
|
753 |
+
[2022-12-18 15:19:35,220] [INFO] [timer.py:196:stop] epoch=0/micro_step=1320/global_step=1320, RunningAvgSamplesPerSec=17.592123863305886, CurrSamplesPerSec=17.2624771803989, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
754 |
+
{'loss': 0.0019, 'learning_rate': 8.177777777777779e-06, 'epoch': 11.01}
|
755 |
+
[2022-12-18 15:22:23,739] [INFO] [logging.py:68:log_dist] [Rank 0] step=1330, skipped=4, lr=[8.166666666666668e-06], mom=[[0.9, 0.999]]
|
756 |
+
[2022-12-18 15:22:23,741] [INFO] [timer.py:196:stop] epoch=0/micro_step=1330/global_step=1330, RunningAvgSamplesPerSec=17.59144315331231, CurrSamplesPerSec=17.624369892176606, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
757 |
+
[2022-12-18 15:25:12,977] [INFO] [logging.py:68:log_dist] [Rank 0] step=1340, skipped=4, lr=[8.144444444444445e-06], mom=[[0.9, 0.999]]
|
758 |
+
[2022-12-18 15:25:12,979] [INFO] [timer.py:196:stop] epoch=0/micro_step=1340/global_step=1340, RunningAvgSamplesPerSec=17.591467485388325, CurrSamplesPerSec=17.800980423298352, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
759 |
+
[2022-12-18 15:27:59,991] [INFO] [logging.py:68:log_dist] [Rank 0] step=1350, skipped=4, lr=[8.122222222222223e-06], mom=[[0.9, 0.999]]
|
760 |
+
[2022-12-18 15:27:59,993] [INFO] [timer.py:196:stop] epoch=0/micro_step=1350/global_step=1350, RunningAvgSamplesPerSec=17.58933135085906, CurrSamplesPerSec=17.65995201259033, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
761 |
+
{'loss': 0.0015, 'learning_rate': 8.122222222222223e-06, 'epoch': 11.01}
|
762 |
+
[2022-12-18 15:30:49,368] [INFO] [logging.py:68:log_dist] [Rank 0] step=1360, skipped=4, lr=[8.1e-06], mom=[[0.9, 0.999]]
|
763 |
+
[2022-12-18 15:30:49,369] [INFO] [timer.py:196:stop] epoch=0/micro_step=1360/global_step=1360, RunningAvgSamplesPerSec=17.589196844316973, CurrSamplesPerSec=17.74898631234037, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
764 |
+
[2022-12-18 15:33:39,600] [INFO] [logging.py:68:log_dist] [Rank 0] step=1370, skipped=4, lr=[8.077777777777778e-06], mom=[[0.9, 0.999]]
|
765 |
+
[2022-12-18 15:33:39,602] [INFO] [timer.py:196:stop] epoch=0/micro_step=1370/global_step=1370, RunningAvgSamplesPerSec=17.588471414233023, CurrSamplesPerSec=17.547551309022158, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
766 |
+
{'loss': 0.0013, 'learning_rate': 8.066666666666667e-06, 'epoch': 11.02}
|
767 |
+
[2022-12-18 15:36:28,193] [INFO] [logging.py:68:log_dist] [Rank 0] step=1380, skipped=4, lr=[8.055555555555557e-06], mom=[[0.9, 0.999]]
|
768 |
+
[2022-12-18 15:36:28,195] [INFO] [timer.py:196:stop] epoch=0/micro_step=1380/global_step=1380, RunningAvgSamplesPerSec=17.588757839650413, CurrSamplesPerSec=17.470279456311328, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
769 |
+
[2022-12-18 15:37:56,456] [INFO] [logging.py:68:log_dist] [Rank 0] step=1390, skipped=4, lr=[8.033333333333335e-06], mom=[[0.9, 0.999]]
|
770 |
+
[2022-12-18 15:37:56,457] [INFO] [timer.py:196:stop] epoch=0/micro_step=1390/global_step=1390, RunningAvgSamplesPerSec=17.589444637240174, CurrSamplesPerSec=17.796452184981753, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
771 |
+
[2022-12-18 15:41:58,134] [INFO] [logging.py:68:log_dist] [Rank 0] step=1400, skipped=4, lr=[8.011111111111113e-06], mom=[[0.9, 0.999]]
|
772 |
+
[2022-12-18 15:41:58,136] [INFO] [timer.py:196:stop] epoch=0/micro_step=1400/global_step=1400, RunningAvgSamplesPerSec=17.59123118643266, CurrSamplesPerSec=17.351121701395353, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
773 |
+
{'loss': 0.0011, 'learning_rate': 8.011111111111113e-06, 'epoch': 12.0}
|
774 |
+
[2022-12-18 15:44:46,847] [INFO] [logging.py:68:log_dist] [Rank 0] step=1410, skipped=4, lr=[7.98888888888889e-06], mom=[[0.9, 0.999]]
|
775 |
+
[2022-12-18 15:44:46,849] [INFO] [timer.py:196:stop] epoch=0/micro_step=1410/global_step=1410, RunningAvgSamplesPerSec=17.589641567546884, CurrSamplesPerSec=16.81193579966427, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
776 |
+
[2022-12-18 15:47:37,871] [INFO] [logging.py:68:log_dist] [Rank 0] step=1420, skipped=4, lr=[7.966666666666668e-06], mom=[[0.9, 0.999]]
|
777 |
+
[2022-12-18 15:47:37,873] [INFO] [timer.py:196:stop] epoch=0/micro_step=1420/global_step=1420, RunningAvgSamplesPerSec=17.588067850747443, CurrSamplesPerSec=17.428626152497475, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
778 |
+
{'loss': 0.0012, 'learning_rate': 7.955555555555557e-06, 'epoch': 12.01}
|
779 |
+
[2022-12-18 15:50:26,270] [INFO] [logging.py:68:log_dist] [Rank 0] step=1430, skipped=4, lr=[7.944444444444445e-06], mom=[[0.9, 0.999]]
|
780 |
+
[2022-12-18 15:50:26,271] [INFO] [timer.py:196:stop] epoch=0/micro_step=1430/global_step=1430, RunningAvgSamplesPerSec=17.58741915847445, CurrSamplesPerSec=17.76162991584455, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
781 |
+
[2022-12-18 15:53:13,708] [INFO] [logging.py:68:log_dist] [Rank 0] step=1440, skipped=4, lr=[7.922222222222223e-06], mom=[[0.9, 0.999]]
|
782 |
+
[2022-12-18 15:53:13,710] [INFO] [timer.py:196:stop] epoch=0/micro_step=1440/global_step=1440, RunningAvgSamplesPerSec=17.585312149367923, CurrSamplesPerSec=17.475912824942153, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
783 |
+
[2022-12-18 15:56:04,224] [INFO] [logging.py:68:log_dist] [Rank 0] step=1450, skipped=4, lr=[7.9e-06], mom=[[0.9, 0.999]]
|
784 |
+
[2022-12-18 15:56:04,226] [INFO] [timer.py:196:stop] epoch=0/micro_step=1450/global_step=1450, RunningAvgSamplesPerSec=17.584736964932496, CurrSamplesPerSec=17.084766353718567, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
785 |
+
{'loss': 0.0011, 'learning_rate': 7.9e-06, 'epoch': 12.01}
|
786 |
+
[2022-12-18 15:58:51,836] [INFO] [logging.py:68:log_dist] [Rank 0] step=1460, skipped=4, lr=[7.877777777777778e-06], mom=[[0.9, 0.999]]
|
787 |
+
[2022-12-18 15:58:51,838] [INFO] [timer.py:196:stop] epoch=0/micro_step=1460/global_step=1460, RunningAvgSamplesPerSec=17.58396256582276, CurrSamplesPerSec=17.408034199285936, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
788 |
+
[2022-12-18 16:01:39,733] [INFO] [logging.py:68:log_dist] [Rank 0] step=1470, skipped=4, lr=[7.855555555555556e-06], mom=[[0.9, 0.999]]
|
789 |
+
[2022-12-18 16:01:39,735] [INFO] [timer.py:196:stop] epoch=0/micro_step=1470/global_step=1470, RunningAvgSamplesPerSec=17.584820238828023, CurrSamplesPerSec=17.65451172025162, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
790 |
+
{'loss': 0.0012, 'learning_rate': 7.844444444444446e-06, 'epoch': 12.02}
|
791 |
+
[2022-12-18 16:04:29,437] [INFO] [logging.py:68:log_dist] [Rank 0] step=1480, skipped=4, lr=[7.833333333333333e-06], mom=[[0.9, 0.999]]
|
792 |
+
[2022-12-18 16:04:29,438] [INFO] [timer.py:196:stop] epoch=0/micro_step=1480/global_step=1480, RunningAvgSamplesPerSec=17.58523223228052, CurrSamplesPerSec=17.692160554263566, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
793 |
+
[2022-12-18 16:07:19,577] [INFO] [logging.py:68:log_dist] [Rank 0] step=1490, skipped=4, lr=[7.811111111111111e-06], mom=[[0.9, 0.999]]
|
794 |
+
[2022-12-18 16:07:19,579] [INFO] [timer.py:196:stop] epoch=0/micro_step=1490/global_step=1490, RunningAvgSamplesPerSec=17.585349491276787, CurrSamplesPerSec=17.528617434648247, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
795 |
+
[2022-12-18 16:10:04,717] [INFO] [logging.py:68:log_dist] [Rank 0] step=1500, skipped=4, lr=[7.788888888888889e-06], mom=[[0.9, 0.999]]
|
796 |
+
[2022-12-18 16:10:04,720] [INFO] [timer.py:196:stop] epoch=0/micro_step=1500/global_step=1500, RunningAvgSamplesPerSec=17.58553346224858, CurrSamplesPerSec=17.30400841724948, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
797 |
+
{'loss': 0.001, 'learning_rate': 7.788888888888889e-06, 'epoch': 12.02}
|
798 |
+
[2022-12-18 16:12:51,306] [INFO] [logging.py:68:log_dist] [Rank 0] step=1510, skipped=4, lr=[7.766666666666666e-06], mom=[[0.9, 0.999]]
|
799 |
+
[2022-12-18 16:12:51,307] [INFO] [timer.py:196:stop] epoch=0/micro_step=1510/global_step=1510, RunningAvgSamplesPerSec=17.58739109597355, CurrSamplesPerSec=17.62212763286849, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
800 |
+
[2022-12-18 16:15:39,256] [INFO] [logging.py:68:log_dist] [Rank 0] step=1520, skipped=4, lr=[7.744444444444446e-06], mom=[[0.9, 0.999]]
|
801 |
+
[2022-12-18 16:15:39,258] [INFO] [timer.py:196:stop] epoch=0/micro_step=1520/global_step=1520, RunningAvgSamplesPerSec=17.58624163007446, CurrSamplesPerSec=17.223591201370315, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
802 |
+
{'loss': 0.0009, 'learning_rate': 7.733333333333334e-06, 'epoch': 13.0}
|
803 |
+
[2022-12-18 16:18:30,572] [INFO] [logging.py:68:log_dist] [Rank 0] step=1530, skipped=4, lr=[7.722222222222223e-06], mom=[[0.9, 0.999]]
|
804 |
+
[2022-12-18 16:18:30,573] [INFO] [timer.py:196:stop] epoch=0/micro_step=1530/global_step=1530, RunningAvgSamplesPerSec=17.584710397614575, CurrSamplesPerSec=17.58900462856031, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
805 |
+
[2022-12-18 16:21:29,403] [INFO] [logging.py:68:log_dist] [Rank 0] step=1540, skipped=4, lr=[7.7e-06], mom=[[0.9, 0.999]]
|
806 |
+
[2022-12-18 16:21:29,405] [INFO] [timer.py:196:stop] epoch=0/micro_step=1540/global_step=1540, RunningAvgSamplesPerSec=17.584223073868095, CurrSamplesPerSec=17.580700146010567, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
807 |
+
[2022-12-18 16:24:23,678] [INFO] [logging.py:68:log_dist] [Rank 0] step=1550, skipped=4, lr=[7.677777777777778e-06], mom=[[0.9, 0.999]]
|
808 |
+
[2022-12-18 16:24:23,680] [INFO] [timer.py:196:stop] epoch=0/micro_step=1550/global_step=1550, RunningAvgSamplesPerSec=17.58439155673778, CurrSamplesPerSec=17.668182700157885, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
809 |
+
{'loss': 0.0009, 'learning_rate': 7.677777777777778e-06, 'epoch': 13.01}
|
810 |
+
[2022-12-18 16:27:18,302] [INFO] [logging.py:68:log_dist] [Rank 0] step=1560, skipped=4, lr=[7.655555555555556e-06], mom=[[0.9, 0.999]]
|
811 |
+
[2022-12-18 16:27:18,304] [INFO] [timer.py:196:stop] epoch=0/micro_step=1560/global_step=1560, RunningAvgSamplesPerSec=17.584671210437033, CurrSamplesPerSec=17.72357047648451, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
812 |
+
[2022-12-18 16:30:12,136] [INFO] [logging.py:68:log_dist] [Rank 0] step=1570, skipped=4, lr=[7.633333333333334e-06], mom=[[0.9, 0.999]]
|
813 |
+
[2022-12-18 16:30:12,138] [INFO] [timer.py:196:stop] epoch=0/micro_step=1570/global_step=1570, RunningAvgSamplesPerSec=17.584434792027235, CurrSamplesPerSec=17.67633958453577, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
814 |
+
{'loss': 0.0012, 'learning_rate': 7.622222222222223e-06, 'epoch': 13.01}
|
815 |
+
[2022-12-18 16:33:09,808] [INFO] [logging.py:68:log_dist] [Rank 0] step=1580, skipped=4, lr=[7.611111111111111e-06], mom=[[0.9, 0.999]]
|
816 |
+
[2022-12-18 16:33:09,810] [INFO] [timer.py:196:stop] epoch=0/micro_step=1580/global_step=1580, RunningAvgSamplesPerSec=17.584395220473127, CurrSamplesPerSec=17.75029258656175, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
817 |
+
[2022-12-18 16:36:05,926] [INFO] [logging.py:68:log_dist] [Rank 0] step=1590, skipped=4, lr=[7.588888888888889e-06], mom=[[0.9, 0.999]]
|
818 |
+
[2022-12-18 16:36:05,928] [INFO] [timer.py:196:stop] epoch=0/micro_step=1590/global_step=1590, RunningAvgSamplesPerSec=17.58474933416415, CurrSamplesPerSec=17.61421024415819, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
819 |
+
[2022-12-18 16:39:02,292] [INFO] [logging.py:68:log_dist] [Rank 0] step=1600, skipped=4, lr=[7.566666666666667e-06], mom=[[0.9, 0.999]]
|
820 |
+
[2022-12-18 16:39:02,294] [INFO] [timer.py:196:stop] epoch=0/micro_step=1600/global_step=1600, RunningAvgSamplesPerSec=17.584912959793066, CurrSamplesPerSec=17.653429636601118, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
821 |
+
{'loss': 0.0009, 'learning_rate': 7.566666666666667e-06, 'epoch': 13.02}
|
822 |
+
[2022-12-18 16:42:00,495] [INFO] [logging.py:68:log_dist] [Rank 0] step=1610, skipped=4, lr=[7.544444444444445e-06], mom=[[0.9, 0.999]]
|
823 |
+
[2022-12-18 16:42:00,497] [INFO] [timer.py:196:stop] epoch=0/micro_step=1610/global_step=1610, RunningAvgSamplesPerSec=17.58463914064396, CurrSamplesPerSec=17.55525847226904, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
824 |
+
[2022-12-18 16:44:04,575] [INFO] [logging.py:68:log_dist] [Rank 0] step=1620, skipped=4, lr=[7.5222222222222226e-06], mom=[[0.9, 0.999]]
|
825 |
+
[2022-12-18 16:44:04,578] [INFO] [timer.py:196:stop] epoch=0/micro_step=1620/global_step=1620, RunningAvgSamplesPerSec=17.585002091371685, CurrSamplesPerSec=17.664836496154194, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
826 |
+
{'loss': 0.0008, 'learning_rate': 7.511111111111111e-06, 'epoch': 14.0}
|
827 |
+
[2022-12-18 16:47:55,627] [INFO] [logging.py:68:log_dist] [Rank 0] step=1630, skipped=4, lr=[7.500000000000001e-06], mom=[[0.9, 0.999]]
|
828 |
+
[2022-12-18 16:47:55,629] [INFO] [timer.py:196:stop] epoch=0/micro_step=1630/global_step=1630, RunningAvgSamplesPerSec=17.587178112703374, CurrSamplesPerSec=17.63657926828507, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
829 |
+
[2022-12-18 16:50:51,334] [INFO] [logging.py:68:log_dist] [Rank 0] step=1640, skipped=4, lr=[7.477777777777779e-06], mom=[[0.9, 0.999]]
|
830 |
+
[2022-12-18 16:50:51,337] [INFO] [timer.py:196:stop] epoch=0/micro_step=1640/global_step=1640, RunningAvgSamplesPerSec=17.58712243334102, CurrSamplesPerSec=17.347132175313174, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
831 |
+
[2022-12-18 16:53:53,203] [INFO] [logging.py:68:log_dist] [Rank 0] step=1650, skipped=4, lr=[7.455555555555556e-06], mom=[[0.9, 0.999]]
|
832 |
+
[2022-12-18 16:53:53,205] [INFO] [timer.py:196:stop] epoch=0/micro_step=1650/global_step=1650, RunningAvgSamplesPerSec=17.587439830952103, CurrSamplesPerSec=17.622363633296064, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
833 |
+
{'loss': 0.0008, 'learning_rate': 7.455555555555556e-06, 'epoch': 14.01}
|
834 |
+
[2022-12-18 16:56:55,071] [INFO] [logging.py:68:log_dist] [Rank 0] step=1660, skipped=4, lr=[7.433333333333334e-06], mom=[[0.9, 0.999]]
|
835 |
+
[2022-12-18 16:56:55,073] [INFO] [timer.py:196:stop] epoch=0/micro_step=1660/global_step=1660, RunningAvgSamplesPerSec=17.58774677231902, CurrSamplesPerSec=17.723122298678486, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
836 |
+
[2022-12-18 16:59:53,412] [INFO] [logging.py:68:log_dist] [Rank 0] step=1670, skipped=4, lr=[7.411111111111112e-06], mom=[[0.9, 0.999]]
|
837 |
+
[2022-12-18 16:59:53,415] [INFO] [timer.py:196:stop] epoch=0/micro_step=1670/global_step=1670, RunningAvgSamplesPerSec=17.586889141711087, CurrSamplesPerSec=17.5252632408432, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
838 |
+
{'loss': 0.0008, 'learning_rate': 7.4e-06, 'epoch': 14.01}
|
839 |
+
[2022-12-18 17:02:55,081] [INFO] [logging.py:68:log_dist] [Rank 0] step=1680, skipped=4, lr=[7.38888888888889e-06], mom=[[0.9, 0.999]]
|
840 |
+
[2022-12-18 17:02:55,082] [INFO] [timer.py:196:stop] epoch=0/micro_step=1680/global_step=1680, RunningAvgSamplesPerSec=17.58688935579313, CurrSamplesPerSec=17.783240595339233, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
|
841 |
+
[2022-12-18 17:05:57,804] [INFO] [logging.py:68:log_dist] [Rank 0] step=1690, skipped=4, lr=[7.3666666666666676e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:05:57,806] [INFO] [timer.py:196:stop] epoch=0/micro_step=1690/global_step=1690, RunningAvgSamplesPerSec=17.587471464153204, CurrSamplesPerSec=17.783433805737804, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 17:08:55,981] [INFO] [logging.py:68:log_dist] [Rank 0] step=1700, skipped=4, lr=[7.344444444444445e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:08:55,982] [INFO] [timer.py:196:stop] epoch=0/micro_step=1700/global_step=1700, RunningAvgSamplesPerSec=17.587408570215107, CurrSamplesPerSec=17.451048991964477, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0008, 'learning_rate': 7.344444444444445e-06, 'epoch': 14.02}
+[2022-12-18 17:11:56,277] [INFO] [logging.py:68:log_dist] [Rank 0] step=1710, skipped=4, lr=[7.322222222222223e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:11:56,280] [INFO] [timer.py:196:stop] epoch=0/micro_step=1710/global_step=1710, RunningAvgSamplesPerSec=17.5876806833596, CurrSamplesPerSec=17.729723217717872, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 17:14:57,436] [INFO] [logging.py:68:log_dist] [Rank 0] step=1720, skipped=4, lr=[7.3e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:14:57,438] [INFO] [timer.py:196:stop] epoch=0/micro_step=1720/global_step=1720, RunningAvgSamplesPerSec=17.5872673472387, CurrSamplesPerSec=17.7992642130134, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0007, 'learning_rate': 7.28888888888889e-06, 'epoch': 14.02}
+[2022-12-18 17:17:53,606] [INFO] [logging.py:68:log_dist] [Rank 0] step=1730, skipped=4, lr=[7.277777777777778e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:17:53,608] [INFO] [timer.py:196:stop] epoch=0/micro_step=1730/global_step=1730, RunningAvgSamplesPerSec=17.586756975707022, CurrSamplesPerSec=17.563738888918333, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 17:18:56,785] [INFO] [logging.py:68:log_dist] [Rank 0] step=1740, skipped=4, lr=[7.255555555555556e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:18:56,787] [INFO] [timer.py:196:stop] epoch=0/micro_step=1740/global_step=1740, RunningAvgSamplesPerSec=17.58932825421085, CurrSamplesPerSec=23.433776461872853, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 17:23:37,623] [INFO] [logging.py:68:log_dist] [Rank 0] step=1750, skipped=4, lr=[7.233333333333334e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:23:37,624] [INFO] [timer.py:196:stop] epoch=0/micro_step=1750/global_step=1750, RunningAvgSamplesPerSec=17.58929052224577, CurrSamplesPerSec=17.722104328870888, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0007, 'learning_rate': 7.233333333333334e-06, 'epoch': 15.0}
+[2022-12-18 17:26:29,765] [INFO] [logging.py:68:log_dist] [Rank 0] step=1760, skipped=4, lr=[7.211111111111112e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:26:29,767] [INFO] [timer.py:196:stop] epoch=0/micro_step=1760/global_step=1760, RunningAvgSamplesPerSec=17.589265190884376, CurrSamplesPerSec=17.29575795734955, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 17:29:23,777] [INFO] [logging.py:68:log_dist] [Rank 0] step=1770, skipped=4, lr=[7.188888888888889e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:29:23,778] [INFO] [timer.py:196:stop] epoch=0/micro_step=1770/global_step=1770, RunningAvgSamplesPerSec=17.589199628233683, CurrSamplesPerSec=17.432689482722747, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0007, 'learning_rate': 7.177777777777778e-06, 'epoch': 15.01}
+[2022-12-18 17:32:22,814] [INFO] [logging.py:68:log_dist] [Rank 0] step=1780, skipped=4, lr=[7.166666666666667e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:32:22,815] [INFO] [timer.py:196:stop] epoch=0/micro_step=1780/global_step=1780, RunningAvgSamplesPerSec=17.589156096057113, CurrSamplesPerSec=17.679121929983722, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 17:35:21,130] [INFO] [logging.py:68:log_dist] [Rank 0] step=1790, skipped=4, lr=[7.1444444444444446e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:35:21,132] [INFO] [timer.py:196:stop] epoch=0/micro_step=1790/global_step=1790, RunningAvgSamplesPerSec=17.589510541041726, CurrSamplesPerSec=17.75061537037282, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 17:38:22,077] [INFO] [logging.py:68:log_dist] [Rank 0] step=1800, skipped=4, lr=[7.122222222222222e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:38:22,079] [INFO] [timer.py:196:stop] epoch=0/micro_step=1800/global_step=1800, RunningAvgSamplesPerSec=17.589853715416716, CurrSamplesPerSec=17.70494516775891, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0007, 'learning_rate': 7.122222222222222e-06, 'epoch': 15.01}
+[2022-12-18 17:41:22,983] [INFO] [logging.py:68:log_dist] [Rank 0] step=1810, skipped=4, lr=[7.100000000000001e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:41:22,985] [INFO] [timer.py:196:stop] epoch=0/micro_step=1810/global_step=1810, RunningAvgSamplesPerSec=17.589705689066374, CurrSamplesPerSec=17.502633426748798, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 17:44:18,958] [INFO] [logging.py:68:log_dist] [Rank 0] step=1820, skipped=4, lr=[7.077777777777778e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:44:18,960] [INFO] [timer.py:196:stop] epoch=0/micro_step=1820/global_step=1820, RunningAvgSamplesPerSec=17.58889554956267, CurrSamplesPerSec=17.227935399411322, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0007, 'learning_rate': 7.066666666666667e-06, 'epoch': 15.02}
+[2022-12-18 17:47:21,773] [INFO] [logging.py:68:log_dist] [Rank 0] step=1830, skipped=4, lr=[7.055555555555557e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:47:21,775] [INFO] [timer.py:196:stop] epoch=0/micro_step=1830/global_step=1830, RunningAvgSamplesPerSec=17.588036262189785, CurrSamplesPerSec=17.291913033681215, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 17:50:27,447] [INFO] [logging.py:68:log_dist] [Rank 0] step=1840, skipped=4, lr=[7.033333333333334e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:50:27,448] [INFO] [timer.py:196:stop] epoch=0/micro_step=1840/global_step=1840, RunningAvgSamplesPerSec=17.587960597286486, CurrSamplesPerSec=17.724241030252486, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 17:52:52,644] [INFO] [logging.py:68:log_dist] [Rank 0] step=1850, skipped=4, lr=[7.011111111111112e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:52:52,646] [INFO] [timer.py:196:stop] epoch=0/micro_step=1850/global_step=1850, RunningAvgSamplesPerSec=17.587975214747647, CurrSamplesPerSec=17.64940665653735, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0006, 'learning_rate': 7.011111111111112e-06, 'epoch': 15.02}
+[2022-12-18 17:56:08,287] [INFO] [logging.py:68:log_dist] [Rank 0] step=1860, skipped=4, lr=[6.9888888888888895e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:56:08,288] [INFO] [timer.py:196:stop] epoch=0/micro_step=1860/global_step=1860, RunningAvgSamplesPerSec=17.589801958416484, CurrSamplesPerSec=17.619758723894087, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 17:59:00,804] [INFO] [logging.py:68:log_dist] [Rank 0] step=1870, skipped=4, lr=[6.966666666666667e-06], mom=[[0.9, 0.999]]
+[2022-12-18 17:59:00,805] [INFO] [timer.py:196:stop] epoch=0/micro_step=1870/global_step=1870, RunningAvgSamplesPerSec=17.59022362418036, CurrSamplesPerSec=17.588560925502627, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0006, 'learning_rate': 6.955555555555557e-06, 'epoch': 16.0}
+[2022-12-18 18:01:58,641] [INFO] [logging.py:68:log_dist] [Rank 0] step=1880, skipped=4, lr=[6.944444444444445e-06], mom=[[0.9, 0.999]]
+[2022-12-18 18:01:58,643] [INFO] [timer.py:196:stop] epoch=0/micro_step=1880/global_step=1880, RunningAvgSamplesPerSec=17.59061098794626, CurrSamplesPerSec=17.788324368998808, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 18:04:58,587] [INFO] [logging.py:68:log_dist] [Rank 0] step=1890, skipped=4, lr=[6.922222222222222e-06], mom=[[0.9, 0.999]]
+[2022-12-18 18:04:58,588] [INFO] [timer.py:196:stop] epoch=0/micro_step=1890/global_step=1890, RunningAvgSamplesPerSec=17.5910300704761, CurrSamplesPerSec=17.873249456898765, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 18:07:53,171] [INFO] [logging.py:68:log_dist] [Rank 0] step=1900, skipped=4, lr=[6.9e-06], mom=[[0.9, 0.999]]
+[2022-12-18 18:07:53,173] [INFO] [timer.py:196:stop] epoch=0/micro_step=1900/global_step=1900, RunningAvgSamplesPerSec=17.590747818093448, CurrSamplesPerSec=17.480120031668438, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0006, 'learning_rate': 6.9e-06, 'epoch': 16.01}
+[2022-12-18 18:10:47,689] [INFO] [logging.py:68:log_dist] [Rank 0] step=1910, skipped=4, lr=[6.8777777777777785e-06], mom=[[0.9, 0.999]]
+[2022-12-18 18:10:47,691] [INFO] [timer.py:196:stop] epoch=0/micro_step=1910/global_step=1910, RunningAvgSamplesPerSec=17.59104725904212, CurrSamplesPerSec=17.710297408588108, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 18:13:42,983] [INFO] [logging.py:68:log_dist] [Rank 0] step=1920, skipped=4, lr=[6.855555555555556e-06], mom=[[0.9, 0.999]]
+[2022-12-18 18:13:42,985] [INFO] [timer.py:196:stop] epoch=0/micro_step=1920/global_step=1920, RunningAvgSamplesPerSec=17.590833049552817, CurrSamplesPerSec=17.4918464812311, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0006, 'learning_rate': 6.844444444444445e-06, 'epoch': 16.01}
+[2022-12-18 18:16:38,362] [INFO] [logging.py:68:log_dist] [Rank 0] step=1930, skipped=4, lr=[6.833333333333334e-06], mom=[[0.9, 0.999]]
+[2022-12-18 18:16:38,364] [INFO] [timer.py:196:stop] epoch=0/micro_step=1930/global_step=1930, RunningAvgSamplesPerSec=17.591234933505756, CurrSamplesPerSec=17.439830364828275, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 18:19:28,894] [INFO] [logging.py:68:log_dist] [Rank 0] step=1940, skipped=4, lr=[6.811111111111111e-06], mom=[[0.9, 0.999]]
+[2022-12-18 18:19:28,896] [INFO] [timer.py:196:stop] epoch=0/micro_step=1940/global_step=1940, RunningAvgSamplesPerSec=17.590774338771933, CurrSamplesPerSec=17.599662529740584, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 18:22:22,602] [INFO] [logging.py:68:log_dist] [Rank 0] step=1950, skipped=4, lr=[6.788888888888889e-06], mom=[[0.9, 0.999]]
+[2022-12-18 18:22:22,604] [INFO] [timer.py:196:stop] epoch=0/micro_step=1950/global_step=1950, RunningAvgSamplesPerSec=17.589976661895054, CurrSamplesPerSec=17.679976600177262, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0006, 'learning_rate': 6.788888888888889e-06, 'epoch': 16.02}
+[2022-12-18 18:25:14,881] [INFO] [logging.py:68:log_dist] [Rank 0] step=1960, skipped=4, lr=[6.7666666666666665e-06], mom=[[0.9, 0.999]]
+[2022-12-18 18:25:14,882] [INFO] [timer.py:196:stop] epoch=0/micro_step=1960/global_step=1960, RunningAvgSamplesPerSec=17.59061118827806, CurrSamplesPerSec=17.789994847282397, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 18:26:44,140] [INFO] [logging.py:68:log_dist] [Rank 0] step=1970, skipped=4, lr=[6.744444444444444e-06], mom=[[0.9, 0.999]]
+[2022-12-18 18:26:44,142] [INFO] [timer.py:196:stop] epoch=0/micro_step=1970/global_step=1970, RunningAvgSamplesPerSec=17.590670368534774, CurrSamplesPerSec=17.667350100092985, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0005, 'learning_rate': 6.733333333333334e-06, 'epoch': 17.0}
+[2022-12-18 18:30:57,744] [INFO] [logging.py:68:log_dist] [Rank 0] step=1980, skipped=4, lr=[6.7222222222222235e-06], mom=[[0.9, 0.999]]
+[2022-12-18 18:30:57,745] [INFO] [timer.py:196:stop] epoch=0/micro_step=1980/global_step=1980, RunningAvgSamplesPerSec=17.591780103374816, CurrSamplesPerSec=17.696074726111295, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 18:33:47,038] [INFO] [logging.py:68:log_dist] [Rank 0] step=1990, skipped=4, lr=[6.700000000000001e-06], mom=[[0.9, 0.999]]
+[2022-12-18 18:33:47,039] [INFO] [timer.py:196:stop] epoch=0/micro_step=1990/global_step=1990, RunningAvgSamplesPerSec=17.591581591137096, CurrSamplesPerSec=17.78166561596026, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+[2022-12-18 18:36:39,443] [INFO] [logging.py:68:log_dist] [Rank 0] step=2000, skipped=4, lr=[6.677777777777779e-06], mom=[[0.9, 0.999]]
+[2022-12-18 18:36:39,444] [INFO] [timer.py:196:stop] epoch=0/micro_step=2000/global_step=2000, RunningAvgSamplesPerSec=17.589825563330376, CurrSamplesPerSec=16.981787098279423, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+{'loss': 0.0006, 'learning_rate': 6.677777777777779e-06, 'epoch': 17.01}
+{'eval_loss': 0.30908203125, 'eval_wer': 17.69160002830656, 'eval_runtime': 1241.2535, 'eval_samples_per_second': 3.109, 'eval_steps_per_second': 0.097, 'epoch': 17.01}
+[2022-12-18 18:57:21,750] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step2000 is begin to save!
+[2022-12-18 18:57:21,758] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: ./checkpoint-2000/global_step2000/mp_rank_00_model_states.pt
+[2022-12-18 18:57:21,758] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-2000/global_step2000/mp_rank_00_model_states.pt...
+[2022-12-18 18:57:22,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-2000/global_step2000/mp_rank_00_model_states.pt.
+[2022-12-18 18:57:22,823] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_optim_states.pt...
+[2022-12-18 18:57:27,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
+[2022-12-18 18:57:27,460] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_optim_states.pt
+[2022-12-18 18:57:27,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now!
runs/Dec18_08-41-04_fe2747a042f0/events.out.tfevents.1671381730.fe2747a042f0.46148.0
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:28a685a481a4b67604e2aaac9b6de6dfd406c1f2afe4ce0b196060cc15a36104
+size 17457