mikr committed on
Commit 30463f7 · 1 Parent(s): 40f1bc6

Training in progress, step 2000

pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:154295da2b680e283731469d66fb3552823d07524d02e1453e1606abef5b5318
+ oid sha256:ac8d9fd9f9d28b93847de52a3619b51013265161572e174b1835ef9602818730
  size 483536061
run.log CHANGED
@@ -675,3 +675,252 @@ Rank: 0 partition count [1] and sizes[(241734912, False)]
675
  [2022-12-18 13:48:08,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-1000/global_step1000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
676
  [2022-12-18 13:48:08,208] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-1000/global_step1000/zero_pp_rank_0_mp_rank_00_optim_states.pt
677
  [2022-12-18 13:48:08,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now!
678
+ [2022-12-18 13:51:32,738] [INFO] [logging.py:68:log_dist] [Rank 0] step=1010, skipped=4, lr=[8.877777777777779e-06], mom=[[0.9, 0.999]]
679
+ [2022-12-18 13:51:32,740] [INFO] [timer.py:196:stop] epoch=0/micro_step=1010/global_step=1010, RunningAvgSamplesPerSec=17.570497186974222, CurrSamplesPerSec=17.771050915545032, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
680
+ [2022-12-18 13:54:24,427] [INFO] [logging.py:68:log_dist] [Rank 0] step=1020, skipped=4, lr=[8.855555555555556e-06], mom=[[0.9, 0.999]]
681
+ [2022-12-18 13:54:24,429] [INFO] [timer.py:196:stop] epoch=0/micro_step=1020/global_step=1020, RunningAvgSamplesPerSec=17.572388134802974, CurrSamplesPerSec=17.892674397538443, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
682
+ {'loss': 0.0037, 'learning_rate': 8.844444444444445e-06, 'epoch': 8.02}
683
+ [2022-12-18 13:57:13,310] [INFO] [logging.py:68:log_dist] [Rank 0] step=1030, skipped=4, lr=[8.833333333333334e-06], mom=[[0.9, 0.999]]
684
+ [2022-12-18 13:57:13,312] [INFO] [timer.py:196:stop] epoch=0/micro_step=1030/global_step=1030, RunningAvgSamplesPerSec=17.574865744089255, CurrSamplesPerSec=17.830042096987665, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
685
+ [2022-12-18 13:59:11,204] [INFO] [logging.py:68:log_dist] [Rank 0] step=1040, skipped=4, lr=[8.811111111111112e-06], mom=[[0.9, 0.999]]
686
+ [2022-12-18 13:59:11,205] [INFO] [timer.py:196:stop] epoch=0/micro_step=1040/global_step=1040, RunningAvgSamplesPerSec=17.578049897136907, CurrSamplesPerSec=17.867588966876575, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
687
+ [2022-12-18 14:02:49,022] [INFO] [logging.py:68:log_dist] [Rank 0] step=1050, skipped=4, lr=[8.788888888888891e-06], mom=[[0.9, 0.999]]
688
+ [2022-12-18 14:02:49,024] [INFO] [timer.py:196:stop] epoch=0/micro_step=1050/global_step=1050, RunningAvgSamplesPerSec=17.580728468924875, CurrSamplesPerSec=16.936598922699496, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
689
+ {'loss': 0.0035, 'learning_rate': 8.788888888888891e-06, 'epoch': 9.0}
690
+ [2022-12-18 14:05:33,620] [INFO] [logging.py:68:log_dist] [Rank 0] step=1060, skipped=4, lr=[8.766666666666669e-06], mom=[[0.9, 0.999]]
691
+ [2022-12-18 14:05:33,622] [INFO] [timer.py:196:stop] epoch=0/micro_step=1060/global_step=1060, RunningAvgSamplesPerSec=17.58019104202516, CurrSamplesPerSec=17.78021928381013, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
692
+ [2022-12-18 14:08:19,806] [INFO] [logging.py:68:log_dist] [Rank 0] step=1070, skipped=4, lr=[8.744444444444446e-06], mom=[[0.9, 0.999]]
693
+ [2022-12-18 14:08:19,808] [INFO] [timer.py:196:stop] epoch=0/micro_step=1070/global_step=1070, RunningAvgSamplesPerSec=17.58085065511869, CurrSamplesPerSec=17.769715705552866, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
694
+ {'loss': 0.0038, 'learning_rate': 8.733333333333333e-06, 'epoch': 9.01}
695
+ [2022-12-18 14:11:06,335] [INFO] [logging.py:68:log_dist] [Rank 0] step=1080, skipped=4, lr=[8.722222222222224e-06], mom=[[0.9, 0.999]]
696
+ [2022-12-18 14:11:06,337] [INFO] [timer.py:196:stop] epoch=0/micro_step=1080/global_step=1080, RunningAvgSamplesPerSec=17.579904119591816, CurrSamplesPerSec=17.725968554296802, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
697
+ [2022-12-18 14:13:51,062] [INFO] [logging.py:68:log_dist] [Rank 0] step=1090, skipped=4, lr=[8.700000000000001e-06], mom=[[0.9, 0.999]]
698
+ [2022-12-18 14:13:51,064] [INFO] [timer.py:196:stop] epoch=0/micro_step=1090/global_step=1090, RunningAvgSamplesPerSec=17.57849146162872, CurrSamplesPerSec=17.506514413151734, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
699
+ [2022-12-18 14:16:41,006] [INFO] [logging.py:68:log_dist] [Rank 0] step=1100, skipped=4, lr=[8.677777777777779e-06], mom=[[0.9, 0.999]]
700
+ [2022-12-18 14:16:41,008] [INFO] [timer.py:196:stop] epoch=0/micro_step=1100/global_step=1100, RunningAvgSamplesPerSec=17.577930456746696, CurrSamplesPerSec=17.35344135867515, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
701
+ {'loss': 0.003, 'learning_rate': 8.677777777777779e-06, 'epoch': 9.01}
702
+ [2022-12-18 14:19:28,216] [INFO] [logging.py:68:log_dist] [Rank 0] step=1110, skipped=4, lr=[8.655555555555557e-06], mom=[[0.9, 0.999]]
703
+ [2022-12-18 14:19:28,217] [INFO] [timer.py:196:stop] epoch=0/micro_step=1110/global_step=1110, RunningAvgSamplesPerSec=17.57769973626041, CurrSamplesPerSec=17.8288981283458, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
704
+ [2022-12-18 14:22:17,888] [INFO] [logging.py:68:log_dist] [Rank 0] step=1120, skipped=4, lr=[8.633333333333334e-06], mom=[[0.9, 0.999]]
705
+ [2022-12-18 14:22:17,890] [INFO] [timer.py:196:stop] epoch=0/micro_step=1120/global_step=1120, RunningAvgSamplesPerSec=17.578493365622567, CurrSamplesPerSec=17.76609576686019, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
706
+ {'loss': 0.0037, 'learning_rate': 8.622222222222223e-06, 'epoch': 9.02}
707
+ [2022-12-18 14:25:16,389] [INFO] [logging.py:68:log_dist] [Rank 0] step=1130, skipped=4, lr=[8.611111111111112e-06], mom=[[0.9, 0.999]]
708
+ [2022-12-18 14:25:16,391] [INFO] [timer.py:196:stop] epoch=0/micro_step=1130/global_step=1130, RunningAvgSamplesPerSec=17.57871703334657, CurrSamplesPerSec=17.785122219131072, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
709
+ [2022-12-18 14:28:10,876] [INFO] [logging.py:68:log_dist] [Rank 0] step=1140, skipped=4, lr=[8.58888888888889e-06], mom=[[0.9, 0.999]]
710
+ [2022-12-18 14:28:10,877] [INFO] [timer.py:196:stop] epoch=0/micro_step=1140/global_step=1140, RunningAvgSamplesPerSec=17.579459740076313, CurrSamplesPerSec=17.701639889108993, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
711
+ [2022-12-18 14:31:05,041] [INFO] [logging.py:68:log_dist] [Rank 0] step=1150, skipped=4, lr=[8.566666666666667e-06], mom=[[0.9, 0.999]]
712
+ [2022-12-18 14:31:05,043] [INFO] [timer.py:196:stop] epoch=0/micro_step=1150/global_step=1150, RunningAvgSamplesPerSec=17.580003764189616, CurrSamplesPerSec=17.80629521962796, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
713
+ {'loss': 0.0036, 'learning_rate': 8.566666666666667e-06, 'epoch': 9.02}
714
+ [2022-12-18 14:32:07,420] [INFO] [logging.py:68:log_dist] [Rank 0] step=1160, skipped=4, lr=[8.544444444444445e-06], mom=[[0.9, 0.999]]
715
+ [2022-12-18 14:32:07,422] [INFO] [timer.py:196:stop] epoch=0/micro_step=1160/global_step=1160, RunningAvgSamplesPerSec=17.585442663890536, CurrSamplesPerSec=23.566298323625077, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
716
+ [2022-12-18 14:36:43,234] [INFO] [logging.py:68:log_dist] [Rank 0] step=1170, skipped=4, lr=[8.522222222222222e-06], mom=[[0.9, 0.999]]
717
+ [2022-12-18 14:36:43,236] [INFO] [timer.py:196:stop] epoch=0/micro_step=1170/global_step=1170, RunningAvgSamplesPerSec=17.586206519395216, CurrSamplesPerSec=17.676859898043542, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
718
+ {'loss': 0.0034, 'learning_rate': 8.511111111111113e-06, 'epoch': 10.0}
719
+ [2022-12-18 14:39:41,206] [INFO] [logging.py:68:log_dist] [Rank 0] step=1180, skipped=4, lr=[8.5e-06], mom=[[0.9, 0.999]]
720
+ [2022-12-18 14:39:41,207] [INFO] [timer.py:196:stop] epoch=0/micro_step=1180/global_step=1180, RunningAvgSamplesPerSec=17.58715483416817, CurrSamplesPerSec=17.49722916739182, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
721
+ [2022-12-18 14:42:39,474] [INFO] [logging.py:68:log_dist] [Rank 0] step=1190, skipped=4, lr=[8.477777777777778e-06], mom=[[0.9, 0.999]]
722
+ [2022-12-18 14:42:39,476] [INFO] [timer.py:196:stop] epoch=0/micro_step=1190/global_step=1190, RunningAvgSamplesPerSec=17.587668127524644, CurrSamplesPerSec=17.847793588674662, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
723
+ [2022-12-18 14:45:31,663] [INFO] [logging.py:68:log_dist] [Rank 0] step=1200, skipped=4, lr=[8.455555555555555e-06], mom=[[0.9, 0.999]]
724
+ [2022-12-18 14:45:31,664] [INFO] [timer.py:196:stop] epoch=0/micro_step=1200/global_step=1200, RunningAvgSamplesPerSec=17.588774465446917, CurrSamplesPerSec=17.72776783125711, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
725
+ {'loss': 0.0032, 'learning_rate': 8.455555555555555e-06, 'epoch': 10.01}
726
+ [2022-12-18 14:48:31,335] [INFO] [logging.py:68:log_dist] [Rank 0] step=1210, skipped=4, lr=[8.433333333333334e-06], mom=[[0.9, 0.999]]
727
+ [2022-12-18 14:48:31,336] [INFO] [timer.py:196:stop] epoch=0/micro_step=1210/global_step=1210, RunningAvgSamplesPerSec=17.59026563335073, CurrSamplesPerSec=17.859623009470603, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
728
+ [2022-12-18 14:51:22,847] [INFO] [logging.py:68:log_dist] [Rank 0] step=1220, skipped=4, lr=[8.411111111111112e-06], mom=[[0.9, 0.999]]
729
+ [2022-12-18 14:51:22,848] [INFO] [timer.py:196:stop] epoch=0/micro_step=1220/global_step=1220, RunningAvgSamplesPerSec=17.591299866434216, CurrSamplesPerSec=17.783575181860076, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
730
+ {'loss': 0.0028, 'learning_rate': 8.400000000000001e-06, 'epoch': 10.01}
731
+ [2022-12-18 14:54:19,800] [INFO] [logging.py:68:log_dist] [Rank 0] step=1230, skipped=4, lr=[8.38888888888889e-06], mom=[[0.9, 0.999]]
732
+ [2022-12-18 14:54:19,802] [INFO] [timer.py:196:stop] epoch=0/micro_step=1230/global_step=1230, RunningAvgSamplesPerSec=17.592045413292322, CurrSamplesPerSec=17.752746037986785, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
733
+ [2022-12-18 14:57:16,096] [INFO] [logging.py:68:log_dist] [Rank 0] step=1240, skipped=4, lr=[8.366666666666667e-06], mom=[[0.9, 0.999]]
734
+ [2022-12-18 14:57:16,098] [INFO] [timer.py:196:stop] epoch=0/micro_step=1240/global_step=1240, RunningAvgSamplesPerSec=17.592574591618792, CurrSamplesPerSec=17.73023145463939, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
735
+ [2022-12-18 15:00:08,985] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=4, lr=[8.344444444444445e-06], mom=[[0.9, 0.999]]
736
+ [2022-12-18 15:00:08,987] [INFO] [timer.py:196:stop] epoch=0/micro_step=1250/global_step=1250, RunningAvgSamplesPerSec=17.592765978430887, CurrSamplesPerSec=17.581207935360524, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
737
+ {'loss': 0.0026, 'learning_rate': 8.344444444444445e-06, 'epoch': 10.02}
738
+ [2022-12-18 15:02:58,611] [INFO] [logging.py:68:log_dist] [Rank 0] step=1260, skipped=4, lr=[8.322222222222223e-06], mom=[[0.9, 0.999]]
739
+ [2022-12-18 15:02:58,612] [INFO] [timer.py:196:stop] epoch=0/micro_step=1260/global_step=1260, RunningAvgSamplesPerSec=17.591578207832033, CurrSamplesPerSec=17.7094327944215, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
740
+ [2022-12-18 15:05:18,634] [INFO] [logging.py:68:log_dist] [Rank 0] step=1270, skipped=4, lr=[8.3e-06], mom=[[0.9, 0.999]]
741
+ [2022-12-18 15:05:18,636] [INFO] [timer.py:196:stop] epoch=0/micro_step=1270/global_step=1270, RunningAvgSamplesPerSec=17.59139456418053, CurrSamplesPerSec=17.86228150125289, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
742
+ {'loss': 0.0022, 'learning_rate': 8.288888888888889e-06, 'epoch': 10.02}
743
+ [2022-12-18 15:08:24,605] [INFO] [logging.py:68:log_dist] [Rank 0] step=1280, skipped=4, lr=[8.277777777777778e-06], mom=[[0.9, 0.999]]
744
+ [2022-12-18 15:08:24,607] [INFO] [timer.py:196:stop] epoch=0/micro_step=1280/global_step=1280, RunningAvgSamplesPerSec=17.595103033188895, CurrSamplesPerSec=17.542366899024405, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
745
+ [2022-12-18 15:11:11,263] [INFO] [logging.py:68:log_dist] [Rank 0] step=1290, skipped=4, lr=[8.255555555555557e-06], mom=[[0.9, 0.999]]
746
+ [2022-12-18 15:11:11,265] [INFO] [timer.py:196:stop] epoch=0/micro_step=1290/global_step=1290, RunningAvgSamplesPerSec=17.5945939842215, CurrSamplesPerSec=17.651455059768324, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
747
+ [2022-12-18 15:13:59,569] [INFO] [logging.py:68:log_dist] [Rank 0] step=1300, skipped=4, lr=[8.233333333333335e-06], mom=[[0.9, 0.999]]
748
+ [2022-12-18 15:13:59,571] [INFO] [timer.py:196:stop] epoch=0/micro_step=1300/global_step=1300, RunningAvgSamplesPerSec=17.593519986136247, CurrSamplesPerSec=17.543569555967117, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
749
+ {'loss': 0.002, 'learning_rate': 8.233333333333335e-06, 'epoch': 11.0}
750
+ [2022-12-18 15:16:46,768] [INFO] [logging.py:68:log_dist] [Rank 0] step=1310, skipped=4, lr=[8.211111111111112e-06], mom=[[0.9, 0.999]]
751
+ [2022-12-18 15:16:46,770] [INFO] [timer.py:196:stop] epoch=0/micro_step=1310/global_step=1310, RunningAvgSamplesPerSec=17.593159287186108, CurrSamplesPerSec=17.657518333371485, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
752
+ [2022-12-18 15:19:35,218] [INFO] [logging.py:68:log_dist] [Rank 0] step=1320, skipped=4, lr=[8.18888888888889e-06], mom=[[0.9, 0.999]]
753
+ [2022-12-18 15:19:35,220] [INFO] [timer.py:196:stop] epoch=0/micro_step=1320/global_step=1320, RunningAvgSamplesPerSec=17.592123863305886, CurrSamplesPerSec=17.2624771803989, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
754
+ {'loss': 0.0019, 'learning_rate': 8.177777777777779e-06, 'epoch': 11.01}
755
+ [2022-12-18 15:22:23,739] [INFO] [logging.py:68:log_dist] [Rank 0] step=1330, skipped=4, lr=[8.166666666666668e-06], mom=[[0.9, 0.999]]
756
+ [2022-12-18 15:22:23,741] [INFO] [timer.py:196:stop] epoch=0/micro_step=1330/global_step=1330, RunningAvgSamplesPerSec=17.59144315331231, CurrSamplesPerSec=17.624369892176606, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
757
+ [2022-12-18 15:25:12,977] [INFO] [logging.py:68:log_dist] [Rank 0] step=1340, skipped=4, lr=[8.144444444444445e-06], mom=[[0.9, 0.999]]
758
+ [2022-12-18 15:25:12,979] [INFO] [timer.py:196:stop] epoch=0/micro_step=1340/global_step=1340, RunningAvgSamplesPerSec=17.591467485388325, CurrSamplesPerSec=17.800980423298352, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
759
+ [2022-12-18 15:27:59,991] [INFO] [logging.py:68:log_dist] [Rank 0] step=1350, skipped=4, lr=[8.122222222222223e-06], mom=[[0.9, 0.999]]
760
+ [2022-12-18 15:27:59,993] [INFO] [timer.py:196:stop] epoch=0/micro_step=1350/global_step=1350, RunningAvgSamplesPerSec=17.58933135085906, CurrSamplesPerSec=17.65995201259033, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
761
+ {'loss': 0.0015, 'learning_rate': 8.122222222222223e-06, 'epoch': 11.01}
762
+ [2022-12-18 15:30:49,368] [INFO] [logging.py:68:log_dist] [Rank 0] step=1360, skipped=4, lr=[8.1e-06], mom=[[0.9, 0.999]]
763
+ [2022-12-18 15:30:49,369] [INFO] [timer.py:196:stop] epoch=0/micro_step=1360/global_step=1360, RunningAvgSamplesPerSec=17.589196844316973, CurrSamplesPerSec=17.74898631234037, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
764
+ [2022-12-18 15:33:39,600] [INFO] [logging.py:68:log_dist] [Rank 0] step=1370, skipped=4, lr=[8.077777777777778e-06], mom=[[0.9, 0.999]]
765
+ [2022-12-18 15:33:39,602] [INFO] [timer.py:196:stop] epoch=0/micro_step=1370/global_step=1370, RunningAvgSamplesPerSec=17.588471414233023, CurrSamplesPerSec=17.547551309022158, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
766
+ {'loss': 0.0013, 'learning_rate': 8.066666666666667e-06, 'epoch': 11.02}
767
+ [2022-12-18 15:36:28,193] [INFO] [logging.py:68:log_dist] [Rank 0] step=1380, skipped=4, lr=[8.055555555555557e-06], mom=[[0.9, 0.999]]
768
+ [2022-12-18 15:36:28,195] [INFO] [timer.py:196:stop] epoch=0/micro_step=1380/global_step=1380, RunningAvgSamplesPerSec=17.588757839650413, CurrSamplesPerSec=17.470279456311328, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
769
+ [2022-12-18 15:37:56,456] [INFO] [logging.py:68:log_dist] [Rank 0] step=1390, skipped=4, lr=[8.033333333333335e-06], mom=[[0.9, 0.999]]
770
+ [2022-12-18 15:37:56,457] [INFO] [timer.py:196:stop] epoch=0/micro_step=1390/global_step=1390, RunningAvgSamplesPerSec=17.589444637240174, CurrSamplesPerSec=17.796452184981753, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
771
+ [2022-12-18 15:41:58,134] [INFO] [logging.py:68:log_dist] [Rank 0] step=1400, skipped=4, lr=[8.011111111111113e-06], mom=[[0.9, 0.999]]
772
+ [2022-12-18 15:41:58,136] [INFO] [timer.py:196:stop] epoch=0/micro_step=1400/global_step=1400, RunningAvgSamplesPerSec=17.59123118643266, CurrSamplesPerSec=17.351121701395353, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
773
+ {'loss': 0.0011, 'learning_rate': 8.011111111111113e-06, 'epoch': 12.0}
774
+ [2022-12-18 15:44:46,847] [INFO] [logging.py:68:log_dist] [Rank 0] step=1410, skipped=4, lr=[7.98888888888889e-06], mom=[[0.9, 0.999]]
775
+ [2022-12-18 15:44:46,849] [INFO] [timer.py:196:stop] epoch=0/micro_step=1410/global_step=1410, RunningAvgSamplesPerSec=17.589641567546884, CurrSamplesPerSec=16.81193579966427, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
776
+ [2022-12-18 15:47:37,871] [INFO] [logging.py:68:log_dist] [Rank 0] step=1420, skipped=4, lr=[7.966666666666668e-06], mom=[[0.9, 0.999]]
777
+ [2022-12-18 15:47:37,873] [INFO] [timer.py:196:stop] epoch=0/micro_step=1420/global_step=1420, RunningAvgSamplesPerSec=17.588067850747443, CurrSamplesPerSec=17.428626152497475, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
778
+ {'loss': 0.0012, 'learning_rate': 7.955555555555557e-06, 'epoch': 12.01}
779
+ [2022-12-18 15:50:26,270] [INFO] [logging.py:68:log_dist] [Rank 0] step=1430, skipped=4, lr=[7.944444444444445e-06], mom=[[0.9, 0.999]]
780
+ [2022-12-18 15:50:26,271] [INFO] [timer.py:196:stop] epoch=0/micro_step=1430/global_step=1430, RunningAvgSamplesPerSec=17.58741915847445, CurrSamplesPerSec=17.76162991584455, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
781
+ [2022-12-18 15:53:13,708] [INFO] [logging.py:68:log_dist] [Rank 0] step=1440, skipped=4, lr=[7.922222222222223e-06], mom=[[0.9, 0.999]]
782
+ [2022-12-18 15:53:13,710] [INFO] [timer.py:196:stop] epoch=0/micro_step=1440/global_step=1440, RunningAvgSamplesPerSec=17.585312149367923, CurrSamplesPerSec=17.475912824942153, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
783
+ [2022-12-18 15:56:04,224] [INFO] [logging.py:68:log_dist] [Rank 0] step=1450, skipped=4, lr=[7.9e-06], mom=[[0.9, 0.999]]
784
+ [2022-12-18 15:56:04,226] [INFO] [timer.py:196:stop] epoch=0/micro_step=1450/global_step=1450, RunningAvgSamplesPerSec=17.584736964932496, CurrSamplesPerSec=17.084766353718567, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
785
+ {'loss': 0.0011, 'learning_rate': 7.9e-06, 'epoch': 12.01}
786
+ [2022-12-18 15:58:51,836] [INFO] [logging.py:68:log_dist] [Rank 0] step=1460, skipped=4, lr=[7.877777777777778e-06], mom=[[0.9, 0.999]]
787
+ [2022-12-18 15:58:51,838] [INFO] [timer.py:196:stop] epoch=0/micro_step=1460/global_step=1460, RunningAvgSamplesPerSec=17.58396256582276, CurrSamplesPerSec=17.408034199285936, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
788
+ [2022-12-18 16:01:39,733] [INFO] [logging.py:68:log_dist] [Rank 0] step=1470, skipped=4, lr=[7.855555555555556e-06], mom=[[0.9, 0.999]]
789
+ [2022-12-18 16:01:39,735] [INFO] [timer.py:196:stop] epoch=0/micro_step=1470/global_step=1470, RunningAvgSamplesPerSec=17.584820238828023, CurrSamplesPerSec=17.65451172025162, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
790
+ {'loss': 0.0012, 'learning_rate': 7.844444444444446e-06, 'epoch': 12.02}
791
+ [2022-12-18 16:04:29,437] [INFO] [logging.py:68:log_dist] [Rank 0] step=1480, skipped=4, lr=[7.833333333333333e-06], mom=[[0.9, 0.999]]
792
+ [2022-12-18 16:04:29,438] [INFO] [timer.py:196:stop] epoch=0/micro_step=1480/global_step=1480, RunningAvgSamplesPerSec=17.58523223228052, CurrSamplesPerSec=17.692160554263566, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
793
+ [2022-12-18 16:07:19,577] [INFO] [logging.py:68:log_dist] [Rank 0] step=1490, skipped=4, lr=[7.811111111111111e-06], mom=[[0.9, 0.999]]
794
+ [2022-12-18 16:07:19,579] [INFO] [timer.py:196:stop] epoch=0/micro_step=1490/global_step=1490, RunningAvgSamplesPerSec=17.585349491276787, CurrSamplesPerSec=17.528617434648247, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
795
+ [2022-12-18 16:10:04,717] [INFO] [logging.py:68:log_dist] [Rank 0] step=1500, skipped=4, lr=[7.788888888888889e-06], mom=[[0.9, 0.999]]
796
+ [2022-12-18 16:10:04,720] [INFO] [timer.py:196:stop] epoch=0/micro_step=1500/global_step=1500, RunningAvgSamplesPerSec=17.58553346224858, CurrSamplesPerSec=17.30400841724948, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
797
+ {'loss': 0.001, 'learning_rate': 7.788888888888889e-06, 'epoch': 12.02}
798
+ [2022-12-18 16:12:51,306] [INFO] [logging.py:68:log_dist] [Rank 0] step=1510, skipped=4, lr=[7.766666666666666e-06], mom=[[0.9, 0.999]]
799
+ [2022-12-18 16:12:51,307] [INFO] [timer.py:196:stop] epoch=0/micro_step=1510/global_step=1510, RunningAvgSamplesPerSec=17.58739109597355, CurrSamplesPerSec=17.62212763286849, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
800
+ [2022-12-18 16:15:39,256] [INFO] [logging.py:68:log_dist] [Rank 0] step=1520, skipped=4, lr=[7.744444444444446e-06], mom=[[0.9, 0.999]]
801
+ [2022-12-18 16:15:39,258] [INFO] [timer.py:196:stop] epoch=0/micro_step=1520/global_step=1520, RunningAvgSamplesPerSec=17.58624163007446, CurrSamplesPerSec=17.223591201370315, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
802
+ {'loss': 0.0009, 'learning_rate': 7.733333333333334e-06, 'epoch': 13.0}
803
+ [2022-12-18 16:18:30,572] [INFO] [logging.py:68:log_dist] [Rank 0] step=1530, skipped=4, lr=[7.722222222222223e-06], mom=[[0.9, 0.999]]
804
+ [2022-12-18 16:18:30,573] [INFO] [timer.py:196:stop] epoch=0/micro_step=1530/global_step=1530, RunningAvgSamplesPerSec=17.584710397614575, CurrSamplesPerSec=17.58900462856031, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
805
+ [2022-12-18 16:21:29,403] [INFO] [logging.py:68:log_dist] [Rank 0] step=1540, skipped=4, lr=[7.7e-06], mom=[[0.9, 0.999]]
806
+ [2022-12-18 16:21:29,405] [INFO] [timer.py:196:stop] epoch=0/micro_step=1540/global_step=1540, RunningAvgSamplesPerSec=17.584223073868095, CurrSamplesPerSec=17.580700146010567, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
807
+ [2022-12-18 16:24:23,678] [INFO] [logging.py:68:log_dist] [Rank 0] step=1550, skipped=4, lr=[7.677777777777778e-06], mom=[[0.9, 0.999]]
808
+ [2022-12-18 16:24:23,680] [INFO] [timer.py:196:stop] epoch=0/micro_step=1550/global_step=1550, RunningAvgSamplesPerSec=17.58439155673778, CurrSamplesPerSec=17.668182700157885, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
809
+ {'loss': 0.0009, 'learning_rate': 7.677777777777778e-06, 'epoch': 13.01}
810
+ [2022-12-18 16:27:18,302] [INFO] [logging.py:68:log_dist] [Rank 0] step=1560, skipped=4, lr=[7.655555555555556e-06], mom=[[0.9, 0.999]]
811
+ [2022-12-18 16:27:18,304] [INFO] [timer.py:196:stop] epoch=0/micro_step=1560/global_step=1560, RunningAvgSamplesPerSec=17.584671210437033, CurrSamplesPerSec=17.72357047648451, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
812
+ [2022-12-18 16:30:12,136] [INFO] [logging.py:68:log_dist] [Rank 0] step=1570, skipped=4, lr=[7.633333333333334e-06], mom=[[0.9, 0.999]]
813
+ [2022-12-18 16:30:12,138] [INFO] [timer.py:196:stop] epoch=0/micro_step=1570/global_step=1570, RunningAvgSamplesPerSec=17.584434792027235, CurrSamplesPerSec=17.67633958453577, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
814
+ {'loss': 0.0012, 'learning_rate': 7.622222222222223e-06, 'epoch': 13.01}
815
+ [2022-12-18 16:33:09,808] [INFO] [logging.py:68:log_dist] [Rank 0] step=1580, skipped=4, lr=[7.611111111111111e-06], mom=[[0.9, 0.999]]
816
+ [2022-12-18 16:33:09,810] [INFO] [timer.py:196:stop] epoch=0/micro_step=1580/global_step=1580, RunningAvgSamplesPerSec=17.584395220473127, CurrSamplesPerSec=17.75029258656175, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
817
+ [2022-12-18 16:36:05,926] [INFO] [logging.py:68:log_dist] [Rank 0] step=1590, skipped=4, lr=[7.588888888888889e-06], mom=[[0.9, 0.999]]
818
+ [2022-12-18 16:36:05,928] [INFO] [timer.py:196:stop] epoch=0/micro_step=1590/global_step=1590, RunningAvgSamplesPerSec=17.58474933416415, CurrSamplesPerSec=17.61421024415819, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
819
+ [2022-12-18 16:39:02,292] [INFO] [logging.py:68:log_dist] [Rank 0] step=1600, skipped=4, lr=[7.566666666666667e-06], mom=[[0.9, 0.999]]
820
+ [2022-12-18 16:39:02,294] [INFO] [timer.py:196:stop] epoch=0/micro_step=1600/global_step=1600, RunningAvgSamplesPerSec=17.584912959793066, CurrSamplesPerSec=17.653429636601118, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
821
+ {'loss': 0.0009, 'learning_rate': 7.566666666666667e-06, 'epoch': 13.02}
822
+ [2022-12-18 16:42:00,495] [INFO] [logging.py:68:log_dist] [Rank 0] step=1610, skipped=4, lr=[7.544444444444445e-06], mom=[[0.9, 0.999]]
823
+ [2022-12-18 16:42:00,497] [INFO] [timer.py:196:stop] epoch=0/micro_step=1610/global_step=1610, RunningAvgSamplesPerSec=17.58463914064396, CurrSamplesPerSec=17.55525847226904, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
824
+ [2022-12-18 16:44:04,575] [INFO] [logging.py:68:log_dist] [Rank 0] step=1620, skipped=4, lr=[7.5222222222222226e-06], mom=[[0.9, 0.999]]
825
+ [2022-12-18 16:44:04,578] [INFO] [timer.py:196:stop] epoch=0/micro_step=1620/global_step=1620, RunningAvgSamplesPerSec=17.585002091371685, CurrSamplesPerSec=17.664836496154194, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
826
+ {'loss': 0.0008, 'learning_rate': 7.511111111111111e-06, 'epoch': 14.0}
827
+ [2022-12-18 16:47:55,627] [INFO] [logging.py:68:log_dist] [Rank 0] step=1630, skipped=4, lr=[7.500000000000001e-06], mom=[[0.9, 0.999]]
828
+ [2022-12-18 16:47:55,629] [INFO] [timer.py:196:stop] epoch=0/micro_step=1630/global_step=1630, RunningAvgSamplesPerSec=17.587178112703374, CurrSamplesPerSec=17.63657926828507, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
829
+ [2022-12-18 16:50:51,334] [INFO] [logging.py:68:log_dist] [Rank 0] step=1640, skipped=4, lr=[7.477777777777779e-06], mom=[[0.9, 0.999]]
830
+ [2022-12-18 16:50:51,337] [INFO] [timer.py:196:stop] epoch=0/micro_step=1640/global_step=1640, RunningAvgSamplesPerSec=17.58712243334102, CurrSamplesPerSec=17.347132175313174, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
831
+ [2022-12-18 16:53:53,203] [INFO] [logging.py:68:log_dist] [Rank 0] step=1650, skipped=4, lr=[7.455555555555556e-06], mom=[[0.9, 0.999]]
832
+ [2022-12-18 16:53:53,205] [INFO] [timer.py:196:stop] epoch=0/micro_step=1650/global_step=1650, RunningAvgSamplesPerSec=17.587439830952103, CurrSamplesPerSec=17.622363633296064, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
833
+ {'loss': 0.0008, 'learning_rate': 7.455555555555556e-06, 'epoch': 14.01}
834
+ [2022-12-18 16:56:55,071] [INFO] [logging.py:68:log_dist] [Rank 0] step=1660, skipped=4, lr=[7.433333333333334e-06], mom=[[0.9, 0.999]]
835
+ [2022-12-18 16:56:55,073] [INFO] [timer.py:196:stop] epoch=0/micro_step=1660/global_step=1660, RunningAvgSamplesPerSec=17.58774677231902, CurrSamplesPerSec=17.723122298678486, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
836
+ [2022-12-18 16:59:53,412] [INFO] [logging.py:68:log_dist] [Rank 0] step=1670, skipped=4, lr=[7.411111111111112e-06], mom=[[0.9, 0.999]]
837
+ [2022-12-18 16:59:53,415] [INFO] [timer.py:196:stop] epoch=0/micro_step=1670/global_step=1670, RunningAvgSamplesPerSec=17.586889141711087, CurrSamplesPerSec=17.5252632408432, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
838
+ {'loss': 0.0008, 'learning_rate': 7.4e-06, 'epoch': 14.01}
839
+ [2022-12-18 17:02:55,081] [INFO] [logging.py:68:log_dist] [Rank 0] step=1680, skipped=4, lr=[7.38888888888889e-06], mom=[[0.9, 0.999]]
840
+ [2022-12-18 17:02:55,082] [INFO] [timer.py:196:stop] epoch=0/micro_step=1680/global_step=1680, RunningAvgSamplesPerSec=17.58688935579313, CurrSamplesPerSec=17.783240595339233, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
841
+ [2022-12-18 17:05:57,804] [INFO] [logging.py:68:log_dist] [Rank 0] step=1690, skipped=4, lr=[7.3666666666666676e-06], mom=[[0.9, 0.999]]
842
+ [2022-12-18 17:05:57,806] [INFO] [timer.py:196:stop] epoch=0/micro_step=1690/global_step=1690, RunningAvgSamplesPerSec=17.587471464153204, CurrSamplesPerSec=17.783433805737804, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
843
+ [2022-12-18 17:08:55,981] [INFO] [logging.py:68:log_dist] [Rank 0] step=1700, skipped=4, lr=[7.344444444444445e-06], mom=[[0.9, 0.999]]
844
+ [2022-12-18 17:08:55,982] [INFO] [timer.py:196:stop] epoch=0/micro_step=1700/global_step=1700, RunningAvgSamplesPerSec=17.587408570215107, CurrSamplesPerSec=17.451048991964477, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
845
+ {'loss': 0.0008, 'learning_rate': 7.344444444444445e-06, 'epoch': 14.02}
846
+ [2022-12-18 17:11:56,277] [INFO] [logging.py:68:log_dist] [Rank 0] step=1710, skipped=4, lr=[7.322222222222223e-06], mom=[[0.9, 0.999]]
847
+ [2022-12-18 17:11:56,280] [INFO] [timer.py:196:stop] epoch=0/micro_step=1710/global_step=1710, RunningAvgSamplesPerSec=17.5876806833596, CurrSamplesPerSec=17.729723217717872, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
848
+ [2022-12-18 17:14:57,436] [INFO] [logging.py:68:log_dist] [Rank 0] step=1720, skipped=4, lr=[7.3e-06], mom=[[0.9, 0.999]]
849
+ [2022-12-18 17:14:57,438] [INFO] [timer.py:196:stop] epoch=0/micro_step=1720/global_step=1720, RunningAvgSamplesPerSec=17.5872673472387, CurrSamplesPerSec=17.7992642130134, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
850
+ {'loss': 0.0007, 'learning_rate': 7.28888888888889e-06, 'epoch': 14.02}
851
+ [2022-12-18 17:17:53,606] [INFO] [logging.py:68:log_dist] [Rank 0] step=1730, skipped=4, lr=[7.277777777777778e-06], mom=[[0.9, 0.999]]
852
+ [2022-12-18 17:17:53,608] [INFO] [timer.py:196:stop] epoch=0/micro_step=1730/global_step=1730, RunningAvgSamplesPerSec=17.586756975707022, CurrSamplesPerSec=17.563738888918333, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
853
+ [2022-12-18 17:18:56,785] [INFO] [logging.py:68:log_dist] [Rank 0] step=1740, skipped=4, lr=[7.255555555555556e-06], mom=[[0.9, 0.999]]
854
+ [2022-12-18 17:18:56,787] [INFO] [timer.py:196:stop] epoch=0/micro_step=1740/global_step=1740, RunningAvgSamplesPerSec=17.58932825421085, CurrSamplesPerSec=23.433776461872853, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
855
+ [2022-12-18 17:23:37,623] [INFO] [logging.py:68:log_dist] [Rank 0] step=1750, skipped=4, lr=[7.233333333333334e-06], mom=[[0.9, 0.999]]
856
+ [2022-12-18 17:23:37,624] [INFO] [timer.py:196:stop] epoch=0/micro_step=1750/global_step=1750, RunningAvgSamplesPerSec=17.58929052224577, CurrSamplesPerSec=17.722104328870888, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
857
+ {'loss': 0.0007, 'learning_rate': 7.233333333333334e-06, 'epoch': 15.0}
858
+ [2022-12-18 17:26:29,765] [INFO] [logging.py:68:log_dist] [Rank 0] step=1760, skipped=4, lr=[7.211111111111112e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 17:26:29,767] [INFO] [timer.py:196:stop] epoch=0/micro_step=1760/global_step=1760, RunningAvgSamplesPerSec=17.589265190884376, CurrSamplesPerSec=17.29575795734955, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-18 17:29:23,777] [INFO] [logging.py:68:log_dist] [Rank 0] step=1770, skipped=4, lr=[7.188888888888889e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 17:29:23,778] [INFO] [timer.py:196:stop] epoch=0/micro_step=1770/global_step=1770, RunningAvgSamplesPerSec=17.589199628233683, CurrSamplesPerSec=17.432689482722747, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0007, 'learning_rate': 7.177777777777778e-06, 'epoch': 15.01}
+ [2022-12-18 17:32:22,814] [INFO] [logging.py:68:log_dist] [Rank 0] step=1780, skipped=4, lr=[7.166666666666667e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 17:32:22,815] [INFO] [timer.py:196:stop] epoch=0/micro_step=1780/global_step=1780, RunningAvgSamplesPerSec=17.589156096057113, CurrSamplesPerSec=17.679121929983722, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-18 17:35:21,130] [INFO] [logging.py:68:log_dist] [Rank 0] step=1790, skipped=4, lr=[7.1444444444444446e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 17:35:21,132] [INFO] [timer.py:196:stop] epoch=0/micro_step=1790/global_step=1790, RunningAvgSamplesPerSec=17.589510541041726, CurrSamplesPerSec=17.75061537037282, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-18 17:38:22,077] [INFO] [logging.py:68:log_dist] [Rank 0] step=1800, skipped=4, lr=[7.122222222222222e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 17:38:22,079] [INFO] [timer.py:196:stop] epoch=0/micro_step=1800/global_step=1800, RunningAvgSamplesPerSec=17.589853715416716, CurrSamplesPerSec=17.70494516775891, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0007, 'learning_rate': 7.122222222222222e-06, 'epoch': 15.01}
+ [2022-12-18 17:41:22,983] [INFO] [logging.py:68:log_dist] [Rank 0] step=1810, skipped=4, lr=[7.100000000000001e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 17:41:22,985] [INFO] [timer.py:196:stop] epoch=0/micro_step=1810/global_step=1810, RunningAvgSamplesPerSec=17.589705689066374, CurrSamplesPerSec=17.502633426748798, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-18 17:44:18,958] [INFO] [logging.py:68:log_dist] [Rank 0] step=1820, skipped=4, lr=[7.077777777777778e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 17:44:18,960] [INFO] [timer.py:196:stop] epoch=0/micro_step=1820/global_step=1820, RunningAvgSamplesPerSec=17.58889554956267, CurrSamplesPerSec=17.227935399411322, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0007, 'learning_rate': 7.066666666666667e-06, 'epoch': 15.02}
+ [2022-12-18 17:47:21,773] [INFO] [logging.py:68:log_dist] [Rank 0] step=1830, skipped=4, lr=[7.055555555555557e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 17:47:21,775] [INFO] [timer.py:196:stop] epoch=0/micro_step=1830/global_step=1830, RunningAvgSamplesPerSec=17.588036262189785, CurrSamplesPerSec=17.291913033681215, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-18 17:50:27,447] [INFO] [logging.py:68:log_dist] [Rank 0] step=1840, skipped=4, lr=[7.033333333333334e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 17:50:27,448] [INFO] [timer.py:196:stop] epoch=0/micro_step=1840/global_step=1840, RunningAvgSamplesPerSec=17.587960597286486, CurrSamplesPerSec=17.724241030252486, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-18 17:52:52,644] [INFO] [logging.py:68:log_dist] [Rank 0] step=1850, skipped=4, lr=[7.011111111111112e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 17:52:52,646] [INFO] [timer.py:196:stop] epoch=0/micro_step=1850/global_step=1850, RunningAvgSamplesPerSec=17.587975214747647, CurrSamplesPerSec=17.64940665653735, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0006, 'learning_rate': 7.011111111111112e-06, 'epoch': 15.02}
+ [2022-12-18 17:56:08,287] [INFO] [logging.py:68:log_dist] [Rank 0] step=1860, skipped=4, lr=[6.9888888888888895e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 17:56:08,288] [INFO] [timer.py:196:stop] epoch=0/micro_step=1860/global_step=1860, RunningAvgSamplesPerSec=17.589801958416484, CurrSamplesPerSec=17.619758723894087, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-18 17:59:00,804] [INFO] [logging.py:68:log_dist] [Rank 0] step=1870, skipped=4, lr=[6.966666666666667e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 17:59:00,805] [INFO] [timer.py:196:stop] epoch=0/micro_step=1870/global_step=1870, RunningAvgSamplesPerSec=17.59022362418036, CurrSamplesPerSec=17.588560925502627, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0006, 'learning_rate': 6.955555555555557e-06, 'epoch': 16.0}
+ [2022-12-18 18:01:58,641] [INFO] [logging.py:68:log_dist] [Rank 0] step=1880, skipped=4, lr=[6.944444444444445e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 18:01:58,643] [INFO] [timer.py:196:stop] epoch=0/micro_step=1880/global_step=1880, RunningAvgSamplesPerSec=17.59061098794626, CurrSamplesPerSec=17.788324368998808, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-18 18:04:58,587] [INFO] [logging.py:68:log_dist] [Rank 0] step=1890, skipped=4, lr=[6.922222222222222e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 18:04:58,588] [INFO] [timer.py:196:stop] epoch=0/micro_step=1890/global_step=1890, RunningAvgSamplesPerSec=17.5910300704761, CurrSamplesPerSec=17.873249456898765, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-18 18:07:53,171] [INFO] [logging.py:68:log_dist] [Rank 0] step=1900, skipped=4, lr=[6.9e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 18:07:53,173] [INFO] [timer.py:196:stop] epoch=0/micro_step=1900/global_step=1900, RunningAvgSamplesPerSec=17.590747818093448, CurrSamplesPerSec=17.480120031668438, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0006, 'learning_rate': 6.9e-06, 'epoch': 16.01}
+ [2022-12-18 18:10:47,689] [INFO] [logging.py:68:log_dist] [Rank 0] step=1910, skipped=4, lr=[6.8777777777777785e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 18:10:47,691] [INFO] [timer.py:196:stop] epoch=0/micro_step=1910/global_step=1910, RunningAvgSamplesPerSec=17.59104725904212, CurrSamplesPerSec=17.710297408588108, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-18 18:13:42,983] [INFO] [logging.py:68:log_dist] [Rank 0] step=1920, skipped=4, lr=[6.855555555555556e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 18:13:42,985] [INFO] [timer.py:196:stop] epoch=0/micro_step=1920/global_step=1920, RunningAvgSamplesPerSec=17.590833049552817, CurrSamplesPerSec=17.4918464812311, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0006, 'learning_rate': 6.844444444444445e-06, 'epoch': 16.01}
+ [2022-12-18 18:16:38,362] [INFO] [logging.py:68:log_dist] [Rank 0] step=1930, skipped=4, lr=[6.833333333333334e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 18:16:38,364] [INFO] [timer.py:196:stop] epoch=0/micro_step=1930/global_step=1930, RunningAvgSamplesPerSec=17.591234933505756, CurrSamplesPerSec=17.439830364828275, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-18 18:19:28,894] [INFO] [logging.py:68:log_dist] [Rank 0] step=1940, skipped=4, lr=[6.811111111111111e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 18:19:28,896] [INFO] [timer.py:196:stop] epoch=0/micro_step=1940/global_step=1940, RunningAvgSamplesPerSec=17.590774338771933, CurrSamplesPerSec=17.599662529740584, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-18 18:22:22,602] [INFO] [logging.py:68:log_dist] [Rank 0] step=1950, skipped=4, lr=[6.788888888888889e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 18:22:22,604] [INFO] [timer.py:196:stop] epoch=0/micro_step=1950/global_step=1950, RunningAvgSamplesPerSec=17.589976661895054, CurrSamplesPerSec=17.679976600177262, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0006, 'learning_rate': 6.788888888888889e-06, 'epoch': 16.02}
+ [2022-12-18 18:25:14,881] [INFO] [logging.py:68:log_dist] [Rank 0] step=1960, skipped=4, lr=[6.7666666666666665e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 18:25:14,882] [INFO] [timer.py:196:stop] epoch=0/micro_step=1960/global_step=1960, RunningAvgSamplesPerSec=17.59061118827806, CurrSamplesPerSec=17.789994847282397, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-18 18:26:44,140] [INFO] [logging.py:68:log_dist] [Rank 0] step=1970, skipped=4, lr=[6.744444444444444e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 18:26:44,142] [INFO] [timer.py:196:stop] epoch=0/micro_step=1970/global_step=1970, RunningAvgSamplesPerSec=17.590670368534774, CurrSamplesPerSec=17.667350100092985, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0005, 'learning_rate': 6.733333333333334e-06, 'epoch': 17.0}
+ [2022-12-18 18:30:57,744] [INFO] [logging.py:68:log_dist] [Rank 0] step=1980, skipped=4, lr=[6.7222222222222235e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 18:30:57,745] [INFO] [timer.py:196:stop] epoch=0/micro_step=1980/global_step=1980, RunningAvgSamplesPerSec=17.591780103374816, CurrSamplesPerSec=17.696074726111295, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-18 18:33:47,038] [INFO] [logging.py:68:log_dist] [Rank 0] step=1990, skipped=4, lr=[6.700000000000001e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 18:33:47,039] [INFO] [timer.py:196:stop] epoch=0/micro_step=1990/global_step=1990, RunningAvgSamplesPerSec=17.591581591137096, CurrSamplesPerSec=17.78166561596026, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ [2022-12-18 18:36:39,443] [INFO] [logging.py:68:log_dist] [Rank 0] step=2000, skipped=4, lr=[6.677777777777779e-06], mom=[[0.9, 0.999]]
+ [2022-12-18 18:36:39,444] [INFO] [timer.py:196:stop] epoch=0/micro_step=2000/global_step=2000, RunningAvgSamplesPerSec=17.589825563330376, CurrSamplesPerSec=16.981787098279423, MemAllocated=0.53GB, MaxMemAllocated=17.47GB
+ {'loss': 0.0006, 'learning_rate': 6.677777777777779e-06, 'epoch': 17.01}
+ {'eval_loss': 0.30908203125, 'eval_wer': 17.69160002830656, 'eval_runtime': 1241.2535, 'eval_samples_per_second': 3.109, 'eval_steps_per_second': 0.097, 'epoch': 17.01}
+ [2022-12-18 18:57:21,750] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step2000 is begin to save!
+ [2022-12-18 18:57:21,758] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: ./checkpoint-2000/global_step2000/mp_rank_00_model_states.pt
+ [2022-12-18 18:57:21,758] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-2000/global_step2000/mp_rank_00_model_states.pt...
+ [2022-12-18 18:57:22,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-2000/global_step2000/mp_rank_00_model_states.pt.
+ [2022-12-18 18:57:22,823] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_optim_states.pt...
+ [2022-12-18 18:57:27,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
+ [2022-12-18 18:57:27,460] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_optim_states.pt
+ [2022-12-18 18:57:27,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now!
runs/Dec18_08-41-04_fe2747a042f0/events.out.tfevents.1671381730.fe2747a042f0.46148.0 CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:648235cf38bdc7d1daa38bae30699b9ba2dfe43a5b53cc2bc710c8ed357c6f54
- size 10859
+ oid sha256:28a685a481a4b67604e2aaac9b6de6dfd406c1f2afe4ce0b196060cc15a36104
+ size 17457