Multi-GPU support is disabled. Using a single GPU. +-----------------------+----------------------------------------------------+ | Parameter | Value | +-----------------------+----------------------------------------------------+ | train data pattern | dev/data/fineweb10B/fineweb_train_*.bin | | val data pattern | dev/data/fineweb10B/fineweb_val_*.bin | | output log dir | log124M | | checkpoint_every | 5000 | | resume | 0 | | micro batch size B | 32 | | sequence length T | 1024 | | total batch size | 524288 | | LR scheduler | cosine | | learning rate (LR) | 8.000000e-04 | | warmup iterations | 1000 | | final LR fraction | 0.000000e+00 | | weight decay | 1.000000e-01 | | skip update lossz | 0.000000 | | skip update gradz | 0.000000 | | max_steps | -1 | | val_loss_every | 250 | | val_max_steps | 20 | | sample_every | 20000 | | genT | 64 | | overfit_single_batch | 0 | | use_master_weights | enabled | | gelu_fusion | 0 | | recompute | 1 | +-----------------------+----------------------------------------------------+ | device | NVIDIA A100-PCIE-40GB | | peak TFlops | 312.0 | | precision | BF16 | +-----------------------+----------------------------------------------------+ | weight init method | d12 | | max_sequence_length T | 1024 | | vocab_size V | 50257 | | padded_vocab_size Vp | 50304 | | num_layers L | 12 | | num_heads NH | 12 | | channels C | 768 | | num_parameters | 124475904 | +-----------------------+----------------------------------------------------+ | train_num_batches | 19560 | | val_num_batches | 20 | +-----------------------+----------------------------------------------------+ | run hellaswag | yes | +-----------------------+----------------------------------------------------+ | num_processes | 1 | | zero_stage | 1 | +-----------------------+----------------------------------------------------+ num_parameters: 124475904 => bytes: 248951808 allocated 237 MiB for model parameters batch_size B=32 * seq_len T=1024 * num_processes=1 and 
total_batch_size=524288 => setting grad_accum_steps=16 allocating 237 MiB for parameter gradients allocating 19806 MiB for activations allocating 474 MiB for AdamW optimizer state m allocating 474 MiB for AdamW optimizer state v allocating 474 MiB for master copy of params device memory usage: 22198 MiB / 40338 MiB memory per sequence: 618 MiB -> estimated maximum batch size: 61 val loss 11.008742 step 1/19560 | loss 11.010252 (+nanz)| norm 15.1374 (+nanz)| lr 8.00e-07 | 4118.83 ms | 32.8% bf16 MFU | 127290 tok/s step 2/19560 | loss 10.960444 (+nanz)| norm 15.3064 (+nanz)| lr 1.60e-06 | 4251.37 ms | 31.8% bf16 MFU | 123322 tok/s step 3/19560 | loss 10.858796 (+nanz)| norm 14.8996 (+nanz)| lr 2.40e-06 | 4236.23 ms | 31.9% bf16 MFU | 123548 tok/s step 4/19560 | loss 10.735281 (+nanz)| norm 13.1831 (+nanz)| lr 3.20e-06 | 4061.22 ms | 33.2% bf16 MFU | 125493 tok/s step 5/19560 | loss 10.583439 (+nanz)| norm 10.9296 (+nanz)| lr 4.00e-06 | 4064.08 ms | 33.2% bf16 MFU | 126440 tok/s step 6/19560 | loss 10.441028 (+nanz)| norm 8.9012 (+nanz)| lr 4.80e-06 | 4210.41 ms | 32.1% bf16 MFU | 126016 tok/s step 7/19560 | loss 10.324488 (+nanz)| norm 7.3839 (+nanz)| lr 5.60e-06 | 4152.29 ms | 32.5% bf16 MFU | 126063 tok/s step 8/19560 | loss 10.212008 (+nanz)| norm 6.4902 (+nanz)| lr 6.40e-06 | 4376.58 ms | 30.9% bf16 MFU | 125024 tok/s step 9/19560 | loss 10.089766 (+nanz)| norm 5.6860 (+nanz)| lr 7.20e-06 | 4091.98 ms | 33.0% bf16 MFU | 125485 tok/s step 10/19560 | loss 10.028439 (+nanz)| norm 4.7370 (+nanz)| lr 8.00e-06 | 4086.65 ms | 33.0% bf16 MFU | 125864 tok/s step 11/19560 | loss 9.946451 (+nanz)| norm 4.0750 (+nanz)| lr 8.80e-06 | 4245.88 ms | 31.8% bf16 MFU | 125567 tok/s step 12/19560 | loss 9.865897 (+nanz)| norm 3.5795 (+nanz)| lr 9.60e-06 | 4068.50 ms | 33.2% bf16 MFU | 125950 tok/s step 13/19560 | loss 9.834242 (+nanz)| norm 3.0871 (+nanz)| lr 1.04e-05 | 4062.12 ms | 33.2% bf16 MFU | 126289 tok/s step 14/19560 | loss 9.770098 (+nanz)| norm 2.8175 (+nanz)| lr 1.12e-05 
| 4067.34 ms | 33.2% bf16 MFU | 126557 tok/s step 15/19560 | loss 9.726440 (+nanz)| norm 2.5846 (+nanz)| lr 1.20e-05 | 4075.21 ms | 33.1% bf16 MFU | 126762 tok/s step 16/19560 | loss 9.714825 (+nanz)| norm 2.3559 (+nanz)| lr 1.28e-05 | 4103.90 ms | 32.9% bf16 MFU | 126854 tok/s step 17/19560 | loss 9.659826 (+nanz)| norm 2.2976 (+nanz)| lr 1.36e-05 | 4084.53 ms | 33.1% bf16 MFU | 126989 tok/s step 18/19560 | loss 9.662764 (+nanz)| norm 2.1958 (+nanz)| lr 1.44e-05 | 4198.43 ms | 32.2% bf16 MFU | 126807 tok/s step 19/19560 | loss 9.617876 (+nanz)| norm 2.2032 (+nanz)| lr 1.52e-05 | 4130.70 ms | 32.7% bf16 MFU | 126817 tok/s step 20/19560 | loss 9.599512 (+nanz)| norm 2.1809 (+nanz)| lr 1.60e-05 | 4092.42 ms | 33.0% bf16 MFU | 126921 tok/s step 21/19560 | loss 9.588427 (+nanz)| norm 2.1542 (+nanz)| lr 1.68e-05 | 4093.43 ms | 33.0% bf16 MFU | 127011 tok/s step 22/19560 | loss 9.545383 (+nanz)| norm 2.1696 (+nanz)| lr 1.76e-05 | 4098.81 ms | 32.9% bf16 MFU | 127080 tok/s step 23/19560 | loss 9.528460 (+nanz)| norm 2.1315 (+nanz)| lr 1.84e-05 | 4098.19 ms | 32.9% bf16 MFU | 127143 tok/s step 24/19560 | loss 9.490421 (+nanz)| norm 2.1252 (+nanz)| lr 1.92e-05 | 4096.36 ms | 33.0% bf16 MFU | 127204 tok/s step 25/19560 | loss 9.498554 (+nanz)| norm 2.0054 (+nanz)| lr 2.00e-05 | 4113.19 ms | 32.8% bf16 MFU | 127222 tok/s step 26/19560 | loss 9.459247 (+nanz)| norm 2.0086 (+nanz)| lr 2.08e-05 | 4119.02 ms | 32.8% bf16 MFU | 127226 tok/s step 27/19560 | loss 9.429609 (+nanz)| norm 1.9940 (+nanz)| lr 2.16e-05 | 4109.82 ms | 32.9% bf16 MFU | 127250 tok/s step 28/19560 | loss 9.399757 (+nanz)| norm 2.0641 (+nanz)| lr 2.24e-05 | 4114.92 ms | 32.8% bf16 MFU | 127261 tok/s step 29/19560 | loss 9.364097 (+nanz)| norm 1.9625 (+nanz)| lr 2.32e-05 | 4111.45 ms | 32.8% bf16 MFU | 127277 tok/s step 30/19560 | loss 9.308148 (+nanz)| norm 2.0624 (+nanz)| lr 2.40e-05 | 4128.31 ms | 32.7% bf16 MFU | 127259 tok/s step 31/19560 | loss 9.306782 (+nanz)| norm 1.8931 (+nanz)| lr 2.48e-05 | 4122.72 
ms | 32.7% bf16 MFU | 127254 tok/s step 32/19560 | loss 9.221861 (+nanz)| norm 2.0090 (+nanz)| lr 2.56e-05 | 4122.90 ms | 32.7% bf16 MFU | 127248 tok/s step 33/19560 | loss 9.208517 (+nanz)| norm 2.1012 (+nanz)| lr 2.64e-05 | 4128.80 ms | 32.7% bf16 MFU | 127232 tok/s step 34/19560 | loss 9.201256 (+nanz)| norm 1.9531 (+nanz)| lr 2.72e-05 | 4119.63 ms | 32.8% bf16 MFU | 127234 tok/s step 35/19560 | loss 9.141226 (+nanz)| norm 1.9688 (+nanz)| lr 2.80e-05 | 4130.76 ms | 32.7% bf16 MFU | 127215 tok/s step 36/19560 | loss 9.097227 (+nanz)| norm 1.8949 (+nanz)| lr 2.88e-05 | 4138.86 ms | 32.6% bf16 MFU | 127183 tok/s step 37/19560 | loss 9.071661 (+nanz)| norm 1.8116 (+nanz)| lr 2.96e-05 | 4134.60 ms | 32.7% bf16 MFU | 127160 tok/s step 38/19560 | loss 9.035854 (+nanz)| norm 1.8791 (+nanz)| lr 3.04e-05 | 4132.48 ms | 32.7% bf16 MFU | 127143 tok/s step 39/19560 | loss 9.035684 (+nanz)| norm 1.7778 (+nanz)| lr 3.12e-05 | 4146.35 ms | 32.6% bf16 MFU | 127102 tok/s step 40/19560 | loss 8.988352 (+nanz)| norm 1.7956 (+nanz)| lr 3.20e-05 | 4141.60 ms | 32.6% bf16 MFU | 127073 tok/s step 41/19560 | loss 8.967684 (+nanz)| norm 1.6771 (+nanz)| lr 3.28e-05 | 4133.09 ms | 32.7% bf16 MFU | 127060 tok/s step 42/19560 | loss 8.886814 (+nanz)| norm 1.6862 (+nanz)| lr 3.36e-05 | 4140.29 ms | 32.6% bf16 MFU | 127036 tok/s step 43/19560 | loss 8.871346 (+nanz)| norm 1.6591 (+nanz)| lr 3.44e-05 | 4141.99 ms | 32.6% bf16 MFU | 127010 tok/s step 44/19560 | loss 8.840160 (+nanz)| norm 1.6621 (+nanz)| lr 3.52e-05 | 4145.28 ms | 32.6% bf16 MFU | 126980 tok/s step 45/19560 | loss 8.800434 (+nanz)| norm 1.9549 (+nanz)| lr 3.60e-05 | 4138.34 ms | 32.6% bf16 MFU | 126964 tok/s step 46/19560 | loss 8.795374 (+nanz)| norm 2.1858 (+nanz)| lr 3.68e-05 | 4149.99 ms | 32.5% bf16 MFU | 126929 tok/s step 47/19560 | loss 8.716420 (+nanz)| norm 1.8067 (+nanz)| lr 3.76e-05 | 4140.48 ms | 32.6% bf16 MFU | 126912 tok/s step 48/19560 | loss 8.675685 (+nanz)| norm 1.7334 (+nanz)| lr 3.84e-05 | 4151.85 ms | 32.5% 
bf16 MFU | 126877 tok/s step 49/19560 | loss 8.659918 (+nanz)| norm 1.9643 (+nanz)| lr 3.92e-05 | 4142.06 ms | 32.6% bf16 MFU | 126861 tok/s step 50/19560 | loss 8.655099 (+nanz)| norm 1.6912 (+nanz)| lr 4.00e-05 | 4138.16 ms | 32.6% bf16 MFU | 126852 tok/s step 51/19560 | loss 8.610016 (+nanz)| norm 1.5520 (+nanz)| lr 4.08e-05 | 4159.17 ms | 32.5% bf16 MFU | 126809 tok/s step 52/19560 | loss 8.514632 (+nanz)| norm 1.7012 (+nanz)| lr 4.16e-05 | 4147.41 ms | 32.6% bf16 MFU | 126787 tok/s step 53/19560 | loss 8.525326 (+nanz)| norm 1.7607 (+nanz)| lr 4.24e-05 | 4136.28 ms | 32.6% bf16 MFU | 126786 tok/s step 54/19560 | loss 8.493053 (+nanz)| norm 1.5803 (+nanz)| lr 4.32e-05 | 4151.27 ms | 32.5% bf16 MFU | 126759 tok/s step 55/19560 | loss 8.490725 (+nanz)| norm 1.4799 (+nanz)| lr 4.40e-05 | 4147.92 ms | 32.6% bf16 MFU | 126740 tok/s step 56/19560 | loss 8.421271 (+nanz)| norm 1.8690 (+nanz)| lr 4.48e-05 | 4140.65 ms | 32.6% bf16 MFU | 126734 tok/s step 57/19560 | loss 8.361889 (+nanz)| norm 1.6274 (+nanz)| lr 4.56e-05 | 4160.86 ms | 32.4% bf16 MFU | 126695 tok/s step 58/19560 | loss 8.354483 (+nanz)| norm 1.5379 (+nanz)| lr 4.64e-05 | 4147.96 ms | 32.6% bf16 MFU | 126679 tok/s step 59/19560 | loss 8.285403 (+nanz)| norm 1.4609 (+nanz)| lr 4.72e-05 | 4149.19 ms | 32.5% bf16 MFU | 126662 tok/s step 60/19560 | loss 8.308445 (+nanz)| norm 1.4892 (+nanz)| lr 4.80e-05 | 4146.28 ms | 32.6% bf16 MFU | 126651 tok/s step 61/19560 | loss 8.252878 (+nanz)| norm 1.5298 (+nanz)| lr 4.88e-05 | 4145.23 ms | 32.6% bf16 MFU | 126642 tok/s step 62/19560 | loss 8.183454 (+nanz)| norm 1.4921 (+nanz)| lr 4.96e-05 | 4143.14 ms | 32.6% bf16 MFU | 126637 tok/s step 63/19560 | loss 8.169135 (+nanz)| norm 1.3528 (+nanz)| lr 5.04e-05 | 4150.25 ms | 32.5% bf16 MFU | 126621 tok/s step 64/19560 | loss 8.083473 (+nanz)| norm 1.4473 (+nanz)| lr 5.12e-05 | 4149.39 ms | 32.5% bf16 MFU | 126607 tok/s step 65/19560 | loss 8.082701 (+nanz)| norm 1.4074 (+nanz)| lr 5.20e-05 | 4142.93 ms | 32.6% bf16 MFU | 
126604 tok/s step 66/19560 | loss 8.082714 (+nanz)| norm 1.3400 (+nanz)| lr 5.28e-05 | 4149.00 ms | 32.5% bf16 MFU | 126592 tok/s step 67/19560 | loss 8.024032 (+nanz)| norm 1.2917 (+nanz)| lr 5.36e-05 | 4147.00 ms | 32.6% bf16 MFU | 126583 tok/s step 68/19560 | loss 8.006344 (+nanz)| norm 1.3408 (+nanz)| lr 5.44e-05 | 4147.95 ms | 32.6% bf16 MFU | 126573 tok/s step 69/19560 | loss 7.957113 (+nanz)| norm 1.6717 (+nanz)| lr 5.52e-05 | 4169.26 ms | 32.4% bf16 MFU | 126531 tok/s step 70/19560 | loss 7.933222 (+nanz)| norm 1.3684 (+nanz)| lr 5.60e-05 | 4151.80 ms | 32.5% bf16 MFU | 126518 tok/s step 71/19560 | loss 7.866104 (+nanz)| norm 1.4582 (+nanz)| lr 5.68e-05 | 4147.68 ms | 32.6% bf16 MFU | 126512 tok/s step 72/19560 | loss 7.856381 (+nanz)| norm 1.5705 (+nanz)| lr 5.76e-05 | 4145.51 ms | 32.6% bf16 MFU | 126510 tok/s step 73/19560 | loss 7.877795 (+nanz)| norm 1.2671 (+nanz)| lr 5.84e-05 | 4141.18 ms | 32.6% bf16 MFU | 126515 tok/s step 74/19560 | loss 7.800503 (+nanz)| norm 1.2606 (+nanz)| lr 5.92e-05 | 4145.85 ms | 32.6% bf16 MFU | 126512 tok/s step 75/19560 | loss 7.822316 (+nanz)| norm 1.2188 (+nanz)| lr 6.00e-05 | 4145.30 ms | 32.6% bf16 MFU | 126510 tok/s step 76/19560 | loss 7.735156 (+nanz)| norm 1.2774 (+nanz)| lr 6.08e-05 | 4185.02 ms | 32.3% bf16 MFU | 126447 tok/s step 77/19560 | loss 7.701460 (+nanz)| norm 1.1801 (+nanz)| lr 6.16e-05 | 4155.32 ms | 32.5% bf16 MFU | 126433 tok/s step 78/19560 | loss 7.674749 (+nanz)| norm 1.2271 (+nanz)| lr 6.24e-05 | 4156.05 ms | 32.5% bf16 MFU | 126419 tok/s step 79/19560 | loss 7.625961 (+nanz)| norm 1.1107 (+nanz)| lr 6.32e-05 | 4139.93 ms | 32.6% bf16 MFU | 126430 tok/s step 80/19560 | loss 7.588456 (+nanz)| norm 1.2504 (+nanz)| lr 6.40e-05 | 4200.98 ms | 32.1% bf16 MFU | 126347 tok/s step 81/19560 | loss 7.583718 (+nanz)| norm 1.3325 (+nanz)| lr 6.48e-05 | 4143.18 ms | 32.6% bf16 MFU | 126357 tok/s step 82/19560 | loss 7.543869 (+nanz)| norm 0.9288 (+nanz)| lr 6.56e-05 | 4146.88 ms | 32.6% bf16 MFU | 126361 
tok/s step 83/19560 | loss 7.558131 (+nanz)| norm 1.0903 (+nanz)| lr 6.64e-05 | 4151.50 ms | 32.5% bf16 MFU | 126357 tok/s step 84/19560 | loss 7.567950 (+nanz)| norm 1.0736 (+nanz)| lr 6.72e-05 | 4147.36 ms | 32.6% bf16 MFU | 126360 tok/s step 85/19560 | loss 7.471190 (+nanz)| norm 1.2254 (+nanz)| lr 6.80e-05 | 4143.52 ms | 32.6% bf16 MFU | 126369 tok/s step 86/19560 | loss 7.454552 (+nanz)| norm 0.9348 (+nanz)| lr 6.88e-05 | 4144.84 ms | 32.6% bf16 MFU | 126375 tok/s step 87/19560 | loss 7.399378 (+nanz)| norm 1.0285 (+nanz)| lr 6.96e-05 | 4152.58 ms | 32.5% bf16 MFU | 126369 tok/s step 88/19560 | loss 7.428467 (+nanz)| norm 1.0977 (+nanz)| lr 7.04e-05 | 4145.80 ms | 32.6% bf16 MFU | 126374 tok/s step 89/19560 | loss 7.436365 (+nanz)| norm 0.8791 (+nanz)| lr 7.12e-05 | 4149.27 ms | 32.5% bf16 MFU | 126373 tok/s step 90/19560 | loss 7.382101 (+nanz)| norm 0.8753 (+nanz)| lr 7.20e-05 | 4161.17 ms | 32.4% bf16 MFU | 126354 tok/s step 91/19560 | loss 7.309272 (+nanz)| norm 0.8956 (+nanz)| lr 7.28e-05 | 4153.28 ms | 32.5% bf16 MFU | 126348 tok/s step 92/19560 | loss 7.317262 (+nanz)| norm 1.2818 (+nanz)| lr 7.36e-05 | 4147.81 ms | 32.6% bf16 MFU | 126351 tok/s step 93/19560 | loss 7.315147 (+nanz)| norm 0.9560 (+nanz)| lr 7.44e-05 | 4153.12 ms | 32.5% bf16 MFU | 126345 tok/s step 94/19560 | loss 7.325515 (+nanz)| norm 0.9770 (+nanz)| lr 7.52e-05 | 4158.07 ms | 32.5% bf16 MFU | 126332 tok/s step 95/19560 | loss 7.259615 (+nanz)| norm 1.1683 (+nanz)| lr 7.60e-05 | 4159.84 ms | 32.5% bf16 MFU | 126317 tok/s step 96/19560 | loss 7.257096 (+nanz)| norm 1.2087 (+nanz)| lr 7.68e-05 | 4150.06 ms | 32.5% bf16 MFU | 126318 tok/s step 97/19560 | loss 7.192627 (+nanz)| norm 0.8916 (+nanz)| lr 7.76e-05 | 4193.42 ms | 32.2% bf16 MFU | 126253 tok/s step 98/19560 | loss 7.217953 (+nanz)| norm 0.8371 (+nanz)| lr 7.84e-05 | 4150.64 ms | 32.5% bf16 MFU | 126256 tok/s step 99/19560 | loss 7.201939 (+nanz)| norm 0.9242 (+nanz)| lr 7.92e-05 | 4147.01 ms | 32.6% bf16 MFU | 126264 tok/s step 
100/19560 | loss 7.159008 (+nanz)| norm 0.8629 (+nanz)| lr 8.00e-05 | 4147.11 ms | 32.6% bf16 MFU | 126272 tok/s step 101/19560 | loss 7.177856 (+nanz)| norm 0.8955 (+nanz)| lr 8.08e-05 | 4155.94 ms | 32.5% bf16 MFU | 126266 tok/s step 102/19560 | loss 7.144492 (+nanz)| norm 0.8961 (+nanz)| lr 8.16e-05 | 4154.24 ms | 32.5% bf16 MFU | 126263 tok/s step 103/19560 | loss 7.050543 (+nanz)| norm 0.7581 (+nanz)| lr 8.24e-05 | 4146.09 ms | 32.6% bf16 MFU | 126273 tok/s step 104/19560 | loss 7.110675 (+nanz)| norm 0.6167 (+nanz)| lr 8.32e-05 | 4140.47 ms | 32.6% bf16 MFU | 126291 tok/s step 105/19560 | loss 7.082663 (+nanz)| norm 0.6710 (+nanz)| lr 8.40e-05 | 4161.09 ms | 32.4% bf16 MFU | 126276 tok/s step 106/19560 | loss 7.063635 (+nanz)| norm 0.8025 (+nanz)| lr 8.48e-05 | 4157.78 ms | 32.5% bf16 MFU | 126267 tok/s step 107/19560 | loss 7.077119 (+nanz)| norm 0.7568 (+nanz)| lr 8.56e-05 | 4156.38 ms | 32.5% bf16 MFU | 126261 tok/s step 108/19560 | loss 7.051853 (+nanz)| norm 1.3567 (+nanz)| lr 8.64e-05 | 4144.93 ms | 32.6% bf16 MFU | 126272 tok/s step 109/19560 | loss 7.007021 (+nanz)| norm 1.1876 (+nanz)| lr 8.72e-05 | 4160.05 ms | 32.5% bf16 MFU | 126260 tok/s step 110/19560 | loss 7.071463 (+nanz)| norm 0.8776 (+nanz)| lr 8.80e-05 | 4200.62 ms | 32.1% bf16 MFU | 126187 tok/s step 111/19560 | loss 7.032667 (+nanz)| norm 1.1221 (+nanz)| lr 8.88e-05 | 4156.35 ms | 32.5% bf16 MFU | 126185 tok/s step 112/19560 | loss 7.025249 (+nanz)| norm 0.9512 (+nanz)| lr 8.96e-05 | 4152.27 ms | 32.5% bf16 MFU | 126189 tok/s step 113/19560 | loss 7.013064 (+nanz)| norm 0.8446 (+nanz)| lr 9.04e-05 | 4142.29 ms | 32.6% bf16 MFU | 126208 tok/s step 114/19560 | loss 6.911192 (+nanz)| norm 1.3189 (+nanz)| lr 9.12e-05 | 4155.78 ms | 32.5% bf16 MFU | 126206 tok/s step 115/19560 | loss 6.920349 (+nanz)| norm 0.9181 (+nanz)| lr 9.20e-05 | 4179.52 ms | 32.3% bf16 MFU | 126167 tok/s step 116/19560 | loss 6.945459 (+nanz)| norm 0.8464 (+nanz)| lr 9.28e-05 | 4159.59 ms | 32.5% bf16 MFU | 126161 
tok/s step 117/19560 | loss 6.909839 (+nanz)| norm 0.9026 (+nanz)| lr 9.36e-05 | 4155.42 ms | 32.5% bf16 MFU | 126162 tok/s step 118/19560 | loss 6.946344 (+nanz)| norm 0.9394 (+nanz)| lr 9.44e-05 | 4152.18 ms | 32.5% bf16 MFU | 126167 tok/s step 119/19560 | loss 6.905879 (+nanz)| norm 0.9522 (+nanz)| lr 9.52e-05 | 4150.38 ms | 32.5% bf16 MFU | 126175 tok/s step 120/19560 | loss 6.873213 (+nanz)| norm 1.4717 (+nanz)| lr 9.60e-05 | 4155.06 ms | 32.5% bf16 MFU | 126175 tok/s step 121/19560 | loss 6.867551 (+nanz)| norm 0.6873 (+nanz)| lr 9.68e-05 | 4152.56 ms | 32.5% bf16 MFU | 126179 tok/s step 122/19560 | loss 6.892013 (+nanz)| norm 0.8919 (+nanz)| lr 9.76e-05 | 4143.27 ms | 32.6% bf16 MFU | 126197 tok/s step 123/19560 | loss 6.894456 (+nanz)| norm 0.8579 (+nanz)| lr 9.84e-05 | 4161.13 ms | 32.4% bf16 MFU | 126187 tok/s step 124/19560 | loss 6.869079 (+nanz)| norm 0.7176 (+nanz)| lr 9.92e-05 | 4155.19 ms | 32.5% bf16 MFU | 126187 tok/s step 125/19560 | loss 6.875314 (+nanz)| norm 0.8161 (+nanz)| lr 1.00e-04 | 4210.50 ms | 32.1% bf16 MFU | 126103 tok/s step 126/19560 | loss 6.893086 (+nanz)| norm 1.1581 (+nanz)| lr 1.01e-04 | 4166.67 ms | 32.4% bf16 MFU | 126089 tok/s step 127/19560 | loss 6.846540 (+nanz)| norm 1.1925 (+nanz)| lr 1.02e-04 | 4162.01 ms | 32.4% bf16 MFU | 126083 tok/s step 128/19560 | loss 6.897065 (+nanz)| norm 1.0194 (+nanz)| lr 1.02e-04 | 4151.14 ms | 32.5% bf16 MFU | 126094 tok/s step 129/19560 | loss 6.837648 (-1.27z)| norm 1.0059 (-0.44z)| lr 1.03e-04 | 4157.33 ms | 32.5% bf16 MFU | 126095 tok/s step 130/19560 | loss 6.794366 (-1.31z)| norm 0.5186 (-0.68z)| lr 1.04e-04 | 4151.87 ms | 32.5% bf16 MFU | 126104 tok/s step 131/19560 | loss 6.771255 (-1.32z)| norm 0.7930 (-0.59z)| lr 1.05e-04 | 4159.69 ms | 32.5% bf16 MFU | 126101 tok/s step 132/19560 | loss 6.808474 (-1.28z)| norm 0.7840 (-0.66z)| lr 1.06e-04 | 4163.14 ms | 32.4% bf16 MFU | 126093 tok/s step 133/19560 | loss 6.836285 (-1.24z)| norm 0.6361 (-0.85z)| lr 1.06e-04 | 4157.13 ms | 32.5% 
bf16 MFU | 126094 tok/s step 134/19560 | loss 6.773589 (-1.29z)| norm 0.7279 (-0.85z)| lr 1.07e-04 | 4155.68 ms | 32.5% bf16 MFU | 126097 tok/s step 135/19560 | loss 6.742445 (-1.31z)| norm 0.7796 (-0.87z)| lr 1.08e-04 | 4154.80 ms | 32.5% bf16 MFU | 126102 tok/s step 136/19560 | loss 6.694206 (-1.34z)| norm 1.0169 (-0.64z)| lr 1.09e-04 | 4188.64 ms | 32.2% bf16 MFU | 126055 tok/s step 137/19560 | loss 6.729797 (-1.29z)| norm 1.1352 (-0.50z)| lr 1.10e-04 | 4152.05 ms | 32.5% bf16 MFU | 126066 tok/s step 138/19560 | loss 6.712089 (-1.30z)| norm 1.3162 (-0.20z)| lr 1.10e-04 | 4159.96 ms | 32.5% bf16 MFU | 126064 tok/s step 139/19560 | loss 6.631396 (-1.36z)| norm 1.2530 (-0.30z)| lr 1.11e-04 | 4165.30 ms | 32.4% bf16 MFU | 126055 tok/s step 140/19560 | loss 6.652275 (-1.33z)| norm 1.0838 (-0.61z)| lr 1.12e-04 | 4162.85 ms | 32.4% bf16 MFU | 126049 tok/s step 141/19560 | loss 6.643289 (-1.32z)| norm 0.8323 (-1.10z)| lr 1.13e-04 | 4157.97 ms | 32.5% bf16 MFU | 126051 tok/s step 142/19560 | loss 6.661393 (-1.29z)| norm 0.9654 (-0.83z)| lr 1.14e-04 | 4150.62 ms | 32.5% bf16 MFU | 126065 tok/s step 143/19560 | loss 6.691791 (-1.24z)| norm 0.9506 (-0.86z)| lr 1.14e-04 | 4158.97 ms | 32.5% bf16 MFU | 126064 tok/s step 144/19560 | loss 6.610051 (-1.31z)| norm 1.2077 (-0.29z)| lr 1.15e-04 | 4148.90 ms | 32.5% bf16 MFU | 126080 tok/s step 145/19560 | loss 6.630790 (-1.27z)| norm 1.1439 (-0.42z)| lr 1.16e-04 | 4154.38 ms | 32.5% bf16 MFU | 126086 tok/s step 146/19560 | loss 6.627977 (-1.26z)| norm 0.9405 (-0.86z)| lr 1.17e-04 | 4162.97 ms | 32.4% bf16 MFU | 126078 tok/s step 147/19560 | loss 6.623582 (-1.25z)| norm 1.0831 (-0.53z)| lr 1.18e-04 | 4155.93 ms | 32.5% bf16 MFU | 126082 tok/s step 148/19560 | loss 6.654934 (-1.21z)| norm 0.9481 (-0.82z)| lr 1.18e-04 | 4163.14 ms | 32.4% bf16 MFU | 126075 tok/s step 149/19560 | loss 6.632069 (-1.22z)| norm 0.8052 (-1.14z)| lr 1.19e-04 | 4152.91 ms | 32.5% bf16 MFU | 126083 tok/s step 150/19560 | loss 6.618417 (-1.22z)| norm 0.8062 
(-1.12z)| lr 1.20e-04 | 4163.02 ms | 32.4% bf16 MFU | 126076 tok/s step 151/19560 | loss 6.628115 (-1.20z)| norm 1.4391 (+0.39z)| lr 1.21e-04 | 4173.42 ms | 32.4% bf16 MFU | 126054 tok/s step 152/19560 | loss 6.591807 (-1.22z)| norm 1.3223 (+0.12z)| lr 1.22e-04 | 4152.38 ms | 32.5% bf16 MFU | 126064 tok/s step 153/19560 | loss 6.566777 (-1.24z)| norm 1.2941 (+0.07z)| lr 1.22e-04 | 4150.18 ms | 32.5% bf16 MFU | 126077 tok/s step 154/19560 | loss 6.575416 (-1.22z)| norm 0.7564 (-1.24z)| lr 1.23e-04 | 4154.41 ms | 32.5% bf16 MFU | 126084 tok/s step 155/19560 | loss 6.612092 (-1.17z)| norm 0.9635 (-0.71z)| lr 1.24e-04 | 4151.10 ms | 32.5% bf16 MFU | 126094 tok/s step 156/19560 | loss 6.601549 (-1.17z)| norm 0.7531 (-1.23z)| lr 1.25e-04 | 4162.89 ms | 32.4% bf16 MFU | 126087 tok/s step 157/19560 | loss 6.589478 (-1.17z)| norm 0.8134 (-1.06z)| lr 1.26e-04 | 4152.81 ms | 32.5% bf16 MFU | 126095 tok/s step 158/19560 | loss 6.549174 (-1.21z)| norm 1.1286 (-0.24z)| lr 1.26e-04 | 4155.78 ms | 32.5% bf16 MFU | 126098 tok/s step 159/19560 | loss 6.534319 (-1.21z)| norm 0.9806 (-0.62z)| lr 1.27e-04 | 4159.21 ms | 32.5% bf16 MFU | 126096 tok/s step 160/19560 | loss 6.551050 (-1.18z)| norm 1.1464 (-0.16z)| lr 1.28e-04 | 4158.04 ms | 32.5% bf16 MFU | 126096 tok/s step 161/19560 | loss 6.567881 (-1.15z)| norm 1.1175 (-0.23z)| lr 1.29e-04 | 4167.34 ms | 32.4% bf16 MFU | 126081 tok/s step 162/19560 | loss 6.568004 (-1.14z)| norm 1.1935 (-0.00z)| lr 1.30e-04 | 4159.45 ms | 32.5% bf16 MFU | 126080 tok/s step 163/19560 | loss 6.462445 (-1.27z)| norm 1.0743 (-0.32z)| lr 1.30e-04 | 4171.41 ms | 32.4% bf16 MFU | 126060 tok/s step 164/19560 | loss 6.600551 (-1.07z)| norm 0.9601 (-0.63z)| lr 1.31e-04 | 4172.01 ms | 32.4% bf16 MFU | 126040 tok/s step 165/19560 | loss 6.520448 (-1.17z)| norm 1.0136 (-0.47z)| lr 1.32e-04 | 4156.56 ms | 32.5% bf16 MFU | 126045 tok/s step 166/19560 | loss 6.501635 (-1.19z)| norm 1.0110 (-0.46z)| lr 1.33e-04 | 4155.36 ms | 32.5% bf16 MFU | 126051 tok/s step 
167/19560 | loss 6.620176 (-1.02z)| norm 0.9232 (-0.71z)| lr 1.34e-04 | 4153.44 ms | 32.5% bf16 MFU | 126060 tok/s step 168/19560 | loss 6.426803 (-1.27z)| norm 0.9835 (-0.52z)| lr 1.34e-04 | 4159.75 ms | 32.5% bf16 MFU | 126059 tok/s step 169/19560 | loss 6.472548 (-1.20z)| norm 0.8764 (-0.83z)| lr 1.35e-04 | 4154.84 ms | 32.5% bf16 MFU | 126066 tok/s step 170/19560 | loss 6.529960 (-1.11z)| norm 1.0743 (-0.21z)| lr 1.36e-04 | 4151.14 ms | 32.5% bf16 MFU | 126077 tok/s step 171/19560 | loss 6.497211 (-1.15z)| norm 1.1497 (+0.04z)| lr 1.37e-04 | 4156.16 ms | 32.5% bf16 MFU | 126081 tok/s step 172/19560 | loss 6.494313 (-1.14z)| norm 1.3722 (+0.75z)| lr 1.38e-04 | 4145.66 ms | 32.6% bf16 MFU | 126100 tok/s step 173/19560 | loss 6.489421 (-1.14z)| norm 0.8580 (-0.87z)| lr 1.38e-04 | 4165.84 ms | 32.4% bf16 MFU | 126088 tok/s step 174/19560 | loss 6.444068 (-1.20z)| norm 1.0669 (-0.18z)| lr 1.39e-04 | 4153.99 ms | 32.5% bf16 MFU | 126094 tok/s step 175/19560 | loss 6.480557 (-1.14z)| norm 1.2814 (+0.58z)| lr 1.40e-04 | 4164.31 ms | 32.4% bf16 MFU | 126084 tok/s step 176/19560 | loss 6.436944 (-1.20z)| norm 1.0185 (-0.32z)| lr 1.41e-04 | 4158.38 ms | 32.5% bf16 MFU | 126084 tok/s step 177/19560 | loss 6.417393 (-1.22z)| norm 0.8620 (-0.87z)| lr 1.42e-04 | 4165.66 ms | 32.4% bf16 MFU | 126073 tok/s step 178/19560 | loss 6.444425 (-1.17z)| norm 0.7335 (-1.33z)| lr 1.42e-04 | 4157.60 ms | 32.5% bf16 MFU | 126074 tok/s step 179/19560 | loss 6.507324 (-1.06z)| norm 0.8321 (-0.95z)| lr 1.43e-04 | 4163.73 ms | 32.4% bf16 MFU | 126067 tok/s step 180/19560 | loss 6.436347 (-1.17z)| norm 1.2933 (+0.79z)| lr 1.44e-04 | 4156.27 ms | 32.5% bf16 MFU | 126070 tok/s step 181/19560 | loss 6.447363 (-1.14z)| norm 0.7475 (-1.28z)| lr 1.45e-04 | 4164.53 ms | 32.4% bf16 MFU | 126062 tok/s step 182/19560 | loss 6.448152 (-1.13z)| norm 0.9618 (-0.43z)| lr 1.46e-04 | 4156.01 ms | 32.5% bf16 MFU | 126066 tok/s step 183/19560 | loss 6.481495 (-1.07z)| norm 0.8318 (-0.93z)| lr 1.46e-04 | 4154.48 
ms | 32.5% bf16 MFU | 126073 tok/s step 184/19560 | loss 6.431007 (-1.15z)| norm 1.1429 (+0.34z)| lr 1.47e-04 | 4158.03 ms | 32.5% bf16 MFU | 126074 tok/s step 185/19560 | loss 6.491349 (-1.03z)| norm 1.3199 (+1.10z)| lr 1.48e-04 | 4160.84 ms | 32.4% bf16 MFU | 126070 tok/s step 186/19560 | loss 6.425117 (-1.15z)| norm 1.3398 (+1.20z)| lr 1.49e-04 | 4156.44 ms | 32.5% bf16 MFU | 126074 tok/s step 187/19560 | loss 6.492467 (-1.02z)| norm 0.9316 (-0.52z)| lr 1.50e-04 | 4157.90 ms | 32.5% bf16 MFU | 126075 tok/s step 188/19560 | loss 6.418284 (-1.16z)| norm 0.9874 (-0.27z)| lr 1.50e-04 | 4158.54 ms | 32.5% bf16 MFU | 126075 tok/s step 189/19560 | loss 6.425578 (-1.14z)| norm 1.2029 (+0.69z)| lr 1.51e-04 | 4151.23 ms | 32.5% bf16 MFU | 126086 tok/s step 190/19560 | loss 6.385678 (-1.21z)| norm 1.1405 (+0.43z)| lr 1.52e-04 | 4154.52 ms | 32.5% bf16 MFU | 126091 tok/s step 191/19560 | loss 6.419655 (-1.13z)| norm 0.8662 (-0.78z)| lr 1.53e-04 | 4399.46 ms | 30.7% bf16 MFU | 125745 tok/s step 192/19560 | loss 6.412455 (-1.14z)| norm 0.7171 (-1.44z)| lr 1.54e-04 | 4437.91 ms | 30.4% bf16 MFU | 125365 tok/s step 193/19560 | loss 6.393267 (-1.18z)| norm 0.7466 (-1.28z)| lr 1.54e-04 | 4215.96 ms | 32.0% bf16 MFU | 125315 tok/s step 194/19560 | loss 6.389833 (-1.18z)| norm 0.8254 (-0.91z)| lr 1.55e-04 | 4223.12 ms | 32.0% bf16 MFU | 125256 tok/s step 195/19560 | loss 6.438484 (-1.06z)| norm 0.7640 (-1.17z)| lr 1.56e-04 | 4154.67 ms | 32.5% bf16 MFU | 125303 tok/s step 196/19560 | loss 6.431596 (-1.07z)| norm 0.7697 (-1.13z)| lr 1.57e-04 | 4432.13 ms | 30.5% bf16 MFU | 124953 tok/s step 197/19560 | loss 6.408412 (-1.12z)| norm 1.0715 (+0.29z)| lr 1.58e-04 | 4159.79 ms | 32.5% bf16 MFU | 125007 tok/s step 198/19560 | loss 6.396133 (-1.14z)| norm 1.1812 (+0.82z)| lr 1.58e-04 | 4261.18 ms | 31.7% bf16 MFU | 124908 tok/s step 199/19560 | loss 6.328794 (-1.30z)| norm 1.0983 (+0.45z)| lr 1.59e-04 | 4156.37 ms | 32.5% bf16 MFU | 124970 tok/s step 200/19560 | loss 6.328140 (-1.29z)| 
norm 1.2680 (+1.32z)| lr 1.60e-04 | 4158.44 ms | 32.5% bf16 MFU | 125025 tok/s step 201/19560 | loss 6.350534 (-1.23z)| norm 0.9462 (-0.28z)| lr 1.61e-04 | 4149.00 ms | 32.5% bf16 MFU | 125092 tok/s step 202/19560 | loss 6.352553 (-1.22z)| norm 1.3125 (+1.56z)| lr 1.62e-04 | 4155.95 ms | 32.5% bf16 MFU | 125145 tok/s step 203/19560 | loss 6.325451 (-1.29z)| norm 0.7729 (-1.14z)| lr 1.62e-04 | 4161.62 ms | 32.4% bf16 MFU | 125187 tok/s step 204/19560 | loss 6.338996 (-1.25z)| norm 0.7244 (-1.36z)| lr 1.63e-04 | 4162.80 ms | 32.4% bf16 MFU | 125225 tok/s step 205/19560 | loss 6.313797 (-1.31z)| norm 0.7170 (-1.37z)| lr 1.64e-04 | 4148.55 ms | 32.5% bf16 MFU | 125283 tok/s step 206/19560 | loss 6.296169 (-1.35z)| norm 0.8602 (-0.64z)| lr 1.65e-04 | 4158.39 ms | 32.5% bf16 MFU | 125323 tok/s step 207/19560 | loss 6.383993 (-1.10z)| norm 1.0313 (+0.22z)| lr 1.66e-04 | 4150.88 ms | 32.5% bf16 MFU | 125372 tok/s step 208/19560 | loss 6.295389 (-1.34z)| norm 1.1156 (+0.65z)| lr 1.66e-04 | 4159.66 ms | 32.5% bf16 MFU | 125405 tok/s step 209/19560 | loss 6.367989 (-1.12z)| norm 1.6171 (+3.09z)| lr 1.67e-04 | 4152.46 ms | 32.5% bf16 MFU | 125448 tok/s step 210/19560 | loss 6.264786 (-1.42z)| norm 1.0103 (+0.10z)| lr 1.68e-04 | 4158.48 ms | 32.5% bf16 MFU | 125480 tok/s step 211/19560 | loss 6.312981 (-1.27z)| norm 0.8526 (-0.67z)| lr 1.69e-04 | 4163.79 ms | 32.4% bf16 MFU | 125501 tok/s step 212/19560 | loss 6.387210 (-1.04z)| norm 0.9364 (-0.25z)| lr 1.70e-04 | 4160.94 ms | 32.4% bf16 MFU | 125526 tok/s step 213/19560 | loss 6.344419 (-1.16z)| norm 1.1950 (+1.03z)| lr 1.70e-04 | 4154.42 ms | 32.5% bf16 MFU | 125560 tok/s step 214/19560 | loss 6.302351 (-1.29z)| norm 1.1381 (+0.74z)| lr 1.71e-04 | 4157.67 ms | 32.5% bf16 MFU | 125587 tok/s step 215/19560 | loss 6.263709 (-1.40z)| norm 0.8371 (-0.73z)| lr 1.72e-04 | 4148.78 ms | 32.5% bf16 MFU | 125626 tok/s step 216/19560 | loss 6.233961 (-1.49z)| norm 0.7299 (-1.24z)| lr 1.73e-04 | 4166.06 ms | 32.4% bf16 MFU | 125637 tok/s 
step 217/19560 | loss 6.240318 (-1.46z)| norm 0.9054 (-0.38z)| lr 1.74e-04 | 4160.54 ms | 32.5% bf16 MFU | 125656 tok/s step 218/19560 | loss 6.323627 (-1.17z)| norm 0.7061 (-1.34z)| lr 1.74e-04 | 4156.19 ms | 32.5% bf16 MFU | 125681 tok/s step 219/19560 | loss 6.307807 (-1.22z)| norm 0.6541 (-1.57z)| lr 1.75e-04 | 4157.27 ms | 32.5% bf16 MFU | 125702 tok/s step 220/19560 | loss 6.313091 (-1.19z)| norm 0.7758 (-0.97z)| lr 1.76e-04 | 4198.45 ms | 32.2% bf16 MFU | 125661 tok/s step 221/19560 | loss 6.263710 (-1.36z)| norm 0.8646 (-0.54z)| lr 1.77e-04 | 4150.94 ms | 32.5% bf16 MFU | 125693 tok/s step 222/19560 | loss 6.260327 (-1.36z)| norm 1.1627 (+0.89z)| lr 1.78e-04 | 4151.52 ms | 32.5% bf16 MFU | 125723 tok/s step 223/19560 | loss 6.400703 (-0.84z)| norm 0.9884 (+0.06z)| lr 1.78e-04 | 4155.43 ms | 32.5% bf16 MFU | 125745 tok/s step 224/19560 | loss 6.229946 (-1.47z)| norm 0.8501 (-0.60z)| lr 1.79e-04 | 4158.57 ms | 32.5% bf16 MFU | 125762 tok/s step 225/19560 | loss 6.214905 (-1.51z)| norm 0.9942 (+0.10z)| lr 1.80e-04 | 4153.95 ms | 32.5% bf16 MFU | 125784 tok/s step 226/19560 | loss 6.261968 (-1.32z)| norm 1.3606 (+1.83z)| lr 1.81e-04 | 4164.06 ms | 32.4% bf16 MFU | 125791 tok/s step 227/19560 | loss 6.227578 (-1.44z)| norm 1.1445 (+0.79z)| lr 1.82e-04 | 4173.24 ms | 32.4% bf16 MFU | 125783 tok/s step 228/19560 | loss 6.255919 (-1.32z)| norm 1.0158 (+0.17z)| lr 1.82e-04 | 4153.46 ms | 32.5% bf16 MFU | 125805 tok/s step 229/19560 | loss 6.267667 (-1.27z)| norm 0.9568 (-0.12z)| lr 1.83e-04 | 4157.88 ms | 32.5% bf16 MFU | 125819 tok/s step 230/19560 | loss 6.216035 (-1.46z)| norm 1.0463 (+0.30z)| lr 1.84e-04 | 4159.01 ms | 32.5% bf16 MFU | 125832 tok/s step 231/19560 | loss 6.206936 (-1.48z)| norm 1.1274 (+0.68z)| lr 1.85e-04 | 4146.67 ms | 32.6% bf16 MFU | 125862 tok/s step 232/19560 | loss 6.184789 (-1.56z)| norm 0.7092 (-1.34z)| lr 1.86e-04 | 4168.48 ms | 32.4% bf16 MFU | 125857 tok/s step 233/19560 | loss 6.280503 (-1.15z)| norm 0.8139 (-0.85z)| lr 1.86e-04 | 
4163.22 ms | 32.4% bf16 MFU | 125861 tok/s step 234/19560 | loss 6.198047 (-1.48z)| norm 0.6888 (-1.44z)| lr 1.87e-04 | 4155.36 ms | 32.5% bf16 MFU | 125877 tok/s step 235/19560 | loss 6.202534 (-1.45z)| norm 0.6967 (-1.40z)| lr 1.88e-04 | 4147.89 ms | 32.6% bf16 MFU | 125903 tok/s step 236/19560 | loss 6.213030 (-1.40z)| norm 0.8090 (-0.84z)| lr 1.89e-04 | 4162.93 ms | 32.4% bf16 MFU | 125905 tok/s step 237/19560 | loss 6.214414 (-1.38z)| norm 0.7887 (-0.93z)| lr 1.90e-04 | 4155.37 ms | 32.5% bf16 MFU | 125918 tok/s step 238/19560 | loss 6.180139 (-1.52z)| norm 0.8938 (-0.42z)| lr 1.90e-04 | 4158.29 ms | 32.5% bf16 MFU | 125926 tok/s step 239/19560 | loss 6.169838 (-1.55z)| norm 0.9038 (-0.36z)| lr 1.91e-04 | 4150.51 ms | 32.5% bf16 MFU | 125946 tok/s step 240/19560 | loss 6.212814 (-1.35z)| norm 1.0581 (+0.39z)| lr 1.92e-04 | 4153.95 ms | 32.5% bf16 MFU | 125959 tok/s step 241/19560 | loss 6.179578 (-1.49z)| norm 0.8702 (-0.53z)| lr 1.93e-04 | 4159.22 ms | 32.5% bf16 MFU | 125964 tok/s step 242/19560 | loss 6.194181 (-1.41z)| norm 0.9585 (-0.08z)| lr 1.94e-04 | 4155.12 ms | 32.5% bf16 MFU | 125975 tok/s step 243/19560 | loss 6.173085 (-1.49z)| norm 0.8372 (-0.68z)| lr 1.94e-04 | 4159.04 ms | 32.5% bf16 MFU | 125979 tok/s step 244/19560 | loss 6.237942 (-1.17z)| norm 0.9921 (+0.08z)| lr 1.95e-04 | 4154.03 ms | 32.5% bf16 MFU | 125991 tok/s step 245/19560 | loss 6.183290 (-1.42z)| norm 1.5319 (+2.65z)| lr 1.96e-04 | 4154.22 ms | 32.5% bf16 MFU | 126001 tok/s step 246/19560 | loss 6.267660 (-1.00z)| norm 0.9156 (-0.31z)| lr 1.97e-04 | 4154.08 ms | 32.5% bf16 MFU | 126012 tok/s step 247/19560 | loss 6.225778 (-1.19z)| norm 0.9294 (-0.25z)| lr 1.98e-04 | 4157.63 ms | 32.5% bf16 MFU | 126016 tok/s step 248/19560 | loss 6.242151 (-1.10z)| norm 0.9625 (-0.07z)| lr 1.98e-04 | 4150.60 ms | 32.5% bf16 MFU | 126031 tok/s step 249/19560 | loss 6.134775 (-1.62z)| norm 0.8988 (-0.39z)| lr 1.99e-04 | 4167.18 ms | 32.4% bf16 MFU | 126021 tok/s step 250/19560 | loss 6.177650 
(-1.39z)| norm 0.8421 (-0.67z)| lr 2.00e-04 | 4158.24 ms | 32.5% bf16 MFU | 126024 tok/s val loss 6.221948 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating 
HellaSwag: 610/1256
evaluating HellaSwag: 1250/1256 HellaSwag: 2477/10042 = 0.246664 step 251/19560 | loss 6.236647 (-1.08z)| norm 0.7494 (-1.12z)| lr 2.01e-04 | 4192.38 ms | 32.2% bf16 MFU | 125975 tok/s step 252/19560 | loss 6.156516 (-1.48z)| norm 0.8589 (-0.59z)| lr 2.02e-04 | 4160.98 ms | 32.4% bf16 MFU | 125977 tok/s step 253/19560 | loss 6.170504 (-1.40z)| norm 0.8686 (-0.54z)| lr 2.02e-04 | 4153.97 ms | 32.5% bf16 MFU | 125989 tok/s step 254/19560 | loss 6.137454 (-1.56z)| norm 0.7780 (-0.98z)| lr 2.03e-04 | 4170.37 ms | 32.4% bf16 MFU | 125975 tok/s step 255/19560 | loss 6.140139 (-1.54z)| norm 0.9482 (-0.13z)| lr 2.04e-04 | 4161.77 ms | 32.4% bf16 MFU | 125975 tok/s step 256/19560 | loss 6.181392 (-1.31z)| norm 1.1325 (+0.79z)| lr 2.05e-04 | 4160.78 ms | 32.5% bf16 MFU | 125977 tok/s step 257/19560 | loss 6.222062 (-1.08z)| norm 0.9302 (-0.22z)| lr 2.06e-04 | 4154.85 ms | 32.5% bf16 MFU | 125987 tok/s step 258/19560 | loss 6.146850 (-1.48z)| norm 0.8752 (-0.51z)| lr 2.06e-04 | 4155.39 ms | 32.5% bf16 MFU | 125996 tok/s step 259/19560 | loss 6.138995 (-1.51z)| norm 1.1021 (+0.62z)| lr 2.07e-04 | 4264.40 ms | 31.7% bf16 MFU | 125844 tok/s step 260/19560 | loss 6.105643 (-1.69z)| norm 1.1143 (+0.67z)| lr 2.08e-04 | 4151.45 ms | 32.5% bf16 MFU | 125866 tok/s step 261/19560 | loss 6.110498 (-1.65z)| norm 0.6688 (-1.60z)| lr 2.09e-04 | 4160.22 ms | 32.5% bf16 MFU | 125874 tok/s step 262/19560 | loss 6.155584 (-1.38z)| norm 0.6538 (-1.66z)| lr 2.10e-04 | 4151.69 ms | 32.5% bf16 MFU | 125895 tok/s step 263/19560 | loss 6.110058 (-1.63z)| norm 0.9879 (+0.03z)| lr 2.10e-04 | 4155.79 ms | 32.5% bf16 MFU | 125908 tok/s step 264/19560 | loss 6.077377 (-1.80z)| norm 0.9652 (-0.09z)| lr 2.11e-04 | 4152.25 ms | 32.5% bf16 MFU | 125926 tok/s step 265/19560 | loss 6.143983 (-1.39z)| norm 0.7356 (-1.24z)| lr 2.12e-04 | 4158.43 ms | 32.5% bf16 MFU | 125933 tok/s step 266/19560 | loss 6.149437 (-1.34z)| norm 0.9360 (-0.21z)| lr 2.13e-04 | 4149.59 ms | 32.5% bf16 MFU | 125954 tok/s step 267/19560 
| loss 6.136714 (-1.40z)| norm 1.1282 (+0.79z)| lr 2.14e-04 | 4154.80 ms | 32.5% bf16 MFU | 125966 tok/s step 268/19560 | loss 6.122228 (-1.47z)| norm 0.9490 (-0.13z)| lr 2.14e-04 | 4157.51 ms | 32.5% bf16 MFU | 125973 tok/s step 269/19560 | loss 6.110961 (-1.52z)| norm 0.9763 (+0.00z)| lr 2.15e-04 | 4222.04 ms | 32.0% bf16 MFU | 125883 tok/s step 270/19560 | loss 6.217076 (-0.85z)| norm 0.8974 (-0.40z)| lr 2.16e-04 | 4147.71 ms | 32.6% bf16 MFU | 125909 tok/s step 271/19560 | loss 6.145933 (-1.28z)| norm 0.9909 (+0.08z)| lr 2.17e-04 | 4165.78 ms | 32.4% bf16 MFU | 125906 tok/s step 272/19560 | loss 6.076261 (-1.69z)| norm 1.1063 (+0.69z)| lr 2.18e-04 | 4161.41 ms | 32.4% bf16 MFU | 125910 tok/s step 273/19560 | loss 6.101329 (-1.52z)| norm 0.6258 (-1.77z)| lr 2.18e-04 | 4171.75 ms | 32.4% bf16 MFU | 125899 tok/s step 274/19560 | loss 6.060071 (-1.75z)| norm 0.7210 (-1.27z)| lr 2.19e-04 | 4150.91 ms | 32.5% bf16 MFU | 125919 tok/s step 275/19560 | loss 6.124584 (-1.32z)| norm 0.7414 (-1.15z)| lr 2.20e-04 | 4158.69 ms | 32.5% bf16 MFU | 125927 tok/s step 276/19560 | loss 6.038493 (-1.85z)| norm 0.6435 (-1.62z)| lr 2.21e-04 | 4150.39 ms | 32.5% bf16 MFU | 125946 tok/s step 277/19560 | loss 6.030230 (-1.88z)| norm 0.5887 (-1.86z)| lr 2.22e-04 | 4148.94 ms | 32.5% bf16 MFU | 125968 tok/s step 278/19560 | loss 6.047566 (-1.74z)| norm 0.7485 (-1.06z)| lr 2.22e-04 | 4221.54 ms | 32.0% bf16 MFU | 125879 tok/s step 279/19560 | loss 6.065534 (-1.61z)| norm 0.7356 (-1.12z)| lr 2.23e-04 | 4157.07 ms | 32.5% bf16 MFU | 125891 tok/s step 280/19560 | loss 6.069631 (-1.56z)| norm 0.8055 (-0.75z)| lr 2.24e-04 | 4149.71 ms | 32.5% bf16 MFU | 125913 tok/s step 281/19560 | loss 6.084097 (-1.44z)| norm 1.0804 (+0.67z)| lr 2.25e-04 | 4154.72 ms | 32.5% bf16 MFU | 125927 tok/s step 282/19560 | loss 6.074366 (-1.49z)| norm 1.2657 (+1.60z)| lr 2.26e-04 | 4164.16 ms | 32.4% bf16 MFU | 125926 tok/s step 283/19560 | loss 6.024372 (-1.79z)| norm 0.9493 (-0.03z)| lr 2.26e-04 | 4155.21 ms | 
32.5% bf16 MFU | 125939 tok/s step 284/19560 | loss 6.082330 (-1.39z)| norm 0.9970 (+0.21z)| lr 2.27e-04 | 4146.00 ms | 32.6% bf16 MFU | 125965 tok/s step 285/19560 | loss 6.075396 (-1.42z)| norm 1.0994 (+0.73z)| lr 2.28e-04 | 4161.87 ms | 32.4% bf16 MFU | 125965 tok/s step 286/19560 | loss 5.984171 (-1.99z)| norm 0.9488 (-0.04z)| lr 2.29e-04 | 4160.77 ms | 32.5% bf16 MFU | 125967 tok/s step 287/19560 | loss 6.018698 (-1.74z)| norm 1.2429 (+1.46z)| lr 2.30e-04 | 4151.29 ms | 32.5% bf16 MFU | 125984 tok/s step 288/19560 | loss 6.056719 (-1.46z)| norm 1.0477 (+0.46z)| lr 2.30e-04 | 4162.83 ms | 32.4% bf16 MFU | 125982 tok/s step 289/19560 | loss 6.072867 (-1.34z)| norm 1.2420 (+1.45z)| lr 2.31e-04 | 4148.16 ms | 32.5% bf16 MFU | 126002 tok/s step 290/19560 | loss 6.038412 (-1.55z)| norm 0.9448 (-0.06z)| lr 2.32e-04 | 4161.79 ms | 32.4% bf16 MFU | 126001 tok/s step 291/19560 | loss 6.037633 (-1.53z)| norm 1.0092 (+0.27z)| lr 2.33e-04 | 4160.58 ms | 32.5% bf16 MFU | 126001 tok/s step 292/19560 | loss 6.031017 (-1.56z)| norm 0.8940 (-0.32z)| lr 2.34e-04 | 4158.45 ms | 32.5% bf16 MFU | 126005 tok/s step 293/19560 | loss 6.048209 (-1.43z)| norm 0.7778 (-0.91z)| lr 2.34e-04 | 4161.06 ms | 32.4% bf16 MFU | 126005 tok/s step 294/19560 | loss 6.016115 (-1.62z)| norm 0.9253 (-0.15z)| lr 2.35e-04 | 4147.79 ms | 32.6% bf16 MFU | 126025 tok/s step 295/19560 | loss 6.048116 (-1.40z)| norm 0.9418 (-0.06z)| lr 2.36e-04 | 4157.60 ms | 32.5% bf16 MFU | 126029 tok/s step 296/19560 | loss 6.024153 (-1.54z)| norm 0.8403 (-0.58z)| lr 2.37e-04 | 4153.64 ms | 32.5% bf16 MFU | 126038 tok/s step 297/19560 | loss 5.961102 (-1.94z)| norm 0.6655 (-1.46z)| lr 2.38e-04 | 4673.02 ms | 28.9% bf16 MFU | 125346 tok/s step 298/19560 | loss 6.019097 (-1.52z)| norm 0.6860 (-1.33z)| lr 2.38e-04 | 4709.81 ms | 28.7% bf16 MFU | 124645 tok/s step 299/19560 | loss 6.019897 (-1.49z)| norm 0.5758 (-1.85z)| lr 2.39e-04 | 4420.81 ms | 30.5% bf16 MFU | 124342 tok/s step 300/19560 | loss 5.948205 (-1.96z)| norm 
0.8754 (-0.33z)| lr 2.40e-04 | 4330.81 ms | 31.2% bf16 MFU | 124178 tok/s step 301/19560 | loss 6.052737 (-1.22z)| norm 1.8489 (+4.27z)| lr 2.41e-04 | 4232.27 ms | 31.9% bf16 MFU | 124163 tok/s step 302/19560 | loss 6.047921 (-1.23z)| norm 0.7759 (-0.80z)| lr 2.42e-04 | 4224.28 ms | 32.0% bf16 MFU | 124161 tok/s step 303/19560 | loss 6.045883 (-1.23z)| norm 0.9029 (-0.19z)| lr 2.42e-04 | 4202.72 ms | 32.1% bf16 MFU | 124190 tok/s step 304/19560 | loss 6.011118 (-1.46z)| norm 0.9368 (-0.02z)| lr 2.43e-04 | 4186.60 ms | 32.2% bf16 MFU | 124242 tok/s step 305/19560 | loss 5.974907 (-1.68z)| norm 0.9516 (+0.04z)| lr 2.44e-04 | 4305.08 ms | 31.4% bf16 MFU | 124119 tok/s step 306/19560 | loss 5.980365 (-1.62z)| norm 1.1802 (+1.12z)| lr 2.45e-04 | 4220.94 ms | 32.0% bf16 MFU | 124124 tok/s step 307/19560 | loss 5.960108 (-1.74z)| norm 1.0193 (+0.34z)| lr 2.46e-04 | 4151.10 ms | 32.5% bf16 MFU | 124233 tok/s step 308/19560 | loss 5.935138 (-1.89z)| norm 1.4922 (+2.56z)| lr 2.46e-04 | 4212.93 ms | 32.0% bf16 MFU | 124243 tok/s step 309/19560 | loss 6.022835 (-1.25z)| norm 1.1306 (+0.84z)| lr 2.47e-04 | 4191.67 ms | 32.2% bf16 MFU | 124285 tok/s step 310/19560 | loss 5.973405 (-1.58z)| norm 1.2514 (+1.39z)| lr 2.48e-04 | 4150.98 ms | 32.5% bf16 MFU | 124386 tok/s step 311/19560 | loss 5.982403 (-1.50z)| norm 0.9082 (-0.22z)| lr 2.49e-04 | 4201.53 ms | 32.1% bf16 MFU | 124406 tok/s step 312/19560 | loss 5.950314 (-1.70z)| norm 0.8667 (-0.40z)| lr 2.50e-04 | 4162.84 ms | 32.4% bf16 MFU | 124483 tok/s step 313/19560 | loss 5.887350 (-2.12z)| norm 0.8895 (-0.28z)| lr 2.50e-04 | 4152.39 ms | 32.5% bf16 MFU | 124572 tok/s step 314/19560 | loss 5.931644 (-1.78z)| norm 1.1348 (+0.90z)| lr 2.51e-04 | 4208.10 ms | 32.1% bf16 MFU | 124573 tok/s step 315/19560 | loss 6.027246 (-1.08z)| norm 1.3651 (+1.96z)| lr 2.52e-04 | 4250.65 ms | 31.8% bf16 MFU | 124511 tok/s step 316/19560 | loss 5.872153 (-2.17z)| norm 0.6631 (-1.34z)| lr 2.53e-04 | 4154.99 ms | 32.5% bf16 MFU | 124595 tok/s step 
317/19560 | loss 5.963795 (-1.48z)| norm 0.7532 (-0.90z)| lr 2.54e-04 | 4214.07 ms | 32.0% bf16 MFU | 124586 tok/s step 318/19560 | loss 5.986773 (-1.30z)| norm 0.9417 (-0.01z)| lr 2.54e-04 | 4152.73 ms | 32.5% bf16 MFU | 124669 tok/s step 319/19560 | loss 5.948637 (-1.56z)| norm 1.0214 (+0.36z)| lr 2.55e-04 | 4160.02 ms | 32.5% bf16 MFU | 124737 tok/s step 320/19560 | loss 5.932926 (-1.65z)| norm 1.4418 (+2.28z)| lr 2.56e-04 | 4146.06 ms | 32.6% bf16 MFU | 124823 tok/s step 321/19560 | loss 5.971394 (-1.35z)| norm 0.6459 (-1.40z)| lr 2.57e-04 | 4150.89 ms | 32.5% bf16 MFU | 124897 tok/s step 322/19560 | loss 5.985882 (-1.23z)| norm 0.8844 (-0.30z)| lr 2.58e-04 | 4228.86 ms | 31.9% bf16 MFU | 124851 tok/s step 323/19560 | loss 5.955591 (-1.44z)| norm 0.8070 (-0.66z)| lr 2.58e-04 | 4326.82 ms | 31.2% bf16 MFU | 124667 tok/s step 324/19560 | loss 5.948941 (-1.47z)| norm 1.0503 (+0.45z)| lr 2.59e-04 | 4155.75 ms | 32.5% bf16 MFU | 124742 tok/s step 325/19560 | loss 5.984352 (-1.19z)| norm 1.3143 (+1.66z)| lr 2.60e-04 | 4152.38 ms | 32.5% bf16 MFU | 124818 tok/s step 326/19560 | loss 5.890656 (-1.89z)| norm 0.9932 (+0.19z)| lr 2.61e-04 | 4233.70 ms | 31.9% bf16 MFU | 124769 tok/s step 327/19560 | loss 5.933491 (-1.53z)| norm 0.7520 (-0.91z)| lr 2.62e-04 | 4151.29 ms | 32.5% bf16 MFU | 124845 tok/s step 328/19560 | loss 5.895838 (-1.79z)| norm 0.7105 (-1.09z)| lr 2.62e-04 | 4347.81 ms | 31.1% bf16 MFU | 124632 tok/s step 329/19560 | loss 5.883585 (-1.85z)| norm 0.8851 (-0.28z)| lr 2.63e-04 | 4210.50 ms | 32.1% bf16 MFU | 124627 tok/s step 330/19560 | loss 5.858470 (-2.01z)| norm 1.0029 (+0.28z)| lr 2.64e-04 | 4188.98 ms | 32.2% bf16 MFU | 124653 tok/s step 331/19560 | loss 5.907010 (-1.61z)| norm 1.0706 (+0.59z)| lr 2.65e-04 | 4153.17 ms | 32.5% bf16 MFU | 124732 tok/s step 332/19560 | loss 5.833210 (-2.13z)| norm 1.0682 (+0.57z)| lr 2.66e-04 | 4149.83 ms | 32.5% bf16 MFU | 124813 tok/s step 333/19560 | loss 5.903569 (-1.57z)| norm 1.0669 (+0.55z)| lr 2.66e-04 | 4154.20 
ms | 32.5% bf16 MFU | 124883 tok/s step 334/19560 | loss 5.932883 (-1.33z)| norm 0.9608 (+0.04z)| lr 2.67e-04 | 4270.27 ms | 31.6% bf16 MFU | 124777 tok/s step 335/19560 | loss 5.954212 (-1.15z)| norm 1.0124 (+0.29z)| lr 2.68e-04 | 4258.07 ms | 31.7% bf16 MFU | 124695 tok/s step 336/19560 | loss 6.005848 (-0.74z)| norm 1.0502 (+0.47z)| lr 2.69e-04 | 4154.23 ms | 32.5% bf16 MFU | 124770 tok/s step 337/19560 | loss 5.850863 (-1.92z)| norm 1.0679 (+0.60z)| lr 2.70e-04 | 4164.64 ms | 32.4% bf16 MFU | 124826 tok/s step 338/19560 | loss 5.904289 (-1.48z)| norm 1.1995 (+1.23z)| lr 2.70e-04 | 4292.11 ms | 31.5% bf16 MFU | 124693 tok/s step 339/19560 | loss 5.851914 (-1.85z)| norm 0.8882 (-0.29z)| lr 2.71e-04 | 4158.22 ms | 32.5% bf16 MFU | 124762 tok/s step 340/19560 | loss 5.859500 (-1.78z)| norm 0.8630 (-0.41z)| lr 2.72e-04 | 4146.12 ms | 32.6% bf16 MFU | 124847 tok/s step 341/19560 | loss 5.891283 (-1.51z)| norm 0.9195 (-0.13z)| lr 2.73e-04 | 4261.39 ms | 31.7% bf16 MFU | 124756 tok/s step 342/19560 | loss 5.862164 (-1.71z)| norm 0.9537 (+0.05z)| lr 2.74e-04 | 4149.07 ms | 32.5% bf16 MFU | 124836 tok/s step 343/19560 | loss 5.875343 (-1.58z)| norm 0.9778 (+0.16z)| lr 2.74e-04 | 4164.57 ms | 32.4% bf16 MFU | 124889 tok/s step 344/19560 | loss 5.851410 (-1.74z)| norm 1.0083 (+0.30z)| lr 2.75e-04 | 4151.39 ms | 32.5% bf16 MFU | 124959 tok/s step 345/19560 | loss 5.865264 (-1.60z)| norm 0.9533 (+0.03z)| lr 2.76e-04 | 4182.10 ms | 32.3% bf16 MFU | 124980 tok/s step 346/19560 | loss 5.875500 (-1.50z)| norm 0.8789 (-0.35z)| lr 2.77e-04 | 4153.01 ms | 32.5% bf16 MFU | 125043 tok/s step 347/19560 | loss 5.835314 (-1.79z)| norm 0.9727 (+0.11z)| lr 2.78e-04 | 4257.25 ms | 31.7% bf16 MFU | 124948 tok/s step 348/19560 | loss 5.923630 (-1.09z)| norm 0.8666 (-0.43z)| lr 2.78e-04 | 4168.43 ms | 32.4% bf16 MFU | 124990 tok/s step 349/19560 | loss 5.800867 (-2.01z)| norm 1.0954 (+0.71z)| lr 2.79e-04 | 4185.49 ms | 32.3% bf16 MFU | 125003 tok/s step 350/19560 | loss 5.863042 (-1.50z)| 
norm 1.0058 (+0.27z)| lr 2.80e-04 | 4155.46 ms | 32.5% bf16 MFU | 125062 tok/s step 351/19560 | loss 5.866401 (-1.48z)| norm 0.9877 (+0.18z)| lr 2.81e-04 | 4152.00 ms | 32.5% bf16 MFU | 125122 tok/s step 352/19560 | loss 5.893405 (-1.24z)| norm 1.3022 (+1.73z)| lr 2.82e-04 | 4165.40 ms | 32.4% bf16 MFU | 125159 tok/s step 353/19560 | loss 5.812806 (-1.85z)| norm 1.1029 (+0.73z)| lr 2.82e-04 | 4168.54 ms | 32.4% bf16 MFU | 125190 tok/s step 354/19560 | loss 5.847738 (-1.55z)| norm 0.8404 (-0.57z)| lr 2.83e-04 | 4154.48 ms | 32.5% bf16 MFU | 125240 tok/s step 355/19560 | loss 5.839805 (-1.59z)| norm 1.0083 (+0.29z)| lr 2.84e-04 | 4156.67 ms | 32.5% bf16 MFU | 125285 tok/s step 356/19560 | loss 5.893526 (-1.14z)| norm 1.5014 (+2.69z)| lr 2.85e-04 | 4237.09 ms | 31.9% bf16 MFU | 125208 tok/s step 357/19560 | loss 5.842193 (-1.54z)| norm 1.1899 (+1.14z)| lr 2.86e-04 | 4205.33 ms | 32.1% bf16 MFU | 125181 tok/s step 358/19560 | loss 5.798263 (-1.86z)| norm 1.0210 (+0.31z)| lr 2.86e-04 | 4151.65 ms | 32.5% bf16 MFU | 125236 tok/s step 359/19560 | loss 5.861135 (-1.33z)| norm 0.9560 (-0.00z)| lr 2.87e-04 | 4154.91 ms | 32.5% bf16 MFU | 125283 tok/s step 360/19560 | loss 5.850984 (-1.39z)| norm 1.2326 (+1.34z)| lr 2.88e-04 | 4159.55 ms | 32.5% bf16 MFU | 125322 tok/s step 361/19560 | loss 5.810874 (-1.69z)| norm 0.9990 (+0.18z)| lr 2.89e-04 | 4164.13 ms | 32.4% bf16 MFU | 125351 tok/s step 362/19560 | loss 5.882360 (-1.10z)| norm 0.8926 (-0.35z)| lr 2.90e-04 | 4287.36 ms | 31.5% bf16 MFU | 125198 tok/s step 363/19560 | loss 5.759752 (-2.05z)| norm 0.8730 (-0.46z)| lr 2.90e-04 | 4153.08 ms | 32.5% bf16 MFU | 125250 tok/s step 364/19560 | loss 5.793723 (-1.75z)| norm 1.1321 (+0.82z)| lr 2.91e-04 | 4165.91 ms | 32.4% bf16 MFU | 125280 tok/s step 365/19560 | loss 5.837183 (-1.38z)| norm 0.9643 (-0.02z)| lr 2.92e-04 | 4177.29 ms | 32.3% bf16 MFU | 125291 tok/s step 366/19560 | loss 5.810830 (-1.56z)| norm 0.8660 (-0.51z)| lr 2.93e-04 | 4152.34 ms | 32.5% bf16 MFU | 125340 tok/s 
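Note on reading the records above: the lr, ms, MFU, and tok/s columns can be cross-checked against the parameter table at the top of the log. The sketch below is not the trainer's own code. The function names are hypothetical, and the MFU formula (6*N FLOPs per token for the weights plus 6*L*C*T for attention) is an assumption that happens to reproduce the logged figures. For step 350 (4155.46 ms) it gives lr 2.80e-04 and roughly 32.5% MFU, matching the log. The logged tok/s appears to be a smoothed average, so the instantaneous value computed here comes out slightly higher than the 125062 tok/s printed for that step.

```python
# Sanity-check sketch for the logged lr, tok/s, and MFU columns.
# All names are hypothetical; the formulas are assumptions that happen to
# reproduce the logged values, not code taken from the trainer.

MAX_LR = 8.0e-4            # "learning rate (LR)" from the parameter table
WARMUP_STEPS = 1000        # "warmup iterations"
TOKENS_PER_STEP = 524288   # "total batch size" (tokens per optimizer step)
NUM_PARAMS = 124475904     # "num_parameters"
L, C, T = 12, 768, 1024    # num_layers, channels, sequence length
PEAK_FLOPS = 312.0e12      # "peak TFlops" (BF16) reported for the device


def warmup_lr(step: int) -> float:
    """Linear warmup from 0 to MAX_LR; only valid for step <= WARMUP_STEPS."""
    if not 0 <= step <= WARMUP_STEPS:
        raise ValueError("the cosine decay phase after warmup is not modeled here")
    return MAX_LR * step / WARMUP_STEPS


def tokens_per_second(step_ms: float) -> float:
    """Instantaneous throughput for a single step of the given duration."""
    return TOKENS_PER_STEP / (step_ms / 1000.0)


def estimated_mfu(step_ms: float) -> float:
    """Assumed MFU estimate: achieved FLOP/s divided by the device peak."""
    flops_per_token = 6 * NUM_PARAMS + 6 * L * C * T
    achieved_flops = flops_per_token * tokens_per_second(step_ms)
    return achieved_flops / PEAK_FLOPS


if __name__ == "__main__":
    # Values taken from the record for step 350 (4155.46 ms).
    print(f"lr    {warmup_lr(350):.2e}")
    print(f"tok/s {tokens_per_second(4155.46):.0f}")
    print(f"MFU   {estimated_mfu(4155.46):.1%}")
```

The same check works for any step in the warmup phase: pass the step number to warmup_lr and the logged millisecond figure to the other two functions.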
step 367/19560 | loss 5.841353 (-1.30z)| norm 1.0245 (+0.28z)| lr 2.94e-04 | 4272.54 ms | 31.6% bf16 MFU | 125208 tok/s step 368/19560 | loss 5.885303 (-0.93z)| norm 1.0111 (+0.21z)| lr 2.94e-04 | 4152.69 ms | 32.5% bf16 MFU | 125261 tok/s step 369/19560 | loss 5.739049 (-2.07z)| norm 1.0061 (+0.18z)| lr 2.95e-04 | 4335.60 ms | 31.1% bf16 MFU | 125044 tok/s step 370/19560 | loss 5.745680 (-1.98z)| norm 0.9652 (-0.02z)| lr 2.96e-04 | 4200.14 ms | 32.1% bf16 MFU | 125033 tok/s step 371/19560 | loss 5.733153 (-2.03z)| norm 0.7702 (-1.00z)| lr 2.97e-04 | 4210.47 ms | 32.1% bf16 MFU | 125007 tok/s step 372/19560 | loss 5.729033 (-2.03z)| norm 0.8405 (-0.64z)| lr 2.98e-04 | 4153.11 ms | 32.5% bf16 MFU | 125069 tok/s step 373/19560 | loss 5.757112 (-1.78z)| norm 0.9307 (-0.17z)| lr 2.98e-04 | 4199.45 ms | 32.2% bf16 MFU | 125058 tok/s step 374/19560 | loss 5.829852 (-1.19z)| norm 1.2167 (+1.28z)| lr 2.99e-04 | 4221.79 ms | 32.0% bf16 MFU | 125014 tok/s step 375/19560 | loss 5.801025 (-1.41z)| norm 0.8141 (-0.77z)| lr 3.00e-04 | 4158.81 ms | 32.5% bf16 MFU | 125067 tok/s step 376/19560 | loss 5.716305 (-2.07z)| norm 0.9776 (+0.06z)| lr 3.01e-04 | 4264.92 ms | 31.7% bf16 MFU | 124960 tok/s step 377/19560 | loss 5.757268 (-1.70z)| norm 1.0424 (+0.39z)| lr 3.02e-04 | 4171.50 ms | 32.4% bf16 MFU | 124996 tok/s step 378/19560 | loss 5.855805 (-0.89z)| norm 1.1116 (+0.73z)| lr 3.02e-04 | 4157.30 ms | 32.5% bf16 MFU | 125052 tok/s step 379/19560 | loss 5.903653 (-0.49z)| norm 1.6031 (+3.10z)| lr 3.03e-04 | 4183.30 ms | 32.3% bf16 MFU | 125066 tok/s step 380/19560 | loss 5.796880 (-1.36z)| norm 0.8096 (-0.81z)| lr 3.04e-04 | 4150.56 ms | 32.5% bf16 MFU | 125128 tok/s step 381/19560 | loss 5.738654 (-1.81z)| norm 0.8856 (-0.44z)| lr 3.05e-04 | 4155.91 ms | 32.5% bf16 MFU | 125180 tok/s step 382/19560 | loss 5.777521 (-1.47z)| norm 1.1443 (+0.82z)| lr 3.06e-04 | 4173.06 ms | 32.4% bf16 MFU | 125203 tok/s step 383/19560 | loss 5.807165 (-1.20z)| norm 1.2783 (+1.46z)| lr 3.06e-04 | 
4146.52 ms | 32.6% bf16 MFU | 125265 tok/s step 384/19560 | loss 5.726371 (-1.85z)| norm 0.9270 (-0.25z)| lr 3.07e-04 | 4150.48 ms | 32.5% bf16 MFU | 125317 tok/s step 385/19560 | loss 5.838579 (-0.90z)| norm 1.1557 (+0.86z)| lr 3.08e-04 | 4168.63 ms | 32.4% bf16 MFU | 125340 tok/s step 386/19560 | loss 5.776128 (-1.41z)| norm 0.8629 (-0.57z)| lr 3.09e-04 | 4151.81 ms | 32.5% bf16 MFU | 125387 tok/s step 387/19560 | loss 5.696269 (-2.05z)| norm 0.6649 (-1.51z)| lr 3.10e-04 | 4153.01 ms | 32.5% bf16 MFU | 125430 tok/s step 388/19560 | loss 5.723637 (-1.79z)| norm 0.7931 (-0.88z)| lr 3.10e-04 | 4173.86 ms | 32.3% bf16 MFU | 125439 tok/s step 389/19560 | loss 5.695736 (-1.98z)| norm 1.1806 (+0.98z)| lr 3.11e-04 | 4159.91 ms | 32.5% bf16 MFU | 125469 tok/s step 390/19560 | loss 5.781735 (-1.25z)| norm 1.3579 (+1.82z)| lr 3.12e-04 | 4179.58 ms | 32.3% bf16 MFU | 125467 tok/s step 391/19560 | loss 5.692769 (-1.96z)| norm 1.2215 (+1.14z)| lr 3.13e-04 | 4298.39 ms | 31.4% bf16 MFU | 125292 tok/s step 392/19560 | loss 5.767385 (-1.31z)| norm 0.9559 (-0.14z)| lr 3.14e-04 | 4152.96 ms | 32.5% bf16 MFU | 125340 tok/s step 393/19560 | loss 5.714556 (-1.73z)| norm 0.9161 (-0.34z)| lr 3.14e-04 | 4154.09 ms | 32.5% bf16 MFU | 125384 tok/s step 394/19560 | loss 5.739913 (-1.50z)| norm 1.0952 (+0.52z)| lr 3.15e-04 | 4164.64 ms | 32.4% bf16 MFU | 125409 tok/s step 395/19560 | loss 5.732868 (-1.54z)| norm 1.2935 (+1.46z)| lr 3.16e-04 | 4168.54 ms | 32.4% bf16 MFU | 125427 tok/s step 396/19560 | loss 5.729807 (-1.54z)| norm 0.9829 (-0.03z)| lr 3.17e-04 | 4154.56 ms | 32.5% bf16 MFU | 125465 tok/s step 397/19560 | loss 5.765484 (-1.22z)| norm 0.9871 (-0.01z)| lr 3.18e-04 | 4180.15 ms | 32.3% bf16 MFU | 125463 tok/s step 398/19560 | loss 5.790380 (-1.01z)| norm 1.1744 (+0.88z)| lr 3.18e-04 | 4185.51 ms | 32.3% bf16 MFU | 125453 tok/s step 399/19560 | loss 5.804310 (-0.87z)| norm 1.4872 (+2.31z)| lr 3.19e-04 | 4162.20 ms | 32.4% bf16 MFU | 125479 tok/s step 400/19560 | loss 5.681216 
(-1.93z)| norm 0.8174 (-0.83z)| lr 3.20e-04 | 4152.63 ms | 32.5% bf16 MFU | 125518 tok/s step 401/19560 | loss 5.723643 (-1.53z)| norm 0.7632 (-1.10z)| lr 3.21e-04 | 4166.47 ms | 32.4% bf16 MFU | 125534 tok/s step 402/19560 | loss 5.648262 (-2.15z)| norm 0.7738 (-1.05z)| lr 3.22e-04 | 4165.95 ms | 32.4% bf16 MFU | 125549 tok/s step 403/19560 | loss 5.673753 (-1.90z)| norm 0.9322 (-0.31z)| lr 3.22e-04 | 4165.74 ms | 32.4% bf16 MFU | 125565 tok/s step 404/19560 | loss 5.718718 (-1.48z)| norm 1.2051 (+0.98z)| lr 3.23e-04 | 4157.10 ms | 32.5% bf16 MFU | 125592 tok/s step 405/19560 | loss 5.778452 (-0.94z)| norm 1.1867 (+0.88z)| lr 3.24e-04 | 4167.03 ms | 32.4% bf16 MFU | 125604 tok/s step 406/19560 | loss 5.709277 (-1.53z)| norm 1.1049 (+0.47z)| lr 3.25e-04 | 4167.33 ms | 32.4% bf16 MFU | 125614 tok/s step 407/19560 | loss 5.723170 (-1.38z)| norm 1.1240 (+0.55z)| lr 3.26e-04 | 4154.86 ms | 32.5% bf16 MFU | 125643 tok/s step 408/19560 | loss 5.695767 (-1.60z)| norm 0.8636 (-0.73z)| lr 3.26e-04 | 4168.67 ms | 32.4% bf16 MFU | 125649 tok/s step 409/19560 | loss 5.710608 (-1.45z)| norm 0.7686 (-1.18z)| lr 3.27e-04 | 4169.54 ms | 32.4% bf16 MFU | 125654 tok/s step 410/19560 | loss 5.675664 (-1.73z)| norm 0.7195 (-1.40z)| lr 3.28e-04 | 4150.25 ms | 32.5% bf16 MFU | 125687 tok/s step 411/19560 | loss 5.705617 (-1.45z)| norm 0.8248 (-0.88z)| lr 3.29e-04 | 4156.01 ms | 32.5% bf16 MFU | 125710 tok/s step 412/19560 | loss 5.726638 (-1.24z)| norm 1.0425 (+0.18z)| lr 3.30e-04 | 4170.76 ms | 32.4% bf16 MFU | 125710 tok/s step 413/19560 | loss 5.654635 (-1.86z)| norm 1.1014 (+0.47z)| lr 3.30e-04 | 4154.91 ms | 32.5% bf16 MFU | 125734 tok/s step 414/19560 | loss 5.702621 (-1.41z)| norm 0.9854 (-0.10z)| lr 3.31e-04 | 4166.25 ms | 32.4% bf16 MFU | 125739 tok/s step 415/19560 | loss 5.721054 (-1.22z)| norm 1.1289 (+0.61z)| lr 3.32e-04 | 4171.09 ms | 32.4% bf16 MFU | 125737 tok/s step 416/19560 | loss 5.640738 (-1.91z)| norm 0.8461 (-0.77z)| lr 3.33e-04 | 4153.11 ms | 32.5% bf16 MFU | 
125762 tok/s step 417/19560 | loss 5.641305 (-1.88z)| norm 0.9293 (-0.35z)| lr 3.34e-04 | 4152.12 ms | 32.5% bf16 MFU | 125788 tok/s step 418/19560 | loss 5.632195 (-1.92z)| norm 0.8481 (-0.74z)| lr 3.34e-04 | 4154.07 ms | 32.5% bf16 MFU | 125809 tok/s step 419/19560 | loss 5.565566 (-2.45z)| norm 0.7941 (-0.99z)| lr 3.35e-04 | 4152.29 ms | 32.5% bf16 MFU | 125832 tok/s step 420/19560 | loss 5.663323 (-1.57z)| norm 0.9061 (-0.45z)| lr 3.36e-04 | 4154.96 ms | 32.5% bf16 MFU | 125849 tok/s step 421/19560 | loss 5.691756 (-1.30z)| norm 1.3216 (+1.55z)| lr 3.37e-04 | 4278.94 ms | 31.6% bf16 MFU | 125683 tok/s step 422/19560 | loss 5.598409 (-2.08z)| norm 1.1702 (+0.81z)| lr 3.38e-04 | 4160.64 ms | 32.5% bf16 MFU | 125700 tok/s step 423/19560 | loss 5.546146 (-2.48z)| norm 0.9855 (-0.09z)| lr 3.38e-04 | 4153.29 ms | 32.5% bf16 MFU | 125726 tok/s step 424/19560 | loss 5.662140 (-1.45z)| norm 0.9454 (-0.29z)| lr 3.39e-04 | 4169.17 ms | 32.4% bf16 MFU | 125728 tok/s step 425/19560 | loss 5.658128 (-1.46z)| norm 1.0338 (+0.13z)| lr 3.40e-04 | 4167.67 ms | 32.4% bf16 MFU | 125731 tok/s step 426/19560 | loss 5.615448 (-1.80z)| norm 1.1230 (+0.55z)| lr 3.41e-04 | 4159.89 ms | 32.5% bf16 MFU | 125746 tok/s step 427/19560 | loss 5.646542 (-1.51z)| norm 1.0879 (+0.37z)| lr 3.42e-04 | 4158.92 ms | 32.5% bf16 MFU | 125762 tok/s step 428/19560 | loss 5.637419 (-1.56z)| norm 0.9192 (-0.49z)| lr 3.42e-04 | 4167.76 ms | 32.4% bf16 MFU | 125764 tok/s step 429/19560 | loss 5.620566 (-1.69z)| norm 1.0762 (+0.36z)| lr 3.43e-04 | 4160.77 ms | 32.5% bf16 MFU | 125776 tok/s step 430/19560 | loss 5.615842 (-1.71z)| norm 1.0863 (+0.41z)| lr 3.44e-04 | 4160.48 ms | 32.5% bf16 MFU | 125788 tok/s step 431/19560 | loss 5.647958 (-1.41z)| norm 0.9957 (-0.09z)| lr 3.45e-04 | 4220.37 ms | 32.0% bf16 MFU | 125710 tok/s step 432/19560 | loss 5.611071 (-1.71z)| norm 1.1269 (+0.62z)| lr 3.46e-04 | 4150.91 ms | 32.5% bf16 MFU | 125740 tok/s step 433/19560 | loss 5.577942 (-1.96z)| norm 1.0052 (-0.05z)| lr 
3.46e-04 | 4155.15 ms | 32.5% bf16 MFU | 125762 tok/s step 434/19560 | loss 5.600244 (-1.74z)| norm 1.1396 (+0.69z)| lr 3.47e-04 | 4166.72 ms | 32.4% bf16 MFU | 125765 tok/s step 435/19560 | loss 5.653098 (-1.26z)| norm 1.1628 (+0.81z)| lr 3.48e-04 | 4159.13 ms | 32.5% bf16 MFU | 125780 tok/s step 436/19560 | loss 5.634659 (-1.39z)| norm 0.8734 (-0.77z)| lr 3.49e-04 | 4151.73 ms | 32.5% bf16 MFU | 125805 tok/s step 437/19560 | loss 5.628964 (-1.43z)| norm 0.8221 (-1.04z)| lr 3.50e-04 | 4166.92 ms | 32.4% bf16 MFU | 125806 tok/s step 438/19560 | loss 5.637209 (-1.34z)| norm 0.7929 (-1.19z)| lr 3.50e-04 | 4152.35 ms | 32.5% bf16 MFU | 125828 tok/s step 439/19560 | loss 5.517962 (-2.34z)| norm 0.8733 (-0.74z)| lr 3.51e-04 | 4155.29 ms | 32.5% bf16 MFU | 125846 tok/s step 440/19560 | loss 5.576565 (-1.79z)| norm 1.1068 (+0.57z)| lr 3.52e-04 | 4158.50 ms | 32.5% bf16 MFU | 125857 tok/s step 441/19560 | loss 5.580826 (-1.72z)| norm 0.9288 (-0.44z)| lr 3.53e-04 | 4161.60 ms | 32.4% bf16 MFU | 125864 tok/s step 442/19560 | loss 5.576472 (-1.72z)| norm 0.9421 (-0.36z)| lr 3.54e-04 | 4165.52 ms | 32.4% bf16 MFU | 125864 tok/s step 443/19560 | loss 5.619856 (-1.34z)| norm 0.9441 (-0.33z)| lr 3.54e-04 | 4154.36 ms | 32.5% bf16 MFU | 125880 tok/s step 444/19560 | loss 5.551109 (-1.89z)| norm 1.1404 (+0.78z)| lr 3.55e-04 | 4163.35 ms | 32.4% bf16 MFU | 125883 tok/s step 445/19560 | loss 5.563831 (-1.75z)| norm 0.8237 (-1.06z)| lr 3.56e-04 | 4158.72 ms | 32.5% bf16 MFU | 125892 tok/s step 446/19560 | loss 5.554285 (-1.81z)| norm 0.8007 (-1.18z)| lr 3.57e-04 | 4160.69 ms | 32.5% bf16 MFU | 125898 tok/s step 447/19560 | loss 5.528479 (-1.99z)| norm 0.7743 (-1.32z)| lr 3.58e-04 | 4159.41 ms | 32.5% bf16 MFU | 125906 tok/s step 448/19560 | loss 5.518400 (-2.03z)| norm 0.6459 (-2.04z)| lr 3.58e-04 | 4160.50 ms | 32.5% bf16 MFU | 125911 tok/s step 449/19560 | loss 5.515499 (-2.02z)| norm 0.6197 (-2.18z)| lr 3.59e-04 | 4159.19 ms | 32.5% bf16 MFU | 125918 tok/s step 450/19560 | loss 
5.501762 (-2.10z)| norm 0.7564 (-1.38z)| lr 3.60e-04 | 4208.98 ms | 32.1% bf16 MFU | 125851 tok/s step 451/19560 | loss 5.552153 (-1.65z)| norm 1.0099 (+0.07z)| lr 3.61e-04 | 4149.73 ms | 32.5% bf16 MFU | 125875 tok/s step 452/19560 | loss 5.508256 (-1.98z)| norm 1.1118 (+0.66z)| lr 3.62e-04 | 4156.28 ms | 32.5% bf16 MFU | 125889 tok/s step 453/19560 | loss 5.520172 (-1.86z)| norm 1.0290 (+0.20z)| lr 3.62e-04 | 4154.01 ms | 32.5% bf16 MFU | 125905 tok/s step 454/19560 | loss 5.563824 (-1.46z)| norm 1.0085 (+0.08z)| lr 3.63e-04 | 4155.58 ms | 32.5% bf16 MFU | 125918 tok/s step 455/19560 | loss 5.568202 (-1.41z)| norm 1.0467 (+0.29z)| lr 3.64e-04 | 4220.96 ms | 32.0% bf16 MFU | 125832 tok/s step 456/19560 | loss 5.565301 (-1.41z)| norm 0.9767 (-0.14z)| lr 3.65e-04 | 4159.36 ms | 32.5% bf16 MFU | 125843 tok/s step 457/19560 | loss 5.534144 (-1.64z)| norm 1.1669 (+0.98z)| lr 3.66e-04 | 4158.00 ms | 32.5% bf16 MFU | 125856 tok/s step 458/19560 | loss 5.498360 (-1.90z)| norm 0.9120 (-0.53z)| lr 3.66e-04 | 4160.02 ms | 32.5% bf16 MFU | 125864 tok/s step 459/19560 | loss 5.518845 (-1.70z)| norm 1.0870 (+0.51z)| lr 3.67e-04 | 4151.03 ms | 32.5% bf16 MFU | 125886 tok/s step 460/19560 | loss 5.481588 (-1.96z)| norm 1.0224 (+0.13z)| lr 3.68e-04 | 4149.77 ms | 32.5% bf16 MFU | 125909 tok/s step 461/19560 | loss 5.535184 (-1.50z)| norm 1.0368 (+0.21z)| lr 3.69e-04 | 4163.36 ms | 32.4% bf16 MFU | 125910 tok/s step 462/19560 | loss 5.561986 (-1.26z)| norm 1.0858 (+0.50z)| lr 3.70e-04 | 4152.10 ms | 32.5% bf16 MFU | 125928 tok/s step 463/19560 | loss 5.490435 (-1.83z)| norm 1.0208 (+0.11z)| lr 3.70e-04 | 4167.38 ms | 32.4% bf16 MFU | 125922 tok/s step 464/19560 | loss 5.502698 (-1.72z)| norm 1.1513 (+0.88z)| lr 3.71e-04 | 4184.80 ms | 32.3% bf16 MFU | 125890 tok/s step 465/19560 | loss 5.588634 (-0.98z)| norm 0.7104 (-1.70z)| lr 3.72e-04 | 4164.01 ms | 32.4% bf16 MFU | 125891 tok/s step 466/19560 | loss 5.514344 (-1.58z)| norm 0.7186 (-1.62z)| lr 3.73e-04 | 4152.67 ms | 32.5% bf16 
MFU | 125909 tok/s step 467/19560 | loss 5.566411 (-1.13z)| norm 1.0148 (+0.10z)| lr 3.74e-04 | 4168.18 ms | 32.4% bf16 MFU | 125903 tok/s step 468/19560 | loss 5.547062 (-1.27z)| norm 1.0772 (+0.46z)| lr 3.74e-04 | 4151.84 ms | 32.5% bf16 MFU | 125922 tok/s step 469/19560 | loss 5.495914 (-1.67z)| norm 0.8545 (-0.84z)| lr 3.75e-04 | 4166.54 ms | 32.4% bf16 MFU | 125917 tok/s step 470/19560 | loss 5.479702 (-1.77z)| norm 0.9585 (-0.23z)| lr 3.76e-04 | 4154.51 ms | 32.5% bf16 MFU | 125931 tok/s step 471/19560 | loss 5.498496 (-1.59z)| norm 1.1530 (+0.89z)| lr 3.77e-04 | 4151.01 ms | 32.5% bf16 MFU | 125950 tok/s step 472/19560 | loss 5.520200 (-1.39z)| norm 1.0664 (+0.38z)| lr 3.78e-04 | 4158.98 ms | 32.5% bf16 MFU | 125956 tok/s step 473/19560 | loss 5.454091 (-1.90z)| norm 0.9260 (-0.43z)| lr 3.78e-04 | 4155.80 ms | 32.5% bf16 MFU | 125966 tok/s step 474/19560 | loss 5.505556 (-1.45z)| norm 0.8110 (-1.09z)| lr 3.79e-04 | 4162.90 ms | 32.4% bf16 MFU | 125964 tok/s step 475/19560 | loss 5.510126 (-1.39z)| norm 0.8843 (-0.66z)| lr 3.80e-04 | 4166.17 ms | 32.4% bf16 MFU | 125958 tok/s step 476/19560 | loss 5.489036 (-1.55z)| norm 0.8311 (-0.97z)| lr 3.81e-04 | 4205.47 ms | 32.1% bf16 MFU | 125894 tok/s step 477/19560 | loss 5.473602 (-1.65z)| norm 0.9502 (-0.27z)| lr 3.82e-04 | 4147.86 ms | 32.6% bf16 MFU | 125919 tok/s step 478/19560 | loss 5.476744 (-1.60z)| norm 1.0829 (+0.49z)| lr 3.82e-04 | 4158.73 ms | 32.5% bf16 MFU | 125927 tok/s step 479/19560 | loss 5.446274 (-1.82z)| norm 0.8504 (-0.85z)| lr 3.83e-04 | 4155.61 ms | 32.5% bf16 MFU | 125939 tok/s step 480/19560 | loss 5.467449 (-1.62z)| norm 0.9305 (-0.37z)| lr 3.84e-04 | 4148.78 ms | 32.5% bf16 MFU | 125960 tok/s step 481/19560 | loss 5.530330 (-1.08z)| norm 1.1661 (+1.00z)| lr 3.85e-04 | 4153.94 ms | 32.5% bf16 MFU | 125973 tok/s step 482/19560 | loss 5.480200 (-1.48z)| norm 1.0604 (+0.37z)| lr 3.86e-04 | 4163.67 ms | 32.4% bf16 MFU | 125970 tok/s step 483/19560 | loss 5.512932 (-1.19z)| norm 0.9421 
(-0.31z)| lr 3.86e-04 | 4162.06 ms | 32.4% bf16 MFU | 125970 tok/s step 484/19560 | loss 5.643858 (-0.07z)| norm 1.0849 (+0.56z)| lr 3.87e-04 | 4148.55 ms | 32.5% bf16 MFU | 125991 tok/s step 485/19560 | loss 5.433805 (-1.83z)| norm 1.0316 (+0.24z)| lr 3.88e-04 | 4163.20 ms | 32.4% bf16 MFU | 125988 tok/s step 486/19560 | loss 5.461987 (-1.56z)| norm 1.0461 (+0.33z)| lr 3.89e-04 | 4162.28 ms | 32.4% bf16 MFU | 125986 tok/s step 487/19560 | loss 5.423137 (-1.86z)| norm 1.0622 (+0.42z)| lr 3.90e-04 | 4241.55 ms | 31.8% bf16 MFU | 125868 tok/s step 488/19560 | loss 5.448490 (-1.62z)| norm 1.2195 (+1.38z)| lr 3.90e-04 | 5050.39 ms | 26.7% bf16 MFU | 124765 tok/s step 489/19560 | loss 5.356545 (-2.34z)| norm 1.1275 (+0.81z)| lr 3.91e-04 | 4551.89 ms | 29.7% bf16 MFU | 124286 tok/s step 490/19560 | loss 5.383628 (-2.08z)| norm 1.1597 (+0.99z)| lr 3.92e-04 | 4226.83 ms | 31.9% bf16 MFU | 124273 tok/s step 491/19560 | loss 5.441872 (-1.57z)| norm 1.1107 (+0.69z)| lr 3.93e-04 | 4426.23 ms | 30.5% bf16 MFU | 123982 tok/s step 492/19560 | loss 5.404675 (-1.84z)| norm 1.0777 (+0.49z)| lr 3.94e-04 | 4396.74 ms | 30.7% bf16 MFU | 123745 tok/s step 493/19560 | loss 5.402283 (-1.83z)| norm 0.9315 (-0.39z)| lr 3.94e-04 | 4244.65 ms | 31.8% bf16 MFU | 123734 tok/s step 494/19560 | loss 5.393301 (-1.87z)| norm 1.2214 (+1.34z)| lr 3.95e-04 | 4212.52 ms | 32.1% bf16 MFU | 123770 tok/s step 495/19560 | loss 5.368796 (-2.03z)| norm 0.8380 (-0.96z)| lr 3.96e-04 | 4202.27 ms | 32.1% bf16 MFU | 123820 tok/s step 496/19560 | loss 5.419535 (-1.60z)| norm 0.9752 (-0.13z)| lr 3.97e-04 | 4308.95 ms | 31.3% bf16 MFU | 123712 tok/s step 497/19560 | loss 5.373464 (-1.94z)| norm 1.1071 (+0.65z)| lr 3.98e-04 | 4158.11 ms | 32.5% bf16 MFU | 123831 tok/s step 498/19560 | loss 5.443227 (-1.34z)| norm 0.9136 (-0.51z)| lr 3.98e-04 | 4230.50 ms | 31.9% bf16 MFU | 123836 tok/s step 499/19560 | loss 5.399869 (-1.66z)| norm 1.1117 (+0.67z)| lr 3.99e-04 | 4189.55 ms | 32.2% bf16 MFU | 123901 tok/s step 
500/19560 | loss 5.431656 (-1.38z)| norm 1.0069 (+0.03z)| lr 4.00e-04 | 4164.48 ms | 32.4% bf16 MFU | 124001 tok/s val loss 5.458455
HellaSwag: 2433/10042 = 0.242282 step 501/19560 | loss 5.447559 (-1.24z)| norm 0.9788 (-0.14z)| lr 4.01e-04 | 4219.15 ms | 32.0% bf16 MFU | 124014 tok/s step 502/19560 | loss 5.422552 (-1.42z)| norm 0.8901 (-0.67z)| lr 4.02e-04 | 4292.38 ms | 31.5% bf16 MFU | 123921 tok/s step 503/19560 | loss 5.382331 (-1.72z)| norm 0.8277 (-1.05z)| lr 4.02e-04 | 4185.10 ms | 32.3% bf16 MFU | 123988 tok/s step 504/19560 | loss 5.459687 (-1.07z)| norm 0.8647 (-0.81z)| lr 4.03e-04 | 4155.38 ms | 32.5% bf16 MFU | 124098 tok/s step 505/19560 | loss 5.352477 (-1.90z)| norm 0.8353 (-0.98z)| lr 4.04e-04 | 4159.13 ms | 32.5% bf16 MFU | 124196 tok/s step 506/19560 | loss 5.406460 (-1.45z)| norm 0.7436 (-1.51z)| lr 4.05e-04 | 4160.29 ms | 32.5% bf16 MFU | 124287 tok/s step 507/19560 | loss 5.357667 (-1.84z)| norm 0.8131 (-1.11z)| lr 4.06e-04 | 4156.63 ms | 32.5% bf16 MFU | 124379 tok/s step 508/19560 | loss 5.366881 (-1.74z)| norm 0.9120 (-0.49z)| lr 4.06e-04 | 4153.55 ms | 32.5% bf16 MFU | 124472 tok/s step 509/19560 | loss 5.327592 (-2.02z)| norm 0.9220 (-0.43z)| lr 4.07e-04 | 4252.20 ms | 31.8% bf16 MFU | 124413 tok/s step 510/19560 | loss 5.269790 (-2.43z)| norm 0.8898 (-0.62z)| lr 4.08e-04 | 4158.68 ms | 32.5% bf16 MFU | 124496 tok/s step 511/19560 | loss 5.339281 (-1.84z)| norm 0.9651 (-0.13z)| lr 4.09e-04 | 4219.23 ms | 32.0% bf16 MFU | 124484 tok/s step 512/19560 | loss 5.383578 (-1.46z)| norm 0.8412 (-0.92z)| lr 4.10e-04 | 4156.18 ms | 32.5% bf16 MFU | 124567 tok/s step 513/19560 | loss 5.386390 (-1.42z)| norm 0.7764 (-1.32z)| lr 4.10e-04 | 4149.96 ms | 32.5% bf16 MFU | 124656 tok/s step 514/19560 | loss 5.389949 (-1.37z)| norm 0.9311 (-0.33z)| lr 4.11e-04 | 4149.90 ms | 32.5% bf16 MFU | 124740 tok/s step 515/19560 | loss 5.362449 (-1.57z)| norm 0.8709 (-0.73z)| lr 4.12e-04 | 4232.24 ms | 31.9% bf16 MFU | 124697 tok/s step 516/19560 | loss 5.337644 (-1.74z)| norm 0.7536 (-1.49z)| lr 4.13e-04 | 4154.60 ms | 32.5% bf16
MFU | 124772 tok/s step 517/19560 | loss 5.338497 (-1.70z)| norm 0.7002 (-1.81z)| lr 4.14e-04 | 4155.65 ms | 32.5% bf16 MFU | 124841 tok/s step 518/19560 | loss 5.272606 (-2.19z)| norm 0.7737 (-1.32z)| lr 4.14e-04 | 4208.13 ms | 32.1% bf16 MFU | 124829 tok/s step 519/19560 | loss 5.253887 (-2.28z)| norm 0.7518 (-1.45z)| lr 4.15e-04 | 4175.31 ms | 32.3% bf16 MFU | 124866 tok/s step 520/19560 | loss 5.350475 (-1.49z)| norm 0.8311 (-0.91z)| lr 4.16e-04 | 4163.34 ms | 32.4% bf16 MFU | 124919 tok/s step 521/19560 | loss 5.291549 (-1.92z)| norm 1.1242 (+1.00z)| lr 4.17e-04 | 4154.78 ms | 32.5% bf16 MFU | 124982 tok/s step 522/19560 | loss 5.313787 (-1.72z)| norm 1.1932 (+1.44z)| lr 4.18e-04 | 4196.93 ms | 32.2% bf16 MFU | 124979 tok/s step 523/19560 | loss 5.399389 (-1.03z)| norm 1.1496 (+1.17z)| lr 4.18e-04 | 4155.32 ms | 32.5% bf16 MFU | 125039 tok/s step 524/19560 | loss 5.337441 (-1.50z)| norm 1.0215 (+0.33z)| lr 4.19e-04 | 4152.86 ms | 32.5% bf16 MFU | 125099 tok/s step 525/19560 | loss 5.308125 (-1.70z)| norm 0.8532 (-0.77z)| lr 4.20e-04 | 4159.69 ms | 32.5% bf16 MFU | 125146 tok/s step 526/19560 | loss 5.293043 (-1.80z)| norm 0.9256 (-0.29z)| lr 4.21e-04 | 4237.99 ms | 31.9% bf16 MFU | 125075 tok/s step 527/19560 | loss 5.298958 (-1.74z)| norm 0.8808 (-0.58z)| lr 4.22e-04 | 4155.68 ms | 32.5% bf16 MFU | 125129 tok/s step 528/19560 | loss 5.332407 (-1.45z)| norm 0.7903 (-1.20z)| lr 4.22e-04 | 4150.67 ms | 32.5% bf16 MFU | 125188 tok/s step 529/19560 | loss 5.283911 (-1.81z)| norm 0.7958 (-1.17z)| lr 4.23e-04 | 4162.75 ms | 32.4% bf16 MFU | 125226 tok/s step 530/19560 | loss 5.287255 (-1.74z)| norm 0.8415 (-0.86z)| lr 4.24e-04 | 4263.85 ms | 31.7% bf16 MFU | 125113 tok/s step 531/19560 | loss 5.225375 (-2.18z)| norm 0.9044 (-0.42z)| lr 4.25e-04 | 4299.26 ms | 31.4% bf16 MFU | 124955 tok/s step 532/19560 | loss 5.268886 (-1.81z)| norm 0.9198 (-0.30z)| lr 4.26e-04 | 4168.32 ms | 32.4% bf16 MFU | 124996 tok/s step 533/19560 | loss 5.307319 (-1.49z)| norm 0.9732 
(+0.09z)| lr 4.26e-04 | 4279.40 ms | 31.6% bf16 MFU | 124872 tok/s step 534/19560 | loss 5.271500 (-1.75z)| norm 0.8757 (-0.59z)| lr 4.27e-04 | 4150.91 ms | 32.5% bf16 MFU | 124944 tok/s step 535/19560 | loss 5.271821 (-1.72z)| norm 0.8220 (-0.96z)| lr 4.28e-04 | 4162.60 ms | 32.4% bf16 MFU | 124994 tok/s step 536/19560 | loss 5.256241 (-1.81z)| norm 0.7093 (-1.74z)| lr 4.29e-04 | 4177.50 ms | 32.3% bf16 MFU | 125019 tok/s step 537/19560 | loss 5.321399 (-1.28z)| norm 0.7114 (-1.71z)| lr 4.30e-04 | 4153.50 ms | 32.5% bf16 MFU | 125080 tok/s step 538/19560 | loss 5.301356 (-1.42z)| norm 0.6892 (-1.86z)| lr 4.30e-04 | 4273.07 ms | 31.6% bf16 MFU | 124961 tok/s step 539/19560 | loss 5.245135 (-1.84z)| norm 0.8403 (-0.80z)| lr 4.31e-04 | 4147.55 ms | 32.6% bf16 MFU | 125033 tok/s step 540/19560 | loss 5.285217 (-1.50z)| norm 1.0094 (+0.39z)| lr 4.32e-04 | 4249.25 ms | 31.8% bf16 MFU | 124951 tok/s step 541/19560 | loss 5.293919 (-1.41z)| norm 1.0855 (+0.92z)| lr 4.33e-04 | 4151.00 ms | 32.5% bf16 MFU | 125018 tok/s step 542/19560 | loss 5.324265 (-1.15z)| norm 1.2106 (+1.77z)| lr 4.34e-04 | 4421.07 ms | 30.5% bf16 MFU | 124697 tok/s step 543/19560 | loss 5.217544 (-1.99z)| norm 0.9113 (-0.30z)| lr 4.34e-04 | 4149.51 ms | 32.5% bf16 MFU | 124779 tok/s step 544/19560 | loss 5.267195 (-1.56z)| norm 0.7438 (-1.46z)| lr 4.35e-04 | 4149.39 ms | 32.5% bf16 MFU | 124858 tok/s step 545/19560 | loss 5.275258 (-1.47z)| norm 0.6579 (-2.00z)| lr 4.36e-04 | 4234.81 ms | 31.9% bf16 MFU | 124805 tok/s step 546/19560 | loss 5.218865 (-1.89z)| norm 0.6839 (-1.80z)| lr 4.37e-04 | 4153.83 ms | 32.5% bf16 MFU | 124876 tok/s step 547/19560 | loss 5.178976 (-2.16z)| norm 0.7060 (-1.63z)| lr 4.38e-04 | 4164.21 ms | 32.4% bf16 MFU | 124927 tok/s step 548/19560 | loss 5.261782 (-1.48z)| norm 0.8947 (-0.37z)| lr 4.38e-04 | 4152.17 ms | 32.5% bf16 MFU | 124994 tok/s step 549/19560 | loss 5.258509 (-1.49z)| norm 1.0079 (+0.42z)| lr 4.39e-04 | 4156.67 ms | 32.5% bf16 MFU | 125051 tok/s step 
550/19560 | loss 5.205052 (-1.88z)| norm 1.0239 (+0.54z)| lr 4.40e-04 | 4150.48 ms | 32.5% bf16 MFU | 125115 tok/s step 551/19560 | loss 5.194716 (-1.91z)| norm 1.1769 (+1.58z)| lr 4.41e-04 | 4154.12 ms | 32.5% bf16 MFU | 125169 tok/s step 552/19560 | loss 5.275570 (-1.26z)| norm 1.0276 (+0.55z)| lr 4.42e-04 | 4165.91 ms | 32.4% bf16 MFU | 125204 tok/s step 553/19560 | loss 5.338021 (-0.75z)| norm 1.0834 (+0.93z)| lr 4.42e-04 | 4164.88 ms | 32.4% bf16 MFU | 125238 tok/s step 554/19560 | loss 5.247628 (-1.46z)| norm 0.9072 (-0.27z)| lr 4.43e-04 | 4153.65 ms | 32.5% bf16 MFU | 125287 tok/s step 555/19560 | loss 5.194890 (-1.85z)| norm 1.1142 (+1.15z)| lr 4.44e-04 | 4206.04 ms | 32.1% bf16 MFU | 125255 tok/s step 556/19560 | loss 5.281323 (-1.14z)| norm 0.6959 (-1.69z)| lr 4.45e-04 | 4150.30 ms | 32.5% bf16 MFU | 125309 tok/s step 557/19560 | loss 5.290802 (-1.05z)| norm 0.7593 (-1.24z)| lr 4.46e-04 | 4158.40 ms | 32.5% bf16 MFU | 125347 tok/s step 558/19560 | loss 5.268219 (-1.21z)| norm 0.8183 (-0.83z)| lr 4.46e-04 | 4152.27 ms | 32.5% bf16 MFU | 125393 tok/s step 559/19560 | loss 5.308938 (-0.87z)| norm 0.9579 (+0.12z)| lr 4.47e-04 | 4163.31 ms | 32.4% bf16 MFU | 125420 tok/s step 560/19560 | loss 5.217582 (-1.60z)| norm 0.8655 (-0.50z)| lr 4.48e-04 | 4157.95 ms | 32.5% bf16 MFU | 125454 tok/s step 561/19560 | loss 5.274250 (-1.12z)| norm 0.8724 (-0.44z)| lr 4.49e-04 | 4176.68 ms | 32.3% bf16 MFU | 125457 tok/s step 562/19560 | loss 5.299081 (-0.90z)| norm 0.9833 (+0.33z)| lr 4.50e-04 | 4153.62 ms | 32.5% bf16 MFU | 125496 tok/s step 563/19560 | loss 5.241518 (-1.36z)| norm 0.9690 (+0.24z)| lr 4.50e-04 | 4177.41 ms | 32.3% bf16 MFU | 125496 tok/s step 564/19560 | loss 5.193925 (-1.73z)| norm 0.9468 (+0.08z)| lr 4.51e-04 | 4188.66 ms | 32.2% bf16 MFU | 125480 tok/s step 565/19560 | loss 5.268488 (-1.10z)| norm 0.8771 (-0.40z)| lr 4.52e-04 | 4151.91 ms | 32.5% bf16 MFU | 125520 tok/s step 566/19560 | loss 5.169794 (-1.90z)| norm 0.8708 (-0.45z)| lr 4.53e-04 | 4162.31 
ms | 32.4% bf16 MFU | 125542 tok/s step 567/19560 | loss 5.202047 (-1.60z)| norm 1.0913 (+1.07z)| lr 4.54e-04 | 4361.41 ms | 31.0% bf16 MFU | 125275 tok/s step 568/19560 | loss 5.259467 (-1.10z)| norm 1.0497 (+0.79z)| lr 4.54e-04 | 4146.50 ms | 32.6% bf16 MFU | 125333 tok/s step 569/19560 | loss 5.171307 (-1.82z)| norm 0.8161 (-0.84z)| lr 4.55e-04 | 4153.98 ms | 32.5% bf16 MFU | 125377 tok/s step 570/19560 | loss 5.200739 (-1.54z)| norm 0.7889 (-1.01z)| lr 4.56e-04 | 4147.56 ms | 32.6% bf16 MFU | 125429 tok/s step 571/19560 | loss 5.276581 (-0.89z)| norm 0.8685 (-0.46z)| lr 4.57e-04 | 4161.03 ms | 32.4% bf16 MFU | 125457 tok/s step 572/19560 | loss 5.171286 (-1.76z)| norm 0.7624 (-1.17z)| lr 4.58e-04 | 4172.50 ms | 32.4% bf16 MFU | 125467 tok/s step 573/19560 | loss 5.124434 (-2.11z)| norm 0.7290 (-1.39z)| lr 4.58e-04 | 4147.84 ms | 32.6% bf16 MFU | 125514 tok/s step 574/19560 | loss 5.181648 (-1.60z)| norm 0.8649 (-0.46z)| lr 4.59e-04 | 4152.47 ms | 32.5% bf16 MFU | 125551 tok/s step 575/19560 | loss 5.169766 (-1.67z)| norm 0.7845 (-1.02z)| lr 4.60e-04 | 4151.62 ms | 32.5% bf16 MFU | 125588 tok/s step 576/19560 | loss 5.157884 (-1.74z)| norm 0.6032 (-2.26z)| lr 4.61e-04 | 4167.16 ms | 32.4% bf16 MFU | 125599 tok/s step 577/19560 | loss 5.151235 (-1.76z)| norm 0.5969 (-2.30z)| lr 4.62e-04 | 4149.70 ms | 32.5% bf16 MFU | 125636 tok/s step 578/19560 | loss 5.206903 (-1.28z)| norm 0.5682 (-2.44z)| lr 4.62e-04 | 4189.77 ms | 32.2% bf16 MFU | 125611 tok/s step 579/19560 | loss 5.109899 (-2.04z)| norm 0.6204 (-2.03z)| lr 4.63e-04 | 4150.38 ms | 32.5% bf16 MFU | 125647 tok/s step 580/19560 | loss 5.117043 (-1.94z)| norm 0.8480 (-0.51z)| lr 4.64e-04 | 4149.76 ms | 32.5% bf16 MFU | 125682 tok/s step 581/19560 | loss 5.175622 (-1.44z)| norm 0.9441 (+0.14z)| lr 4.65e-04 | 4185.05 ms | 32.3% bf16 MFU | 125661 tok/s step 582/19560 | loss 5.120954 (-1.85z)| norm 1.0026 (+0.53z)| lr 4.66e-04 | 4177.17 ms | 32.3% bf16 MFU | 125654 tok/s step 583/19560 | loss 5.205832 (-1.15z)| 
norm 0.7969 (-0.83z)| lr 4.66e-04 | 4166.33 ms | 32.4% bf16 MFU | 125663 tok/s step 584/19560 | loss 5.173118 (-1.40z)| norm 0.6402 (-1.84z)| lr 4.67e-04 | 4149.69 ms | 32.5% bf16 MFU | 125697 tok/s step 585/19560 | loss 5.118388 (-1.82z)| norm 0.7232 (-1.28z)| lr 4.68e-04 | 4149.13 ms | 32.5% bf16 MFU | 125730 tok/s step 586/19560 | loss 5.181590 (-1.28z)| norm 0.8998 (-0.10z)| lr 4.69e-04 | 4161.60 ms | 32.4% bf16 MFU | 125743 tok/s step 587/19560 | loss 5.153596 (-1.48z)| norm 0.9460 (+0.21z)| lr 4.70e-04 | 4145.36 ms | 32.6% bf16 MFU | 125780 tok/s step 588/19560 | loss 5.153595 (-1.46z)| norm 0.9528 (+0.26z)| lr 4.70e-04 | 4235.62 ms | 31.9% bf16 MFU | 125680 tok/s step 589/19560 | loss 5.177920 (-1.24z)| norm 1.0844 (+1.13z)| lr 4.71e-04 | 4160.28 ms | 32.5% bf16 MFU | 125697 tok/s step 590/19560 | loss 5.168429 (-1.31z)| norm 0.9491 (+0.24z)| lr 4.72e-04 | 4181.87 ms | 32.3% bf16 MFU | 125681 tok/s step 591/19560 | loss 5.131578 (-1.59z)| norm 1.0973 (+1.22z)| lr 4.73e-04 | 4150.81 ms | 32.5% bf16 MFU | 125712 tok/s step 592/19560 | loss 5.155410 (-1.37z)| norm 0.8256 (-0.58z)| lr 4.74e-04 | 4166.28 ms | 32.4% bf16 MFU | 125718 tok/s step 593/19560 | loss 5.145681 (-1.44z)| norm 0.8522 (-0.41z)| lr 4.74e-04 | 4154.59 ms | 32.5% bf16 MFU | 125742 tok/s step 594/19560 | loss 5.154209 (-1.35z)| norm 0.8071 (-0.72z)| lr 4.75e-04 | 4154.53 ms | 32.5% bf16 MFU | 125765 tok/s step 595/19560 | loss 5.133386 (-1.51z)| norm 0.8179 (-0.64z)| lr 4.76e-04 | 4165.05 ms | 32.4% bf16 MFU | 125771 tok/s step 596/19560 | loss 5.066788 (-2.05z)| norm 0.8231 (-0.59z)| lr 4.77e-04 | 4148.68 ms | 32.5% bf16 MFU | 125801 tok/s step 597/19560 | loss 5.116151 (-1.60z)| norm 0.8057 (-0.70z)| lr 4.78e-04 | 4146.98 ms | 32.6% bf16 MFU | 125832 tok/s step 598/19560 | loss 5.117731 (-1.56z)| norm 0.7246 (-1.24z)| lr 4.78e-04 | 4155.70 ms | 32.5% bf16 MFU | 125849 tok/s step 599/19560 | loss 5.093991 (-1.74z)| norm 0.6919 (-1.44z)| lr 4.79e-04 | 4157.82 ms | 32.5% bf16 MFU | 125861 tok/s 
step 600/19560 | loss 5.131043 (-1.40z)| norm 0.7795 (-0.83z)| lr 4.80e-04 | 4155.87 ms | 32.5% bf16 MFU | 125876 tok/s step 601/19560 | loss 5.099367 (-1.65z)| norm 0.7166 (-1.24z)| lr 4.81e-04 | 4167.36 ms | 32.4% bf16 MFU | 125872 tok/s step 602/19560 | loss 5.090969 (-1.69z)| norm 0.6420 (-1.72z)| lr 4.82e-04 | 4187.32 ms | 32.2% bf16 MFU | 125839 tok/s step 603/19560 | loss 5.088835 (-1.69z)| norm 0.7247 (-1.15z)| lr 4.82e-04 | 4150.36 ms | 32.5% bf16 MFU | 125863 tok/s step 604/19560 | loss 5.079950 (-1.74z)| norm 0.7424 (-1.02z)| lr 4.83e-04 | 4161.83 ms | 32.4% bf16 MFU | 125869 tok/s step 605/19560 | loss 5.073762 (-1.76z)| norm 0.6467 (-1.63z)| lr 4.84e-04 | 4153.73 ms | 32.5% bf16 MFU | 125887 tok/s step 606/19560 | loss 5.084582 (-1.64z)| norm 0.6654 (-1.48z)| lr 4.85e-04 | 4163.18 ms | 32.4% bf16 MFU | 125889 tok/s step 607/19560 | loss 5.001378 (-2.30z)| norm 0.6329 (-1.66z)| lr 4.86e-04 | 4153.70 ms | 32.5% bf16 MFU | 125906 tok/s step 608/19560 | loss 5.071647 (-1.68z)| norm 0.6779 (-1.35z)| lr 4.86e-04 | 4159.42 ms | 32.5% bf16 MFU | 125913 tok/s step 609/19560 | loss 5.090419 (-1.50z)| norm 0.8198 (-0.42z)| lr 4.87e-04 | 4247.97 ms | 31.8% bf16 MFU | 125788 tok/s step 610/19560 | loss 5.053893 (-1.79z)| norm 0.8908 (+0.05z)| lr 4.88e-04 | 4148.70 ms | 32.5% bf16 MFU | 125817 tok/s step 611/19560 | loss 5.035499 (-1.93z)| norm 0.9270 (+0.29z)| lr 4.89e-04 | 4150.45 ms | 32.5% bf16 MFU | 125843 tok/s step 612/19560 | loss 5.069952 (-1.65z)| norm 1.0274 (+0.95z)| lr 4.90e-04 | 4151.12 ms | 32.5% bf16 MFU | 125866 tok/s step 613/19560 | loss 5.047873 (-1.81z)| norm 1.0546 (+1.13z)| lr 4.90e-04 | 4204.25 ms | 32.1% bf16 MFU | 125807 tok/s step 614/19560 | loss 5.056558 (-1.71z)| norm 1.0094 (+0.84z)| lr 4.91e-04 | 4153.94 ms | 32.5% bf16 MFU | 125828 tok/s step 615/19560 | loss 5.129561 (-1.04z)| norm 0.9334 (+0.34z)| lr 4.92e-04 | 4148.79 ms | 32.5% bf16 MFU | 125855 tok/s step 616/19560 | loss 5.129975 (-1.02z)| norm 0.8278 (-0.34z)| lr 4.93e-04 | 
4148.32 ms | 32.5% bf16 MFU | 125882 tok/s step 617/19560 | loss 5.081786 (-1.44z)| norm 0.7768 (-0.67z)| lr 4.94e-04 | 4147.69 ms | 32.6% bf16 MFU | 125908 tok/s step 618/19560 | loss 5.064498 (-1.57z)| norm 0.7834 (-0.62z)| lr 4.94e-04 | 4166.64 ms | 32.4% bf16 MFU | 125904 tok/s step 619/19560 | loss 5.009770 (-2.02z)| norm 0.8078 (-0.44z)| lr 4.95e-04 | 4152.02 ms | 32.5% bf16 MFU | 125922 tok/s step 620/19560 | loss 5.132806 (-0.90z)| norm 0.6335 (-1.63z)| lr 4.96e-04 | 4191.03 ms | 32.2% bf16 MFU | 125881 tok/s step 621/19560 | loss 4.990609 (-2.14z)| norm 0.5972 (-1.84z)| lr 4.97e-04 | 4156.09 ms | 32.5% bf16 MFU | 125894 tok/s step 622/19560 | loss 5.041188 (-1.66z)| norm 0.5540 (-2.11z)| lr 4.98e-04 | 4146.63 ms | 32.6% bf16 MFU | 125922 tok/s step 623/19560 | loss 5.041911 (-1.62z)| norm 0.5735 (-1.93z)| lr 4.98e-04 | 4146.34 ms | 32.6% bf16 MFU | 125948 tok/s step 624/19560 | loss 5.037806 (-1.64z)| norm 0.6148 (-1.62z)| lr 4.99e-04 | 4162.49 ms | 32.4% bf16 MFU | 125948 tok/s step 625/19560 | loss 5.046964 (-1.53z)| norm 0.6870 (-1.12z)| lr 5.00e-04 | 4148.54 ms | 32.5% bf16 MFU | 125970 tok/s step 626/19560 | loss 5.049431 (-1.49z)| norm 0.7750 (-0.51z)| lr 5.01e-04 | 4160.81 ms | 32.4% bf16 MFU | 125972 tok/s step 627/19560 | loss 5.049375 (-1.47z)| norm 0.7729 (-0.51z)| lr 5.02e-04 | 4265.70 ms | 31.7% bf16 MFU | 125818 tok/s step 628/19560 | loss 5.036025 (-1.57z)| norm 0.9537 (+0.74z)| lr 5.02e-04 | 4147.10 ms | 32.6% bf16 MFU | 125849 tok/s step 629/19560 | loss 5.083213 (-1.13z)| norm 1.2677 (+2.81z)| lr 5.03e-04 | 4147.06 ms | 32.6% bf16 MFU | 125877 tok/s step 630/19560 | loss 5.071023 (-1.23z)| norm 1.0710 (+1.47z)| lr 5.04e-04 | 4159.80 ms | 32.5% bf16 MFU | 125885 tok/s step 631/19560 | loss 5.063769 (-1.28z)| norm 0.8934 (+0.28z)| lr 5.05e-04 | 4146.82 ms | 32.6% bf16 MFU | 125913 tok/s step 632/19560 | loss 5.025205 (-1.63z)| norm 0.7700 (-0.53z)| lr 5.06e-04 | 4164.93 ms | 32.4% bf16 MFU | 125911 tok/s step 633/19560 | loss 5.045343 
(-1.42z)| norm 0.7527 (-0.64z)| lr 5.06e-04 | 4160.50 ms | 32.5% bf16 MFU | 125916 tok/s step 634/19560 | loss 5.089947 (-0.99z)| norm 0.8164 (-0.22z)| lr 5.07e-04 | 4145.35 ms | 32.6% bf16 MFU | 125944 tok/s step 635/19560 | loss 5.054217 (-1.31z)| norm 0.7041 (-0.96z)| lr 5.08e-04 | 4161.88 ms | 32.4% bf16 MFU | 125946 tok/s step 636/19560 | loss 5.034381 (-1.48z)| norm 0.7480 (-0.66z)| lr 5.09e-04 | 4159.91 ms | 32.5% bf16 MFU | 125950 tok/s step 637/19560 | loss 5.014096 (-1.64z)| norm 0.6831 (-1.08z)| lr 5.10e-04 | 4151.66 ms | 32.5% bf16 MFU | 125967 tok/s step 638/19560 | loss 5.105497 (-0.76z)| norm 0.5566 (-1.87z)| lr 5.10e-04 | 4147.66 ms | 32.6% bf16 MFU | 125989 tok/s step 639/19560 | loss 5.026446 (-1.49z)| norm 0.7081 (-0.87z)| lr 5.11e-04 | 4254.36 ms | 31.7% bf16 MFU | 125851 tok/s step 640/19560 | loss 5.059443 (-1.16z)| norm 0.8307 (-0.07z)| lr 5.12e-04 | 4145.94 ms | 32.6% bf16 MFU | 125881 tok/s step 641/19560 | loss 5.030376 (-1.42z)| norm 0.9124 (+0.46z)| lr 5.13e-04 | 4168.41 ms | 32.4% bf16 MFU | 125876 tok/s step 642/19560 | loss 5.042162 (-1.30z)| norm 0.8879 (+0.30z)| lr 5.14e-04 | 4173.55 ms | 32.4% bf16 MFU | 125864 tok/s step 643/19560 | loss 5.089275 (-0.82z)| norm 0.9035 (+0.40z)| lr 5.14e-04 | 4145.39 ms | 32.6% bf16 MFU | 125894 tok/s step 644/19560 | loss 5.040041 (-1.29z)| norm 0.8424 (-0.00z)| lr 5.15e-04 | 4148.97 ms | 32.5% bf16 MFU | 125918 tok/s step 645/19560 | loss 5.018931 (-1.48z)| norm 0.7079 (-0.88z)| lr 5.16e-04 | 4144.72 ms | 32.6% bf16 MFU | 125947 tok/s step 646/19560 | loss 5.035751 (-1.29z)| norm 0.6639 (-1.16z)| lr 5.17e-04 | 4181.31 ms | 32.3% bf16 MFU | 125919 tok/s step 647/19560 | loss 4.993684 (-1.68z)| norm 0.7363 (-0.69z)| lr 5.18e-04 | 4168.70 ms | 32.4% bf16 MFU | 125911 tok/s step 648/19560 | loss 4.985196 (-1.74z)| norm 0.8178 (-0.16z)| lr 5.18e-04 | 4149.49 ms | 32.5% bf16 MFU | 125933 tok/s step 649/19560 | loss 4.999304 (-1.57z)| norm 0.8381 (-0.01z)| lr 5.19e-04 | 4158.94 ms | 32.5% bf16 MFU | 
125940 tok/s step 650/19560 | loss 4.946501 (-2.05z)| norm 0.9608 (+0.82z)| lr 5.20e-04 | 4161.46 ms | 32.4% bf16 MFU | 125942 tok/s step 651/19560 | loss 4.998567 (-1.53z)| norm 0.9083 (+0.49z)| lr 5.21e-04 | 4147.00 ms | 32.6% bf16 MFU | 125966 tok/s step 652/19560 | loss 5.004243 (-1.45z)| norm 0.7739 (-0.41z)| lr 5.22e-04 | 4163.67 ms | 32.4% bf16 MFU | 125964 tok/s step 653/19560 | loss 5.019742 (-1.28z)| norm 0.6484 (-1.25z)| lr 5.22e-04 | 4161.76 ms | 32.4% bf16 MFU | 125964 tok/s step 654/19560 | loss 4.941238 (-2.02z)| norm 0.6351 (-1.32z)| lr 5.23e-04 | 4144.66 ms | 32.6% bf16 MFU | 125991 tok/s step 655/19560 | loss 4.986800 (-1.55z)| norm 0.5945 (-1.57z)| lr 5.24e-04 | 4155.68 ms | 32.5% bf16 MFU | 126000 tok/s step 656/19560 | loss 5.016012 (-1.24z)| norm 0.6361 (-1.27z)| lr 5.25e-04 | 4146.43 ms | 32.6% bf16 MFU | 126022 tok/s step 657/19560 | loss 4.974348 (-1.63z)| norm 0.7538 (-0.48z)| lr 5.26e-04 | 4160.88 ms | 32.4% bf16 MFU | 126021 tok/s step 658/19560 | loss 4.952614 (-1.81z)| norm 0.8185 (-0.05z)| lr 5.26e-04 | 4151.92 ms | 32.5% bf16 MFU | 126034 tok/s step 659/19560 | loss 5.016888 (-1.15z)| norm 0.9212 (+0.63z)| lr 5.27e-04 | 4164.56 ms | 32.4% bf16 MFU | 126027 tok/s step 660/19560 | loss 4.987505 (-1.42z)| norm 0.9215 (+0.63z)| lr 5.28e-04 | 4148.88 ms | 32.5% bf16 MFU | 126044 tok/s step 661/19560 | loss 4.978813 (-1.49z)| norm 0.7649 (-0.40z)| lr 5.29e-04 | 4155.95 ms | 32.5% bf16 MFU | 126049 tok/s step 662/19560 | loss 4.955841 (-1.69z)| norm 0.8428 (+0.12z)| lr 5.30e-04 | 4159.04 ms | 32.5% bf16 MFU | 126050 tok/s step 663/19560 | loss 4.957823 (-1.64z)| norm 0.9086 (+0.56z)| lr 5.30e-04 | 4153.39 ms | 32.5% bf16 MFU | 126059 tok/s step 664/19560 | loss 4.955750 (-1.63z)| norm 0.9402 (+0.76z)| lr 5.31e-04 | 4208.07 ms | 32.1% bf16 MFU | 125985 tok/s step 665/19560 | loss 4.940260 (-1.76z)| norm 0.9393 (+0.74z)| lr 5.32e-04 | 4145.08 ms | 32.6% bf16 MFU | 126010 tok/s step 666/19560 | loss 5.055259 (-0.61z)| norm 0.8167 (-0.09z)| lr 
5.33e-04 | 4147.68 ms | 32.6% bf16 MFU | 126030 tok/s step 667/19560 | loss 4.947882 (-1.66z)| norm 0.7387 (-0.60z)| lr 5.34e-04 | 4148.12 ms | 32.5% bf16 MFU | 126048 tok/s step 668/19560 | loss 4.914584 (-1.95z)| norm 0.7335 (-0.63z)| lr 5.34e-04 | 4142.52 ms | 32.6% bf16 MFU | 126074 tok/s step 669/19560 | loss 4.979545 (-1.29z)| norm 0.6747 (-1.01z)| lr 5.35e-04 | 4148.94 ms | 32.5% bf16 MFU | 126089 tok/s step 670/19560 | loss 4.929076 (-1.78z)| norm 0.6783 (-0.98z)| lr 5.36e-04 | 4143.56 ms | 32.6% bf16 MFU | 126111 tok/s step 671/19560 | loss 4.977240 (-1.27z)| norm 0.6846 (-0.92z)| lr 5.37e-04 | 4215.64 ms | 32.0% bf16 MFU | 126023 tok/s step 672/19560 | loss 4.964290 (-1.38z)| norm 0.6269 (-1.31z)| lr 5.38e-04 | 4162.35 ms | 32.4% bf16 MFU | 126020 tok/s step 673/19560 | loss 4.912988 (-1.87z)| norm 0.5776 (-1.63z)| lr 5.38e-04 | 4145.96 ms | 32.6% bf16 MFU | 126042 tok/s step 674/19560 | loss 4.956793 (-1.40z)| norm 0.6036 (-1.44z)| lr 5.39e-04 | 4228.58 ms | 31.9% bf16 MFU | 125939 tok/s step 675/19560 | loss 4.910854 (-1.82z)| norm 0.7486 (-0.46z)| lr 5.40e-04 | 4146.03 ms | 32.6% bf16 MFU | 125965 tok/s step 676/19560 | loss 4.979309 (-1.13z)| norm 0.8983 (+0.56z)| lr 5.41e-04 | 4213.45 ms | 32.0% bf16 MFU | 125889 tok/s step 677/19560 | loss 5.031005 (-0.60z)| norm 0.7300 (-0.57z)| lr 5.42e-04 | 4208.60 ms | 32.1% bf16 MFU | 125823 tok/s step 678/19560 | loss 4.963340 (-1.26z)| norm 0.7535 (-0.40z)| lr 5.42e-04 | 4599.40 ms | 29.4% bf16 MFU | 125231 tok/s step 679/19560 | loss 4.948413 (-1.39z)| norm 0.7841 (-0.17z)| lr 5.43e-04 | 4807.72 ms | 28.1% bf16 MFU | 124422 tok/s step 680/19560 | loss 4.970386 (-1.15z)| norm 0.7664 (-0.29z)| lr 5.44e-04 | 4424.37 ms | 30.5% bf16 MFU | 124126 tok/s step 681/19560 | loss 4.922281 (-1.64z)| norm 0.7777 (-0.19z)| lr 5.45e-04 | 4359.61 ms | 31.0% bf16 MFU | 123933 tok/s step 682/19560 | loss 4.969556 (-1.13z)| norm 0.9196 (+0.84z)| lr 5.46e-04 | 4240.69 ms | 31.8% bf16 MFU | 123918 tok/s step 683/19560 | loss 
4.982138 (-0.99z)| norm 0.9361 (+0.98z)| lr 5.46e-04 | 4328.71 ms | 31.2% bf16 MFU | 123778 tok/s step 684/19560 | loss 4.938523 (-1.43z)| norm 0.7391 (-0.47z)| lr 5.47e-04 | 4231.99 ms | 31.9% bf16 MFU | 123783 tok/s step 685/19560 | loss 4.924671 (-1.56z)| norm 0.6747 (-0.94z)| lr 5.48e-04 | 4197.34 ms | 32.2% bf16 MFU | 123840 tok/s step 686/19560 | loss 4.935737 (-1.43z)| norm 0.6372 (-1.20z)| lr 5.49e-04 | 4184.53 ms | 32.3% bf16 MFU | 123912 tok/s step 687/19560 | loss 4.935828 (-1.42z)| norm 0.7367 (-0.46z)| lr 5.50e-04 | 4161.72 ms | 32.4% bf16 MFU | 124016 tok/s step 688/19560 | loss 4.917593 (-1.60z)| norm 0.8229 (+0.18z)| lr 5.50e-04 | 4162.62 ms | 32.4% bf16 MFU | 124112 tok/s step 689/19560 | loss 4.971920 (-1.00z)| norm 0.8775 (+0.58z)| lr 5.51e-04 | 4242.26 ms | 31.8% bf16 MFU | 124086 tok/s step 690/19560 | loss 5.008677 (-0.57z)| norm 0.8372 (+0.29z)| lr 5.52e-04 | 4160.65 ms | 32.5% bf16 MFU | 124182 tok/s step 691/19560 | loss 4.881976 (-2.00z)| norm 0.6820 (-0.85z)| lr 5.53e-04 | 4224.02 ms | 32.0% bf16 MFU | 124179 tok/s step 692/19560 | loss 4.978492 (-0.88z)| norm 0.6747 (-0.89z)| lr 5.54e-04 | 4209.14 ms | 32.1% bf16 MFU | 124198 tok/s step 693/19560 | loss 4.889892 (-1.89z)| norm 0.8047 (+0.09z)| lr 5.54e-04 | 4342.21 ms | 31.1% bf16 MFU | 124025 tok/s step 694/19560 | loss 4.930634 (-1.39z)| norm 0.8505 (+0.43z)| lr 5.55e-04 | 4160.26 ms | 32.5% bf16 MFU | 124125 tok/s step 695/19560 | loss 4.936935 (-1.30z)| norm 0.8152 (+0.19z)| lr 5.56e-04 | 4258.73 ms | 31.7% bf16 MFU | 124075 tok/s step 696/19560 | loss 4.879137 (-1.96z)| norm 0.7691 (-0.15z)| lr 5.57e-04 | 4254.45 ms | 31.7% bf16 MFU | 124032 tok/s step 697/19560 | loss 4.924022 (-1.41z)| norm 0.7135 (-0.57z)| lr 5.58e-04 | 4168.73 ms | 32.4% bf16 MFU | 124119 tok/s step 698/19560 | loss 4.952232 (-1.06z)| norm 0.8263 (+0.30z)| lr 5.58e-04 | 4265.02 ms | 31.7% bf16 MFU | 124060 tok/s step 699/19560 | loss 4.908970 (-1.58z)| norm 0.9065 (+0.91z)| lr 5.59e-04 | 4162.79 ms | 32.4% bf16 
MFU | 124154 tok/s step 700/19560 | loss 4.846983 (-2.29z)| norm 0.9971 (+1.58z)| lr 5.60e-04 | 4150.07 ms | 32.5% bf16 MFU | 124263 tok/s step 701/19560 | loss 4.974933 (-0.72z)| norm 0.9783 (+1.41z)| lr 5.61e-04 | 4183.94 ms | 32.3% bf16 MFU | 124315 tok/s step 702/19560 | loss 4.909004 (-1.50z)| norm 0.8179 (+0.20z)| lr 5.62e-04 | 4223.59 ms | 32.0% bf16 MFU | 124306 tok/s step 703/19560 | loss 4.899970 (-1.59z)| norm 0.7767 (-0.11z)| lr 5.62e-04 | 4210.41 ms | 32.1% bf16 MFU | 124317 tok/s step 704/19560 | loss 4.926623 (-1.25z)| norm 0.8141 (+0.16z)| lr 5.63e-04 | 4224.73 ms | 32.0% bf16 MFU | 124306 tok/s step 705/19560 | loss 4.886713 (-1.71z)| norm 0.7635 (-0.24z)| lr 5.64e-04 | 4162.92 ms | 32.4% bf16 MFU | 124388 tok/s step 706/19560 | loss 4.882518 (-1.74z)| norm 0.6995 (-0.75z)| lr 5.65e-04 | 4156.25 ms | 32.5% bf16 MFU | 124476 tok/s step 707/19560 | loss 4.934500 (-1.08z)| norm 0.7455 (-0.40z)| lr 5.66e-04 | 4160.09 ms | 32.5% bf16 MFU | 124553 tok/s step 708/19560 | loss 4.923108 (-1.20z)| norm 0.7151 (-0.63z)| lr 5.66e-04 | 4183.72 ms | 32.3% bf16 MFU | 124591 tok/s step 709/19560 | loss 4.848630 (-2.09z)| norm 0.6799 (-0.89z)| lr 5.67e-04 | 4228.79 ms | 31.9% bf16 MFU | 124561 tok/s step 710/19560 | loss 4.874998 (-1.73z)| norm 0.6915 (-0.79z)| lr 5.68e-04 | 4216.25 ms | 32.0% bf16 MFU | 124550 tok/s step 711/19560 | loss 4.995300 (-0.24z)| norm 0.6364 (-1.21z)| lr 5.69e-04 | 4152.83 ms | 32.5% bf16 MFU | 124635 tok/s step 712/19560 | loss 4.831094 (-2.25z)| norm 0.6824 (-0.85z)| lr 5.70e-04 | 4177.26 ms | 32.3% bf16 MFU | 124679 tok/s step 713/19560 | loss 4.866951 (-1.77z)| norm 0.6860 (-0.82z)| lr 5.70e-04 | 4156.08 ms | 32.5% bf16 MFU | 124752 tok/s step 714/19560 | loss 4.855264 (-1.89z)| norm 0.7072 (-0.64z)| lr 5.71e-04 | 4192.71 ms | 32.2% bf16 MFU | 124767 tok/s step 715/19560 | loss 4.888916 (-1.45z)| norm 0.6613 (-0.99z)| lr 5.72e-04 | 4225.71 ms | 32.0% bf16 MFU | 124732 tok/s step 716/19560 | loss 4.854022 (-1.86z)| norm 0.6286 
(-1.23z)| lr 5.73e-04 | 4248.93 ms | 31.8% bf16 MFU | 124665 tok/s step 717/19560 | loss 4.821091 (-2.24z)| norm 0.7012 (-0.64z)| lr 5.74e-04 | 4165.78 ms | 32.4% bf16 MFU | 124725 tok/s step 718/19560 | loss 4.850543 (-1.84z)| norm 0.8901 (+0.89z)| lr 5.74e-04 | 4161.32 ms | 32.4% bf16 MFU | 124788 tok/s step 719/19560 | loss 4.813461 (-2.26z)| norm 0.9626 (+1.52z)| lr 5.75e-04 | 4159.41 ms | 32.5% bf16 MFU | 124851 tok/s step 720/19560 | loss 4.809918 (-2.26z)| norm 0.6733 (-0.86z)| lr 5.76e-04 | 4207.57 ms | 32.1% bf16 MFU | 124839 tok/s step 721/19560 | loss 4.826997 (-2.01z)| norm 0.7565 (-0.17z)| lr 5.77e-04 | 4156.83 ms | 32.5% bf16 MFU | 124903 tok/s step 722/19560 | loss 4.819886 (-2.07z)| norm 0.6796 (-0.80z)| lr 5.78e-04 | 4160.66 ms | 32.5% bf16 MFU | 124959 tok/s step 723/19560 | loss 4.870213 (-1.43z)| norm 0.6813 (-0.77z)| lr 5.78e-04 | 4163.49 ms | 32.4% bf16 MFU | 125007 tok/s step 724/19560 | loss 4.848959 (-1.66z)| norm 0.8958 (+0.99z)| lr 5.79e-04 | 4229.28 ms | 31.9% bf16 MFU | 124955 tok/s step 725/19560 | loss 4.899851 (-1.02z)| norm 0.8831 (+0.87z)| lr 5.80e-04 | 4166.67 ms | 32.4% bf16 MFU | 124999 tok/s step 726/19560 | loss 4.861985 (-1.47z)| norm 0.7480 (-0.23z)| lr 5.81e-04 | 4276.71 ms | 31.6% bf16 MFU | 124878 tok/s step 727/19560 | loss 4.799066 (-2.20z)| norm 0.7195 (-0.47z)| lr 5.82e-04 | 4241.53 ms | 31.8% bf16 MFU | 124815 tok/s step 728/19560 | loss 4.878001 (-1.21z)| norm 0.9692 (+1.55z)| lr 5.82e-04 | 4175.25 ms | 32.3% bf16 MFU | 124853 tok/s step 729/19560 | loss 4.875249 (-1.23z)| norm 0.9332 (+1.24z)| lr 5.83e-04 | 4151.18 ms | 32.5% bf16 MFU | 124925 tok/s step 730/19560 | loss 4.803393 (-2.08z)| norm 0.7982 (+0.14z)| lr 5.84e-04 | 4165.48 ms | 32.4% bf16 MFU | 124972 tok/s step 731/19560 | loss 4.873191 (-1.20z)| norm 0.7509 (-0.25z)| lr 5.85e-04 | 4260.83 ms | 31.7% bf16 MFU | 124876 tok/s step 732/19560 | loss 4.842121 (-1.55z)| norm 0.8342 (+0.42z)| lr 5.86e-04 | 4194.60 ms | 32.2% bf16 MFU | 124882 tok/s step 
733/19560 | loss 4.837381 (-1.58z)| norm 0.8446 (+0.50z)| lr 5.86e-04 | 4182.91 ms | 32.3% bf16 MFU | 124904 tok/s step 734/19560 | loss 4.793472 (-2.08z)| norm 0.8375 (+0.43z)| lr 5.87e-04 | 4158.75 ms | 32.5% bf16 MFU | 124963 tok/s step 735/19560 | loss 4.803083 (-1.92z)| norm 0.6033 (-1.48z)| lr 5.88e-04 | 4182.25 ms | 32.3% bf16 MFU | 124983 tok/s step 736/19560 | loss 4.805135 (-1.85z)| norm 0.5300 (-2.04z)| lr 5.89e-04 | 4206.58 ms | 32.1% bf16 MFU | 124965 tok/s step 737/19560 | loss 4.807300 (-1.80z)| norm 0.5124 (-2.12z)| lr 5.90e-04 | 4223.70 ms | 32.0% bf16 MFU | 124923 tok/s step 738/19560 | loss 4.817965 (-1.64z)| norm 0.5590 (-1.72z)| lr 5.90e-04 | 4162.28 ms | 32.4% bf16 MFU | 124975 tok/s step 739/19560 | loss 4.730377 (-2.59z)| norm 0.5286 (-1.92z)| lr 5.91e-04 | 4149.51 ms | 32.5% bf16 MFU | 125044 tok/s step 740/19560 | loss 4.818728 (-1.54z)| norm 0.7758 (+0.02z)| lr 5.92e-04 | 4175.32 ms | 32.3% bf16 MFU | 125070 tok/s step 741/19560 | loss 4.833300 (-1.35z)| norm 1.4873 (+5.10z)| lr 5.93e-04 | 4161.51 ms | 32.4% bf16 MFU | 125116 tok/s step 742/19560 | loss 4.849831 (-1.14z)| norm 0.9309 (+1.12z)| lr 5.94e-04 | 4162.14 ms | 32.4% bf16 MFU | 125159 tok/s step 743/19560 | loss 4.909370 (-0.44z)| norm 1.1242 (+2.45z)| lr 5.94e-04 | 4168.82 ms | 32.4% bf16 MFU | 125189 tok/s step 744/19560 | loss 4.772235 (-2.02z)| norm 0.9714 (+1.35z)| lr 5.95e-04 | 4195.36 ms | 32.2% bf16 MFU | 125178 tok/s step 745/19560 | loss 4.779819 (-1.89z)| norm 0.9392 (+1.11z)| lr 5.96e-04 | 4177.23 ms | 32.3% bf16 MFU | 125194 tok/s step 746/19560 | loss 4.809926 (-1.52z)| norm 0.8197 (+0.28z)| lr 5.97e-04 | 4160.36 ms | 32.5% bf16 MFU | 125236 tok/s step 747/19560 | loss 4.808164 (-1.51z)| norm 0.9616 (+1.25z)| lr 5.98e-04 | 4158.17 ms | 32.5% bf16 MFU | 125278 tok/s step 748/19560 | loss 4.809600 (-1.48z)| norm 0.6822 (-0.69z)| lr 5.98e-04 | 4163.75 ms | 32.4% bf16 MFU | 125310 tok/s step 749/19560 | loss 4.777506 (-1.82z)| norm 0.6506 (-0.92z)| lr 5.99e-04 | 4171.99 
ms | 32.4% bf16 MFU | 125328 tok/s step 750/19560 | loss 4.827761 (-1.21z)| norm 0.5973 (-1.30z)| lr 6.00e-04 | 4209.69 ms | 32.1% bf16 MFU | 125289 tok/s val loss 4.774537 
HellaSwag: 2503/10042 = 0.249253 step 751/19560 | loss 4.778469 (-1.75z)| norm 0.6124 (-1.20z)| lr 6.01e-04 | 4322.36 ms | 31.2% bf16 MFU | 125089 tok/s step 752/19560 | loss 4.821561 (-1.24z)| norm 0.5676 (-1.51z)| lr 6.02e-04 | 4214.49 ms | 32.0% bf16 MFU | 125055 tok/s step 753/19560 | loss 4.785932 (-1.62z)| norm 0.5658 (-1.50z)| lr 6.02e-04 | 4160.31 ms | 32.5% bf16 MFU | 125103 tok/s step 754/19560 | loss 4.766896 (-1.80z)| norm 0.5488 (-1.59z)| lr 6.03e-04 | 4165.49 ms | 32.4% bf16 MFU | 125141 tok/s step 755/19560 | loss 4.784978 (-1.57z)| norm 0.7323 (-0.33z)| lr 6.04e-04 | 4178.38 ms | 32.3% bf16 MFU | 125158 tok/s step 756/19560 | loss 4.785163 (-1.54z)| norm 0.9558 (+1.21z)| lr 6.05e-04 | 4164.70 ms | 32.4% bf16 MFU | 125195 tok/s step 757/19560 | loss 4.762406 (-1.78z)| norm 0.8605 (+0.61z)| lr 6.06e-04 | 4253.16 ms | 31.7% bf16 MFU | 125098 tok/s step 758/19560 | loss 4.764597 (-1.72z)| norm 0.8342 (+0.44z)| lr 6.06e-04 | 4176.09 ms | 32.3% bf16 MFU | 125121 tok/s step 759/19560 | loss 4.774536 (-1.59z)| norm 0.8180 (+0.32z)| lr 6.07e-04 | 4178.64 ms | 32.3% bf16 MFU | 125138 tok/s step 760/19560 | loss 4.746428 (-1.87z)| norm 0.8234 (+0.36z)| lr 6.08e-04 | 4159.96 ms | 32.5% bf16 MFU | 125183 tok/s step 761/19560 | loss 4.742938 (-1.87z)| norm 0.8721 (+0.71z)| lr 6.09e-04 | 4171.92 ms | 32.4% bf16 MFU | 125207 tok/s step 762/19560 | loss 4.768251 (-1.57z)| norm 0.7053 (-0.51z)| lr 6.10e-04 | 4220.51 ms | 32.0% bf16 MFU | 125158 tok/s step 763/19560 | loss 4.815065 (-1.02z)| norm 0.6576 (-0.85z)| lr 6.10e-04 | 4171.33 ms | 32.4% bf16 MFU | 125184 tok/s step 764/19560 | loss 4.762228 (-1.60z)| norm 0.7633 (-0.08z)| lr 6.11e-04 | 4288.22 ms | 31.5% bf16 MFU | 125038 tok/s step 765/19560 | loss 4.740409 (-1.82z)| norm 0.8187 (+0.32z)| lr 6.12e-04 | 4175.92 ms | 32.3% bf16 MFU | 125064 tok/s step 766/19560 | loss 4.748540 (-1.71z)| norm 0.7971 
(+0.15z)| lr 6.13e-04 | 4288.39 ms | 31.5% bf16 MFU | 124924 tok/s step 767/19560 | loss 4.757026 (-1.59z)| norm 0.6153 (-1.19z)| lr 6.14e-04 | 4161.25 ms | 32.4% bf16 MFU | 124977 tok/s step 768/19560 | loss 4.773850 (-1.38z)| norm 0.5394 (-1.71z)| lr 6.14e-04 | 4219.56 ms | 32.0% bf16 MFU | 124941 tok/s step 769/19560 | loss 4.711832 (-2.05z)| norm 0.5156 (-1.84z)| lr 6.15e-04 | 4261.05 ms | 31.7% bf16 MFU | 124846 tok/s step 770/19560 | loss 4.688118 (-2.27z)| norm 0.4575 (-2.20z)| lr 6.16e-04 | 4190.12 ms | 32.2% bf16 MFU | 124860 tok/s step 771/19560 | loss 4.639513 (-2.76z)| norm 0.4853 (-1.96z)| lr 6.17e-04 | 4166.76 ms | 32.4% bf16 MFU | 124908 tok/s step 772/19560 | loss 4.674007 (-2.32z)| norm 0.4752 (-1.98z)| lr 6.18e-04 | 4314.65 ms | 31.3% bf16 MFU | 124738 tok/s step 773/19560 | loss 4.678781 (-2.21z)| norm 0.5753 (-1.27z)| lr 6.18e-04 | 4177.69 ms | 32.3% bf16 MFU | 124776 tok/s step 774/19560 | loss 4.727099 (-1.65z)| norm 0.6179 (-0.98z)| lr 6.19e-04 | 4158.55 ms | 32.5% bf16 MFU | 124841 tok/s step 775/19560 | loss 4.681107 (-2.11z)| norm 0.6871 (-0.50z)| lr 6.20e-04 | 4171.05 ms | 32.4% bf16 MFU | 124884 tok/s step 776/19560 | loss 4.684444 (-2.03z)| norm 0.7819 (+0.15z)| lr 6.21e-04 | 4163.84 ms | 32.4% bf16 MFU | 124936 tok/s step 777/19560 | loss 4.639415 (-2.45z)| norm 0.7326 (-0.18z)| lr 6.22e-04 | 4152.18 ms | 32.5% bf16 MFU | 125002 tok/s step 778/19560 | loss 4.670554 (-2.06z)| norm 0.7837 (+0.18z)| lr 6.22e-04 | 4165.41 ms | 32.4% bf16 MFU | 125045 tok/s step 779/19560 | loss 4.674607 (-1.98z)| norm 0.7529 (-0.02z)| lr 6.23e-04 | 4163.03 ms | 32.4% bf16 MFU | 125090 tok/s step 780/19560 | loss 4.632906 (-2.35z)| norm 0.7614 (+0.04z)| lr 6.24e-04 | 4288.05 ms | 31.5% bf16 MFU | 124949 tok/s step 781/19560 | loss 4.707112 (-1.56z)| norm 0.6915 (-0.45z)| lr 6.25e-04 | 4175.97 ms | 32.3% bf16 MFU | 124979 tok/s step 782/19560 | loss 4.669050 (-1.91z)| norm 0.6442 (-0.78z)| lr 6.26e-04 | 4160.66 ms | 32.5% bf16 MFU | 125031 tok/s step 
783/19560 | loss 4.691822 (-1.65z)| norm 0.6577 (-0.69z)| lr 6.26e-04 | 4164.24 ms | 32.4% bf16 MFU | 125074 tok/s step 784/19560 | loss 4.660362 (-1.93z)| norm 0.7092 (-0.34z)| lr 6.27e-04 | 4160.92 ms | 32.4% bf16 MFU | 125121 tok/s step 785/19560 | loss 4.675730 (-1.74z)| norm 0.8226 (+0.45z)| lr 6.28e-04 | 4201.28 ms | 32.1% bf16 MFU | 125104 tok/s step 786/19560 | loss 4.708588 (-1.39z)| norm 0.7721 (+0.10z)| lr 6.29e-04 | 4163.93 ms | 32.4% bf16 MFU | 125145 tok/s step 787/19560 | loss 4.687268 (-1.58z)| norm 0.7252 (-0.22z)| lr 6.30e-04 | 4181.70 ms | 32.3% bf16 MFU | 125156 tok/s step 788/19560 | loss 4.708984 (-1.34z)| norm 0.7474 (-0.05z)| lr 6.30e-04 | 4173.20 ms | 32.4% bf16 MFU | 125180 tok/s step 789/19560 | loss 4.736519 (-1.05z)| norm 0.8329 (+0.55z)| lr 6.31e-04 | 4187.23 ms | 32.2% bf16 MFU | 125182 tok/s step 790/19560 | loss 4.694128 (-1.45z)| norm 1.0393 (+1.97z)| lr 6.32e-04 | 4190.68 ms | 32.2% bf16 MFU | 125178 tok/s step 791/19560 | loss 4.666558 (-1.69z)| norm 0.8190 (+0.44z)| lr 6.33e-04 | 4165.29 ms | 32.4% bf16 MFU | 125212 tok/s step 792/19560 | loss 4.697896 (-1.36z)| norm 0.6795 (-0.52z)| lr 6.34e-04 | 4170.68 ms | 32.4% bf16 MFU | 125237 tok/s step 793/19560 | loss 4.700431 (-1.31z)| norm 0.6240 (-0.90z)| lr 6.34e-04 | 4174.32 ms | 32.3% bf16 MFU | 125255 tok/s step 794/19560 | loss 4.679413 (-1.51z)| norm 0.6856 (-0.46z)| lr 6.35e-04 | 4159.37 ms | 32.5% bf16 MFU | 125295 tok/s step 795/19560 | loss 4.686420 (-1.41z)| norm 0.7203 (-0.21z)| lr 6.36e-04 | 4290.56 ms | 31.5% bf16 MFU | 125140 tok/s step 796/19560 | loss 4.690137 (-1.35z)| norm 0.6424 (-0.76z)| lr 6.37e-04 | 4166.92 ms | 32.4% bf16 MFU | 125174 tok/s step 797/19560 | loss 4.668926 (-1.54z)| norm 0.6153 (-0.94z)| lr 6.38e-04 | 4305.38 ms | 31.4% bf16 MFU | 125004 tok/s step 798/19560 | loss 4.685796 (-1.35z)| norm 0.5414 (-1.44z)| lr 6.38e-04 | 4219.17 ms | 32.0% bf16 MFU | 124967 tok/s step 799/19560 | loss 4.655886 (-1.62z)| norm 0.5349 (-1.47z)| lr 6.39e-04 | 4171.16 
ms | 32.4% bf16 MFU | 125003 tok/s step 800/19560 | loss 4.610274 (-2.03z)| norm 0.5033 (-1.66z)| lr 6.40e-04 | 4171.22 ms | 32.4% bf16 MFU | 125038 tok/s step 801/19560 | loss 4.638105 (-1.72z)| norm 0.5063 (-1.63z)| lr 6.41e-04 | 4146.83 ms | 32.6% bf16 MFU | 125108 tok/s step 802/19560 | loss 4.654684 (-1.53z)| norm 0.5071 (-1.61z)| lr 6.42e-04 | 4192.11 ms | 32.2% bf16 MFU | 125105 tok/s step 803/19560 | loss 4.637156 (-1.67z)| norm 0.6418 (-0.69z)| lr 6.42e-04 | 4181.15 ms | 32.3% bf16 MFU | 125120 tok/s step 804/19560 | loss 4.626716 (-1.74z)| norm 0.7380 (-0.03z)| lr 6.43e-04 | 4160.10 ms | 32.5% bf16 MFU | 125165 tok/s step 805/19560 | loss 4.649947 (-1.50z)| norm 0.6516 (-0.61z)| lr 6.44e-04 | 4174.39 ms | 32.3% bf16 MFU | 125187 tok/s step 806/19560 | loss 4.660238 (-1.38z)| norm 0.5978 (-0.97z)| lr 6.45e-04 | 4161.46 ms | 32.4% bf16 MFU | 125227 tok/s step 807/19560 | loss 4.592956 (-2.00z)| norm 0.5439 (-1.31z)| lr 6.46e-04 | 4177.53 ms | 32.3% bf16 MFU | 125241 tok/s step 808/19560 | loss 4.644852 (-1.47z)| norm 0.5343 (-1.35z)| lr 6.46e-04 | 4191.44 ms | 32.2% bf16 MFU | 125233 tok/s step 809/19560 | loss 4.619428 (-1.69z)| norm 0.5961 (-0.93z)| lr 6.47e-04 | 4170.96 ms | 32.4% bf16 MFU | 125256 tok/s step 810/19560 | loss 4.663431 (-1.25z)| norm 0.6051 (-0.85z)| lr 6.48e-04 | 4158.12 ms | 32.5% bf16 MFU | 125298 tok/s step 811/19560 | loss 4.614120 (-1.70z)| norm 0.6658 (-0.44z)| lr 6.49e-04 | 4207.29 ms | 32.1% bf16 MFU | 125263 tok/s step 812/19560 | loss 4.674767 (-1.09z)| norm 0.7131 (-0.12z)| lr 6.50e-04 | 4175.41 ms | 32.3% bf16 MFU | 125279 tok/s step 813/19560 | loss 4.582102 (-1.96z)| norm 0.6252 (-0.71z)| lr 6.50e-04 | 4171.80 ms | 32.4% bf16 MFU | 125298 tok/s step 814/19560 | loss 4.574955 (-1.99z)| norm 0.6241 (-0.71z)| lr 6.51e-04 | 4178.50 ms | 32.3% bf16 MFU | 125307 tok/s step 815/19560 | loss 4.637428 (-1.37z)| norm 0.7947 (+0.43z)| lr 6.52e-04 | 4197.28 ms | 32.2% bf16 MFU | 125287 tok/s step 816/19560 | loss 4.586499 (-1.82z)| 
norm 0.8071 (+0.51z)| lr 6.53e-04 | 4162.54 ms | 32.4% bf16 MFU | 125321 tok/s step 817/19560 | loss 4.576790 (-1.89z)| norm 0.7490 (+0.13z)| lr 6.54e-04 | 4167.04 ms | 32.4% bf16 MFU | 125345 tok/s step 818/19560 | loss 4.675198 (-0.93z)| norm 0.8427 (+0.76z)| lr 6.54e-04 | 4185.44 ms | 32.3% bf16 MFU | 125341 tok/s step 819/19560 | loss 4.632705 (-1.32z)| norm 0.7151 (-0.10z)| lr 6.55e-04 | 4160.21 ms | 32.5% bf16 MFU | 125376 tok/s step 820/19560 | loss 4.676918 (-0.88z)| norm 0.6077 (-0.82z)| lr 6.56e-04 | 4181.40 ms | 32.3% bf16 MFU | 125376 tok/s step 821/19560 | loss 4.562176 (-1.98z)| norm 0.6402 (-0.59z)| lr 6.57e-04 | 4154.90 ms | 32.5% bf16 MFU | 125417 tok/s step 822/19560 | loss 4.619846 (-1.39z)| norm 0.5568 (-1.13z)| lr 6.58e-04 | 4173.21 ms | 32.4% bf16 MFU | 125427 tok/s step 823/19560 | loss 4.597594 (-1.59z)| norm 0.6219 (-0.69z)| lr 6.58e-04 | 4174.30 ms | 32.3% bf16 MFU | 125436 tok/s step 824/19560 | loss 4.621654 (-1.33z)| norm 0.5829 (-0.93z)| lr 6.59e-04 | 4194.94 ms | 32.2% bf16 MFU | 125413 tok/s step 825/19560 | loss 4.729202 (-0.25z)| norm 0.5998 (-0.81z)| lr 6.60e-04 | 4165.70 ms | 32.4% bf16 MFU | 125435 tok/s step 826/19560 | loss 4.631137 (-1.22z)| norm 0.6529 (-0.45z)| lr 6.61e-04 | 4262.88 ms | 31.7% bf16 MFU | 125313 tok/s step 827/19560 | loss 4.601118 (-1.50z)| norm 0.7346 (+0.10z)| lr 6.62e-04 | 4167.33 ms | 32.4% bf16 MFU | 125338 tok/s step 828/19560 | loss 4.596626 (-1.52z)| norm 0.6134 (-0.70z)| lr 6.62e-04 | 4163.11 ms | 32.4% bf16 MFU | 125368 tok/s step 829/19560 | loss 4.579836 (-1.67z)| norm 0.6095 (-0.71z)| lr 6.63e-04 | 4155.80 ms | 32.5% bf16 MFU | 125407 tok/s step 830/19560 | loss 4.563219 (-1.81z)| norm 0.7208 (+0.05z)| lr 6.64e-04 | 4261.30 ms | 31.7% bf16 MFU | 125289 tok/s step 831/19560 | loss 4.570433 (-1.71z)| norm 0.6805 (-0.22z)| lr 6.65e-04 | 4177.66 ms | 32.3% bf16 MFU | 125299 tok/s step 832/19560 | loss 4.602357 (-1.37z)| norm 0.6671 (-0.30z)| lr 6.66e-04 | 4172.68 ms | 32.4% bf16 MFU | 125317 tok/s 
step 833/19560 | loss 4.631524 (-1.05z)| norm 0.6646 (-0.31z)| lr 6.66e-04 | 4161.33 ms | 32.4% bf16 MFU | 125350 tok/s step 834/19560 | loss 4.620246 (-1.15z)| norm 0.6549 (-0.38z)| lr 6.67e-04 | 4193.39 ms | 32.2% bf16 MFU | 125334 tok/s step 835/19560 | loss 4.518282 (-2.16z)| norm 0.5271 (-1.24z)| lr 6.68e-04 | 4196.08 ms | 32.2% bf16 MFU | 125315 tok/s step 836/19560 | loss 4.616009 (-1.15z)| norm 0.5105 (-1.33z)| lr 6.69e-04 | 4178.59 ms | 32.3% bf16 MFU | 125323 tok/s step 837/19560 | loss 4.681556 (-0.46z)| norm 0.4985 (-1.39z)| lr 6.70e-04 | 4155.98 ms | 32.5% bf16 MFU | 125364 tok/s step 838/19560 | loss 4.578411 (-1.51z)| norm 0.5627 (-0.95z)| lr 6.70e-04 | 4165.45 ms | 32.4% bf16 MFU | 125389 tok/s step 839/19560 | loss 4.558648 (-1.72z)| norm 0.6027 (-0.68z)| lr 6.71e-04 | 4188.25 ms | 32.2% bf16 MFU | 125379 tok/s step 840/19560 | loss 4.584711 (-1.42z)| norm 0.5320 (-1.14z)| lr 6.72e-04 | 4161.08 ms | 32.4% bf16 MFU | 125410 tok/s step 841/19560 | loss 4.609302 (-1.14z)| norm 0.5326 (-1.12z)| lr 6.73e-04 | 4260.91 ms | 31.7% bf16 MFU | 125292 tok/s step 842/19560 | loss 4.565956 (-1.58z)| norm 0.4895 (-1.38z)| lr 6.74e-04 | 4150.52 ms | 32.5% bf16 MFU | 125343 tok/s step 843/19560 | loss 4.569363 (-1.52z)| norm 0.5265 (-1.13z)| lr 6.74e-04 | 4168.38 ms | 32.4% bf16 MFU | 125365 tok/s step 844/19560 | loss 4.563829 (-1.55z)| norm 0.5652 (-0.87z)| lr 6.75e-04 | 4151.60 ms | 32.5% bf16 MFU | 125411 tok/s step 845/19560 | loss 4.549452 (-1.68z)| norm 0.6084 (-0.58z)| lr 6.76e-04 | 4169.72 ms | 32.4% bf16 MFU | 125427 tok/s step 846/19560 | loss 4.591236 (-1.21z)| norm 0.6573 (-0.25z)| lr 6.77e-04 | 4152.91 ms | 32.5% bf16 MFU | 125468 tok/s step 847/19560 | loss 4.558561 (-1.54z)| norm 0.7415 (+0.32z)| lr 6.78e-04 | 4265.29 ms | 31.7% bf16 MFU | 125341 tok/s step 848/19560 | loss 4.567895 (-1.41z)| norm 0.6833 (-0.07z)| lr 6.78e-04 | 4151.86 ms | 32.5% bf16 MFU | 125387 tok/s step 849/19560 | loss 4.558532 (-1.49z)| norm 0.6551 (-0.25z)| lr 6.79e-04 | 
4198.88 ms | 32.2% bf16 MFU | 125361 tok/s step 850/19560 | loss 4.542081 (-1.63z)| norm 0.6598 (-0.22z)| lr 6.80e-04 | 4204.09 ms | 32.1% bf16 MFU | 125329 tok/s step 851/19560 | loss 4.518017 (-1.86z)| norm 0.6607 (-0.21z)| lr 6.81e-04 | 4219.19 ms | 32.0% bf16 MFU | 125275 tok/s step 852/19560 | loss 4.538621 (-1.61z)| norm 0.6417 (-0.33z)| lr 6.82e-04 | 4156.18 ms | 32.5% bf16 MFU | 125319 tok/s step 853/19560 | loss 4.619905 (-0.74z)| norm 0.6506 (-0.26z)| lr 6.82e-04 | 4199.47 ms | 32.2% bf16 MFU | 125295 tok/s step 854/19560 | loss 4.581282 (-1.15z)| norm 0.6791 (-0.06z)| lr 6.83e-04 | 4192.55 ms | 32.2% bf16 MFU | 125283 tok/s step 855/19560 | loss 4.595481 (-0.98z)| norm 0.7541 (+0.44z)| lr 6.84e-04 | 4166.67 ms | 32.4% bf16 MFU | 125310 tok/s step 856/19560 | loss 4.549773 (-1.46z)| norm 0.7634 (+0.52z)| lr 6.85e-04 | 4195.56 ms | 32.2% bf16 MFU | 125293 tok/s step 857/19560 | loss 4.534484 (-1.61z)| norm 0.5933 (-0.63z)| lr 6.86e-04 | 4160.99 ms | 32.4% bf16 MFU | 125328 tok/s step 858/19560 | loss 4.554863 (-1.36z)| norm 0.5570 (-0.87z)| lr 6.86e-04 | 4187.17 ms | 32.2% bf16 MFU | 125323 tok/s step 859/19560 | loss 4.545123 (-1.46z)| norm 0.5918 (-0.62z)| lr 6.87e-04 | 4163.86 ms | 32.4% bf16 MFU | 125352 tok/s step 860/19560 | loss 4.574536 (-1.11z)| norm 0.5506 (-0.89z)| lr 6.88e-04 | 4184.02 ms | 32.3% bf16 MFU | 125350 tok/s step 861/19560 | loss 4.465695 (-2.29z)| norm 0.5866 (-0.63z)| lr 6.89e-04 | 4177.31 ms | 32.3% bf16 MFU | 125358 tok/s step 862/19560 | loss 4.504282 (-1.82z)| norm 0.6552 (-0.14z)| lr 6.90e-04 | 4162.33 ms | 32.4% bf16 MFU | 125388 tok/s step 863/19560 | loss 4.572845 (-1.05z)| norm 0.7342 (+0.40z)| lr 6.90e-04 | 4172.29 ms | 32.4% bf16 MFU | 125402 tok/s step 864/19560 | loss 4.551353 (-1.27z)| norm 0.6345 (-0.30z)| lr 6.91e-04 | 4164.11 ms | 32.4% bf16 MFU | 125427 tok/s step 865/19560 | loss 4.561305 (-1.14z)| norm 0.5649 (-0.79z)| lr 6.92e-04 | 4159.28 ms | 32.5% bf16 MFU | 125458 tok/s step 866/19560 | loss 4.539996 
(-1.36z)| norm 0.5552 (-0.86z)| lr 6.93e-04 | 4180.30 ms | 32.3% bf16 MFU | 125456 tok/s step 867/19560 | loss 4.535639 (-1.39z)| norm 0.5676 (-0.78z)| lr 6.94e-04 | 4166.07 ms | 32.4% bf16 MFU | 125476 tok/s step 868/19560 | loss 4.588253 (-0.78z)| norm 0.6328 (-0.31z)| lr 6.94e-04 | 4169.33 ms | 32.4% bf16 MFU | 125489 tok/s step 869/19560 | loss 4.536111 (-1.36z)| norm 0.6318 (-0.31z)| lr 6.95e-04 | 4827.67 ms | 28.0% bf16 MFU | 124645 tok/s step 870/19560 | loss 4.466614 (-2.13z)| norm 0.5343 (-1.10z)| lr 6.96e-04 | 4546.99 ms | 29.7% bf16 MFU | 124178 tok/s step 871/19560 | loss 4.542026 (-1.26z)| norm 0.4889 (-1.50z)| lr 6.97e-04 | 4594.44 ms | 29.4% bf16 MFU | 123675 tok/s step 872/19560 | loss 4.484527 (-1.91z)| norm 0.4538 (-1.80z)| lr 6.98e-04 | 4427.86 ms | 30.5% bf16 MFU | 123411 tok/s step 873/19560 | loss 4.537972 (-1.26z)| norm 0.4287 (-2.00z)| lr 6.98e-04 | 4369.84 ms | 30.9% bf16 MFU | 123240 tok/s step 874/19560 | loss 4.467872 (-2.05z)| norm 0.5016 (-1.34z)| lr 6.99e-04 | 4276.75 ms | 31.6% bf16 MFU | 123207 tok/s step 875/19560 | loss 4.580858 (-0.71z)| norm 0.6662 (+0.15z)| lr 7.00e-04 | 4210.03 ms | 32.1% bf16 MFU | 123273 tok/s step 876/19560 | loss 4.509547 (-1.54z)| norm 0.6365 (-0.12z)| lr 7.01e-04 | 4330.18 ms | 31.2% bf16 MFU | 123164 tok/s step 877/19560 | loss 4.525856 (-1.33z)| norm 0.5897 (-0.54z)| lr 7.02e-04 | 4240.39 ms | 31.8% bf16 MFU | 123188 tok/s step 878/19560 | loss 4.483490 (-1.82z)| norm 0.5613 (-0.80z)| lr 7.02e-04 | 4299.12 ms | 31.4% bf16 MFU | 123126 tok/s step 879/19560 | loss 4.520597 (-1.35z)| norm 0.5842 (-0.59z)| lr 7.03e-04 | 4197.78 ms | 32.2% bf16 MFU | 123214 tok/s step 880/19560 | loss 4.491661 (-1.69z)| norm 0.6109 (-0.35z)| lr 7.04e-04 | 4246.06 ms | 31.8% bf16 MFU | 123227 tok/s step 881/19560 | loss 4.497468 (-1.60z)| norm 0.7300 (+0.73z)| lr 7.05e-04 | 4185.59 ms | 32.3% bf16 MFU | 123329 tok/s step 882/19560 | loss 4.473304 (-1.87z)| norm 0.9290 (+2.47z)| lr 7.06e-04 | 4204.68 ms | 32.1% bf16 MFU | 
123397 tok/s step 883/19560 | loss 4.545970 (-0.96z)| norm 0.7865 (+1.19z)| lr 7.06e-04 | 4173.11 ms | 32.4% bf16 MFU | 123509 tok/s step 884/19560 | loss 4.509681 (-1.40z)| norm 0.6637 (+0.12z)| lr 7.07e-04 | 4192.58 ms | 32.2% bf16 MFU | 123586 tok/s step 885/19560 | loss 4.524345 (-1.20z)| norm 0.5683 (-0.75z)| lr 7.08e-04 | 4183.49 ms | 32.3% bf16 MFU | 123673 tok/s step 886/19560 | loss 4.540132 (-0.98z)| norm 0.5737 (-0.69z)| lr 7.09e-04 | 4169.74 ms | 32.4% bf16 MFU | 123776 tok/s step 887/19560 | loss 4.557403 (-0.75z)| norm 0.7527 (+1.01z)| lr 7.10e-04 | 4160.43 ms | 32.5% bf16 MFU | 123888 tok/s step 888/19560 | loss 4.518888 (-1.24z)| norm 0.7877 (+1.35z)| lr 7.10e-04 | 4225.29 ms | 32.0% bf16 MFU | 123898 tok/s step 889/19560 | loss 4.551944 (-0.79z)| norm 0.7622 (+1.13z)| lr 7.11e-04 | 4167.46 ms | 32.4% bf16 MFU | 123993 tok/s step 890/19560 | loss 4.655784 (+0.60z)| norm 0.6215 (-0.22z)| lr 7.12e-04 | 4236.23 ms | 31.9% bf16 MFU | 123982 tok/s step 891/19560 | loss 4.502502 (-1.46z)| norm 0.5251 (-1.14z)| lr 7.13e-04 | 4176.98 ms | 32.3% bf16 MFU | 124059 tok/s step 892/19560 | loss 4.441628 (-2.25z)| norm 0.5032 (-1.32z)| lr 7.14e-04 | 4218.69 ms | 32.0% bf16 MFU | 124070 tok/s step 893/19560 | loss 4.534604 (-0.96z)| norm 0.5253 (-1.10z)| lr 7.14e-04 | 4173.08 ms | 32.4% bf16 MFU | 124148 tok/s step 894/19560 | loss 4.487700 (-1.60z)| norm 0.5368 (-0.97z)| lr 7.15e-04 | 4167.50 ms | 32.4% bf16 MFU | 124231 tok/s step 895/19560 | loss 4.488434 (-1.57z)| norm 0.4587 (-1.70z)| lr 7.16e-04 | 4179.13 ms | 32.3% bf16 MFU | 124292 tok/s step 896/19560 | loss 4.526225 (-1.03z)| norm 0.5371 (-0.95z)| lr 7.17e-04 | 4168.77 ms | 32.4% bf16 MFU | 124366 tok/s step 897/19560 | loss 4.451459 (-2.06z)| norm 0.5709 (-0.63z)| lr 7.18e-04 | 4170.33 ms | 32.4% bf16 MFU | 124433 tok/s step 898/19560 | loss 4.454444 (-1.97z)| norm 0.6051 (-0.31z)| lr 7.18e-04 | 4166.38 ms | 32.4% bf16 MFU | 124503 tok/s step 899/19560 | loss 4.554613 (-0.55z)| norm 0.7144 (+0.74z)| lr 
7.19e-04 | 4162.14 ms | 32.4% bf16 MFU | 124577 tok/s step 900/19560 | loss 4.602521 (+0.13z)| norm 0.6610 (+0.20z)| lr 7.20e-04 | 4173.46 ms | 32.4% bf16 MFU | 124629 tok/s step 901/19560 | loss 4.450830 (-1.97z)| norm 0.6784 (+0.37z)| lr 7.21e-04 | 4197.16 ms | 32.2% bf16 MFU | 124643 tok/s step 902/19560 | loss 4.485016 (-1.48z)| norm 0.7195 (+0.76z)| lr 7.22e-04 | 4168.13 ms | 32.4% bf16 MFU | 124700 tok/s step 903/19560 | loss 4.490767 (-1.37z)| norm 0.7055 (+0.62z)| lr 7.22e-04 | 4178.12 ms | 32.3% bf16 MFU | 124739 tok/s step 904/19560 | loss 4.494521 (-1.30z)| norm 0.5903 (-0.50z)| lr 7.23e-04 | 4231.11 ms | 31.9% bf16 MFU | 124698 tok/s step 905/19560 | loss 4.456745 (-1.79z)| norm 0.6028 (-0.37z)| lr 7.24e-04 | 4171.92 ms | 32.4% bf16 MFU | 124747 tok/s step 906/19560 | loss 4.418434 (-2.27z)| norm 0.6199 (-0.19z)| lr 7.25e-04 | 4181.57 ms | 32.3% bf16 MFU | 124778 tok/s step 907/19560 | loss 4.461708 (-1.64z)| norm 0.4457 (-1.90z)| lr 7.26e-04 | 4167.08 ms | 32.4% bf16 MFU | 124830 tok/s step 908/19560 | loss 4.501718 (-1.08z)| norm 0.4132 (-2.17z)| lr 7.26e-04 | 4186.29 ms | 32.3% bf16 MFU | 124851 tok/s step 909/19560 | loss 4.472439 (-1.46z)| norm 0.3599 (-2.60z)| lr 7.27e-04 | 4173.94 ms | 32.3% bf16 MFU | 124889 tok/s step 910/19560 | loss 4.446550 (-1.78z)| norm 0.3637 (-2.48z)| lr 7.28e-04 | 4178.42 ms | 32.3% bf16 MFU | 124918 tok/s step 911/19560 | loss 4.486189 (-1.22z)| norm 0.4350 (-1.77z)| lr 7.29e-04 | 4157.78 ms | 32.5% bf16 MFU | 124977 tok/s step 912/19560 | loss 4.431569 (-1.92z)| norm 0.5353 (-0.83z)| lr 7.30e-04 | 4215.68 ms | 32.0% bf16 MFU | 124947 tok/s step 913/19560 | loss 4.511168 (-0.83z)| norm 0.6287 (+0.04z)| lr 7.30e-04 | 4198.79 ms | 32.2% bf16 MFU | 124943 tok/s step 914/19560 | loss 4.512395 (-0.80z)| norm 0.6881 (+0.61z)| lr 7.31e-04 | 4163.80 ms | 32.4% bf16 MFU | 124991 tok/s step 915/19560 | loss 4.487620 (-1.13z)| norm 0.6715 (+0.46z)| lr 7.32e-04 | 4180.09 ms | 32.3% bf16 MFU | 125013 tok/s step 916/19560 | loss 
4.502373 (-0.91z)| norm 0.6321 (+0.10z)| lr 7.33e-04 | 4167.10 ms | 32.4% bf16 MFU | 125053 tok/s step 917/19560 | loss 4.491939 (-1.05z)| norm 0.6106 (-0.09z)| lr 7.34e-04 | 4173.28 ms | 32.4% bf16 MFU | 125082 tok/s step 918/19560 | loss 4.516711 (-0.69z)| norm 0.5114 (-1.08z)| lr 7.34e-04 | 4166.88 ms | 32.4% bf16 MFU | 125119 tok/s step 919/19560 | loss 4.432543 (-1.86z)| norm 0.5043 (-1.14z)| lr 7.35e-04 | 4170.94 ms | 32.4% bf16 MFU | 125148 tok/s step 920/19560 | loss 4.495738 (-0.94z)| norm 0.5526 (-0.63z)| lr 7.36e-04 | 4166.34 ms | 32.4% bf16 MFU | 125183 tok/s step 921/19560 | loss 4.438362 (-1.76z)| norm 0.5431 (-0.72z)| lr 7.37e-04 | 4177.32 ms | 32.3% bf16 MFU | 125199 tok/s step 922/19560 | loss 4.434662 (-1.78z)| norm 0.5463 (-0.67z)| lr 7.38e-04 | 4163.71 ms | 32.4% bf16 MFU | 125235 tok/s step 923/19560 | loss 4.437167 (-1.72z)| norm 0.5619 (-0.50z)| lr 7.38e-04 | 4167.46 ms | 32.4% bf16 MFU | 125263 tok/s step 924/19560 | loss 4.428723 (-1.82z)| norm 0.6097 (+0.00z)| lr 7.39e-04 | 4168.25 ms | 32.4% bf16 MFU | 125289 tok/s step 925/19560 | loss 4.452988 (-1.44z)| norm 0.6535 (+0.46z)| lr 7.40e-04 | 4178.60 ms | 32.3% bf16 MFU | 125298 tok/s step 926/19560 | loss 4.445318 (-1.54z)| norm 0.6335 (+0.24z)| lr 7.41e-04 | 4175.75 ms | 32.3% bf16 MFU | 125311 tok/s step 927/19560 | loss 4.446301 (-1.50z)| norm 0.5978 (-0.14z)| lr 7.42e-04 | 4169.80 ms | 32.4% bf16 MFU | 125332 tok/s step 928/19560 | loss 4.462555 (-1.24z)| norm 0.5808 (-0.33z)| lr 7.42e-04 | 4170.31 ms | 32.4% bf16 MFU | 125352 tok/s step 929/19560 | loss 4.444210 (-1.49z)| norm 0.5748 (-0.39z)| lr 7.43e-04 | 4165.65 ms | 32.4% bf16 MFU | 125377 tok/s step 930/19560 | loss 4.427887 (-1.70z)| norm 0.6004 (-0.13z)| lr 7.44e-04 | 4172.96 ms | 32.4% bf16 MFU | 125390 tok/s step 931/19560 | loss 4.454129 (-1.29z)| norm 0.5275 (-0.90z)| lr 7.45e-04 | 4164.92 ms | 32.4% bf16 MFU | 125415 tok/s step 932/19560 | loss 4.421069 (-1.75z)| norm 0.4510 (-1.68z)| lr 7.46e-04 | 4166.16 ms | 32.4% bf16 
MFU | 125436 tok/s step 933/19560 | loss 4.462124 (-1.13z)| norm 0.4463 (-1.69z)| lr 7.46e-04 | 4184.20 ms | 32.3% bf16 MFU | 125429 tok/s step 934/19560 | loss 4.389355 (-2.16z)| norm 0.4339 (-1.78z)| lr 7.47e-04 | 4171.80 ms | 32.4% bf16 MFU | 125442 tok/s step 935/19560 | loss 4.421536 (-1.65z)| norm 0.4583 (-1.51z)| lr 7.48e-04 | 4187.08 ms | 32.2% bf16 MFU | 125430 tok/s step 936/19560 | loss 4.423028 (-1.60z)| norm 0.4966 (-1.12z)| lr 7.49e-04 | 4170.44 ms | 32.4% bf16 MFU | 125445 tok/s step 937/19560 | loss 4.437092 (-1.38z)| norm 0.5181 (-0.89z)| lr 7.50e-04 | 4168.11 ms | 32.4% bf16 MFU | 125462 tok/s step 938/19560 | loss 4.440048 (-1.32z)| norm 0.5530 (-0.53z)| lr 7.50e-04 | 4178.25 ms | 32.3% bf16 MFU | 125463 tok/s step 939/19560 | loss 4.429322 (-1.45z)| norm 0.5410 (-0.64z)| lr 7.51e-04 | 4179.16 ms | 32.3% bf16 MFU | 125462 tok/s step 940/19560 | loss 4.456115 (-1.05z)| norm 0.5969 (-0.06z)| lr 7.52e-04 | 4168.19 ms | 32.4% bf16 MFU | 125478 tok/s step 941/19560 | loss 4.442075 (-1.24z)| norm 0.7004 (+0.98z)| lr 7.53e-04 | 4171.44 ms | 32.4% bf16 MFU | 125489 tok/s step 942/19560 | loss 4.426468 (-1.44z)| norm 0.5952 (-0.08z)| lr 7.54e-04 | 4180.90 ms | 32.3% bf16 MFU | 125484 tok/s step 943/19560 | loss 4.388001 (-1.97z)| norm 0.6366 (+0.36z)| lr 7.54e-04 | 4181.91 ms | 32.3% bf16 MFU | 125478 tok/s step 944/19560 | loss 4.446301 (-1.10z)| norm 0.7223 (+1.26z)| lr 7.55e-04 | 4160.07 ms | 32.5% bf16 MFU | 125506 tok/s step 945/19560 | loss 4.437746 (-1.20z)| norm 0.6973 (+1.01z)| lr 7.56e-04 | 4199.03 ms | 32.2% bf16 MFU | 125474 tok/s step 946/19560 | loss 4.455983 (-0.93z)| norm 0.6975 (+1.05z)| lr 7.57e-04 | 4167.21 ms | 32.4% bf16 MFU | 125491 tok/s step 947/19560 | loss 4.438318 (-1.18z)| norm 0.5992 (+0.00z)| lr 7.58e-04 | 4168.47 ms | 32.4% bf16 MFU | 125505 tok/s step 948/19560 | loss 4.484137 (-0.48z)| norm 0.5955 (-0.04z)| lr 7.58e-04 | 4173.53 ms | 32.4% bf16 MFU | 125511 tok/s step 949/19560 | loss 4.416857 (-1.48z)| norm 0.5683 
(-0.33z)| lr 7.59e-04 | 4161.67 ms | 32.4% bf16 MFU | 125534 tok/s step 950/19560 | loss 4.493985 (-0.30z)| norm 0.5855 (-0.14z)| lr 7.60e-04 | 4181.83 ms | 32.3% bf16 MFU | 125526 tok/s step 951/19560 | loss 4.320972 (-2.83z)| norm 0.5807 (-0.19z)| lr 7.61e-04 | 4173.20 ms | 32.4% bf16 MFU | 125531 tok/s step 952/19560 | loss 4.454673 (-0.83z)| norm 0.5499 (-0.52z)| lr 7.62e-04 | 4177.38 ms | 32.3% bf16 MFU | 125530 tok/s step 953/19560 | loss 4.383013 (-1.92z)| norm 0.5012 (-1.03z)| lr 7.62e-04 | 4186.64 ms | 32.2% bf16 MFU | 125515 tok/s step 954/19560 | loss 4.423600 (-1.28z)| norm 0.4989 (-1.04z)| lr 7.63e-04 | 4169.20 ms | 32.4% bf16 MFU | 125527 tok/s step 955/19560 | loss 4.405912 (-1.53z)| norm 0.4391 (-1.65z)| lr 7.64e-04 | 4157.28 ms | 32.5% bf16 MFU | 125556 tok/s step 956/19560 | loss 4.372148 (-2.01z)| norm 0.4541 (-1.47z)| lr 7.65e-04 | 4175.64 ms | 32.3% bf16 MFU | 125556 tok/s step 957/19560 | loss 4.426527 (-1.15z)| norm 0.4772 (-1.21z)| lr 7.66e-04 | 4188.40 ms | 32.2% bf16 MFU | 125537 tok/s step 958/19560 | loss 4.424277 (-1.17z)| norm 0.4688 (-1.28z)| lr 7.66e-04 | 4169.44 ms | 32.4% bf16 MFU | 125548 tok/s step 959/19560 | loss 4.420020 (-1.22z)| norm 0.4521 (-1.43z)| lr 7.67e-04 | 4262.01 ms | 31.7% bf16 MFU | 125421 tok/s step 960/19560 | loss 4.397873 (-1.53z)| norm 0.5171 (-0.73z)| lr 7.68e-04 | 4170.16 ms | 32.4% bf16 MFU | 125436 tok/s step 961/19560 | loss 4.411617 (-1.31z)| norm 0.5162 (-0.73z)| lr 7.69e-04 | 4220.70 ms | 32.0% bf16 MFU | 125375 tok/s step 962/19560 | loss 4.469806 (-0.39z)| norm 0.4499 (-1.40z)| lr 7.70e-04 | 4184.18 ms | 32.3% bf16 MFU | 125372 tok/s step 963/19560 | loss 4.418027 (-1.19z)| norm 0.4648 (-1.23z)| lr 7.70e-04 | 4169.09 ms | 32.4% bf16 MFU | 125391 tok/s step 964/19560 | loss 4.459504 (-0.53z)| norm 0.5178 (-0.68z)| lr 7.71e-04 | 4164.96 ms | 32.4% bf16 MFU | 125415 tok/s step 965/19560 | loss 4.369368 (-1.96z)| norm 0.4134 (-1.75z)| lr 7.72e-04 | 4166.91 ms | 32.4% bf16 MFU | 125436 tok/s step 
966/19560 | loss 4.355897 (-2.12z)| norm 0.4068 (-1.78z)| lr 7.73e-04 | 4171.12 ms | 32.4% bf16 MFU | 125449 tok/s step 967/19560 | loss 4.374527 (-1.79z)| norm 0.4201 (-1.62z)| lr 7.74e-04 | 4175.74 ms | 32.3% bf16 MFU | 125454 tok/s step 968/19560 | loss 4.403804 (-1.31z)| norm 0.4180 (-1.61z)| lr 7.74e-04 | 4167.15 ms | 32.4% bf16 MFU | 125472 tok/s step 969/19560 | loss 4.393148 (-1.46z)| norm 0.4576 (-1.20z)| lr 7.75e-04 | 4167.26 ms | 32.4% bf16 MFU | 125489 tok/s step 970/19560 | loss 4.360379 (-1.94z)| norm 0.5591 (-0.20z)| lr 7.76e-04 | 4176.54 ms | 32.3% bf16 MFU | 125491 tok/s step 971/19560 | loss 4.384562 (-1.53z)| norm 0.5325 (-0.47z)| lr 7.77e-04 | 4169.36 ms | 32.4% bf16 MFU | 125504 tok/s step 972/19560 | loss 4.326016 (-2.39z)| norm 0.4686 (-1.09z)| lr 7.78e-04 | 4161.48 ms | 32.4% bf16 MFU | 125528 tok/s step 973/19560 | loss 4.453434 (-0.39z)| norm 0.5062 (-0.71z)| lr 7.78e-04 | 4173.15 ms | 32.4% bf16 MFU | 125533 tok/s step 974/19560 | loss 4.373767 (-1.62z)| norm 0.5586 (-0.18z)| lr 7.79e-04 | 4176.36 ms | 32.3% bf16 MFU | 125534 tok/s step 975/19560 | loss 4.413142 (-0.99z)| norm 0.5243 (-0.51z)| lr 7.80e-04 | 4170.10 ms | 32.4% bf16 MFU | 125543 tok/s step 976/19560 | loss 4.412557 (-0.98z)| norm 0.4952 (-0.79z)| lr 7.81e-04 | 4169.07 ms | 32.4% bf16 MFU | 125554 tok/s step 977/19560 | loss 4.395491 (-1.23z)| norm 0.4971 (-0.76z)| lr 7.82e-04 | 4169.81 ms | 32.4% bf16 MFU | 125563 tok/s step 978/19560 | loss 4.404534 (-1.07z)| norm 0.4656 (-1.06z)| lr 7.82e-04 | 4254.40 ms | 31.7% bf16 MFU | 125446 tok/s step 979/19560 | loss 4.344734 (-1.97z)| norm 0.4564 (-1.13z)| lr 7.83e-04 | 4172.78 ms | 32.4% bf16 MFU | 125456 tok/s step 980/19560 | loss 4.418473 (-0.80z)| norm 0.5570 (-0.12z)| lr 7.84e-04 | 4166.86 ms | 32.4% bf16 MFU | 125475 tok/s step 981/19560 | loss 4.432181 (-0.58z)| norm 0.5833 (+0.15z)| lr 7.85e-04 | 4171.54 ms | 32.4% bf16 MFU | 125485 tok/s step 982/19560 | loss 4.342342 (-1.98z)| norm 0.5729 (+0.06z)| lr 7.86e-04 | 4185.47 
ms | 32.3% bf16 MFU | 125474 tok/s step 983/19560 | loss 4.406124 (-0.95z)| norm 0.5582 (-0.08z)| lr 7.86e-04 | 4165.16 ms | 32.4% bf16 MFU | 125494 tok/s step 984/19560 | loss 4.356274 (-1.73z)| norm 0.5458 (-0.19z)| lr 7.87e-04 | 4173.40 ms | 32.4% bf16 MFU | 125501 tok/s step 985/19560 | loss 4.336526 (-2.00z)| norm 0.6283 (+0.67z)| lr 7.88e-04 | 4249.47 ms | 31.8% bf16 MFU | 125394 tok/s step 986/19560 | loss 4.365319 (-1.51z)| norm 0.6556 (+0.94z)| lr 7.89e-04 | 4168.82 ms | 32.4% bf16 MFU | 125413 tok/s step 987/19560 | loss 4.472692 (+0.20z)| norm 0.6354 (+0.73z)| lr 7.90e-04 | 4163.69 ms | 32.4% bf16 MFU | 125438 tok/s step 988/19560 | loss 4.458416 (-0.01z)| norm 0.6718 (+1.09z)| lr 7.90e-04 | 4169.05 ms | 32.4% bf16 MFU | 125454 tok/s step 989/19560 | loss 4.367878 (-1.46z)| norm 0.5898 (+0.24z)| lr 7.91e-04 | 4171.96 ms | 32.4% bf16 MFU | 125465 tok/s step 990/19560 | loss 4.350567 (-1.70z)| norm 0.4339 (-1.35z)| lr 7.92e-04 | 4172.40 ms | 32.4% bf16 MFU | 125474 tok/s step 991/19560 | loss 4.385617 (-1.13z)| norm 0.3874 (-1.80z)| lr 7.93e-04 | 4167.59 ms | 32.4% bf16 MFU | 125491 tok/s step 992/19560 | loss 4.390435 (-1.03z)| norm 0.4224 (-1.41z)| lr 7.94e-04 | 4160.67 ms | 32.5% bf16 MFU | 125517 tok/s step 993/19560 | loss 4.398395 (-0.89z)| norm 0.5049 (-0.56z)| lr 7.94e-04 | 4176.10 ms | 32.3% bf16 MFU | 125518 tok/s step 994/19560 | loss 4.424597 (-0.45z)| norm 0.6429 (+0.84z)| lr 7.95e-04 | 4179.36 ms | 32.3% bf16 MFU | 125515 tok/s step 995/19560 | loss 4.369436 (-1.34z)| norm 0.5753 (+0.15z)| lr 7.96e-04 | 4169.45 ms | 32.4% bf16 MFU | 125526 tok/s step 996/19560 | loss 4.401885 (-0.80z)| norm 0.5696 (+0.10z)| lr 7.97e-04 | 4188.02 ms | 32.2% bf16 MFU | 125509 tok/s step 997/19560 | loss 4.450872 (+0.04z)| norm 0.8890 (+3.21z)| lr 7.98e-04 | 4165.94 ms | 32.4% bf16 MFU | 125526 tok/s step 998/19560 | loss 4.309929 (-2.28z)| norm 0.5099 (-0.51z)| lr 7.98e-04 | 4173.54 ms | 32.4% bf16 MFU | 125531 tok/s step 999/19560 | loss 4.451076 (+0.07z)| 
norm 0.4994 (-0.61z)| lr 7.99e-04 | 4185.85 ms | 32.3% bf16 MFU | 125517 tok/s step 1000/19560 | loss 4.406842 (-0.66z)| norm 0.5471 (-0.15z)| lr 8.00e-04 | 4176.11 ms | 32.3% bf16 MFU | 125519 tok/s val loss 4.396316 HellaSwag: 2507/10042 = 0.249651 step 1001/19560 | loss 4.362445 (-1.38z)| norm 0.5726 (+0.09z)| lr 8.00e-04 | 4522.07 ms | 29.9% bf16 MFU | 125040 tok/s step 1002/19560 | loss 4.390395 (-0.90z)| norm 0.6646 (+0.98z)| lr 8.00e-04 | 4243.93 ms | 31.8% bf16 MFU | 124965 tok/s step 1003/19560 | loss 4.421407 (-0.37z)| norm 0.6006 (+0.36z)| lr 8.00e-04 | 4171.19 ms | 32.4% bf16 MFU | 125001 tok/s step 1004/19560 | loss 4.362018 (-1.36z)| norm 0.5602 (-0.04z)| lr 8.00e-04 | 4221.76 ms | 32.0% bf16 MFU | 124960 tok/s step 1005/19560 | loss 4.451331 (+0.17z)| norm 0.6991 (+1.33z)| lr 8.00e-04 | 4179.21 ms | 32.3% bf16 MFU | 124985 tok/s step 1006/19560 | loss 4.426068 (-0.25z)| norm 0.7104 (+1.41z)| lr 8.00e-04 | 4234.93 ms | 31.9% bf16 MFU | 124926 tok/s step 1007/19560 | loss 4.437082 (-0.05z)| norm 0.7179 (+1.47z)| lr 8.00e-04 | 4160.58 ms | 32.5% bf16 MFU | 124980 tok/s step 1008/19560 | loss 4.389723 (-0.86z)| norm 0.7435 (+1.69z)| lr 8.00e-04 | 4243.08 ms | 31.8% bf16 MFU | 124909 tok/s step 1009/19560 | loss 4.341100 (-1.67z)| norm 0.7200 (+1.46z)| lr 8.00e-04 | 4179.35 ms | 32.3% bf16 MFU | 124936 tok/s step 1010/19560 | loss 4.437389 (-0.01z)| norm 0.5660 (+0.01z)| lr 8.00e-04 | 4228.16 ms | 31.9% bf16 MFU | 124889 tok/s step 1011/19560 | loss 4.419374 (-0.30z)| norm 0.7134 (+1.52z)| lr 8.00e-04 | 4359.08 ms | 31.0% bf16 MFU | 124658 tok/s step 1012/19560 | loss 4.403487 (-0.57z)| norm 0.5391 (-0.25z)| lr 8.00e-04 | 4342.97 ms | 31.1% bf16 MFU | 124462 tok/s step 1013/19560 | loss 4.406980 (-0.50z)| norm 0.5885 (+0.25z)| lr 8.00e-04 | 4229.59 ms | 31.9% bf16 MFU | 124436 tok/s step 1014/19560 | loss 4.400863 (-0.59z)| norm 0.8731 (+3.04z)| lr 8.00e-04 | 4264.17 ms | 31.7% bf16 MFU | 124362 tok/s step 1015/19560 | loss 4.400024 (-0.60z)| norm 0.9527 (+3.65z)| lr 8.00e-04 | 4187.68 ms | 32.2% bf16 MFU | 124404 tok/s
step 1016/19560 | loss 4.403224 (-0.53z)| norm 0.6771 (+1.06z)| lr 8.00e-04 | 4635.00 ms | 29.1% bf16 MFU | 123839 tok/s step 1017/19560 | loss 4.469620 (+0.72z)| norm 2.8967 (+10.08z)| lr 8.00e-04 | 4322.89 ms | 31.2% bf16 MFU | 123712 tok/s step 1018/19560 | loss 4.529792 (+1.97z)| norm 0.8955 (+1.34z)| lr 8.00e-04 | 4449.47 ms | 30.3% bf16 MFU | 123418 tok/s step 1019/19560 | loss 4.452545 (+0.45z)| norm 0.8918 (+1.30z)| lr 8.00e-04 | 4193.78 ms | 32.2% bf16 MFU | 123497 tok/s step 1020/19560 | loss 4.446712 (+0.34z)| norm 0.6607 (+0.31z)| lr 8.00e-04 | 4253.04 ms | 31.7% bf16 MFU | 123486 tok/s step 1021/19560 | loss 4.374471 (-1.10z)| norm 0.6382 (+0.20z)| lr 8.00e-04 | 4198.31 ms | 32.2% bf16 MFU | 123556 tok/s step 1022/19560 | loss 4.459855 (+0.64z)| norm 0.8339 (+1.03z)| lr 8.00e-04 | 4175.07 ms | 32.3% bf16 MFU | 123657 tok/s step 1023/19560 | loss 4.447304 (+0.39z)| norm 1.0802 (+2.03z)| lr 8.00e-04 | 4160.91 ms | 32.4% bf16 MFU | 123774 tok/s step 1024/19560 | loss 4.423904 (-0.07z)| norm 0.7332 (+0.56z)| lr 8.00e-04 | 4297.20 ms | 31.4% bf16 MFU | 123686 tok/s step 1025/19560 | loss 4.372311 (-1.12z)| norm 0.5597 (-0.17z)| lr 8.00e-04 | 4173.01 ms | 32.4% bf16 MFU | 123784 tok/s step 1026/19560 | loss 4.389621 (-0.76z)| norm 0.5577 (-0.17z)| lr 8.00e-04 | 4158.50 ms | 32.5% bf16 MFU | 123898 tok/s step 1027/19560 | loss 4.418212 (-0.15z)| norm 0.5804 (-0.07z)| lr 8.00e-04 | 4536.43 ms | 29.8% bf16 MFU | 123482 tok/s step 1028/19560 | loss 4.428072 (+0.10z)| norm 0.5622 (-0.15z)| lr 8.00e-04 | 4190.79 ms | 32.2% bf16 MFU | 123563 tok/s step 1029/19560 | loss 4.354392 (-1.53z)| norm 0.4364 (-0.67z)| lr 8.00e-04 | 5343.54 ms | 25.3% bf16 MFU | 122291 tok/s step 1030/19560 | loss 4.374397 (-1.07z)| norm 0.4225 (-0.72z)| lr 8.00e-04 | 4199.51 ms | 32.2% bf16 MFU | 122418 tok/s step 1031/19560 | loss 4.392414 (-0.65z)| norm 0.4472 (-0.60z)| lr 8.00e-04 | 4244.80 ms | 31.8% bf16 MFU | 122473 tok/s step 1032/19560 | loss 4.360818 (-1.35z)| norm 0.4290 
(-0.67z)| lr 8.00e-04 | 4169.07 ms | 32.4% bf16 MFU | 122637 tok/s step 1033/19560 | loss 4.367455 (-1.18z)| norm 0.3997 (-0.79z)| lr 8.00e-04 | 4177.73 ms | 32.3% bf16 MFU | 122780 tok/s step 1034/19560 | loss 4.421120 (+0.03z)| norm 0.4160 (-0.71z)| lr 8.00e-04 | 4261.19 ms | 31.7% bf16 MFU | 122793 tok/s step 1035/19560 | loss 4.425641 (+0.14z)| norm 0.4246 (-0.68z)| lr 8.00e-04 | 4211.60 ms | 32.1% bf16 MFU | 122878 tok/s step 1036/19560 | loss 4.402012 (-0.38z)| norm 0.4021 (-0.77z)| lr 8.00e-04 | 4177.08 ms | 32.3% bf16 MFU | 123010 tok/s step 1037/19560 | loss 4.326835 (-2.06z)| norm 0.3762 (-0.88z)| lr 8.00e-04 | 4314.09 ms | 31.3% bf16 MFU | 122936 tok/s step 1038/19560 | loss 4.363061 (-1.22z)| norm 0.3393 (-1.03z)| lr 8.00e-04 | 4170.07 ms | 32.4% bf16 MFU | 123075 tok/s step 1039/19560 | loss 4.411586 (-0.11z)| norm 0.3705 (-0.90z)| lr 8.00e-04 | 4246.07 ms | 31.8% bf16 MFU | 123095 tok/s step 1040/19560 | loss 4.389899 (-0.59z)| norm 0.4055 (-0.75z)| lr 8.00e-04 | 4166.27 ms | 32.4% bf16 MFU | 123233 tok/s step 1041/19560 | loss 4.343446 (-1.64z)| norm 0.3761 (-0.86z)| lr 8.00e-04 | 4177.83 ms | 32.3% bf16 MFU | 123346 tok/s step 1042/19560 | loss 4.295437 (-2.68z)| norm 0.4590 (-0.51z)| lr 8.00e-04 | 4478.12 ms | 30.2% bf16 MFU | 123032 tok/s step 1043/19560 | loss 4.373633 (-0.88z)| norm 0.4876 (-0.38z)| lr 8.00e-04 | 4213.08 ms | 32.0% bf16 MFU | 123103 tok/s step 1044/19560 | loss 4.269982 (-3.16z)| norm 0.4117 (-0.69z)| lr 8.00e-04 | 4158.49 ms | 32.5% bf16 MFU | 123251 tok/s step 1045/19560 | loss 4.336246 (-1.64z)| norm 0.3986 (-0.73z)| lr 8.00e-04 | 4208.66 ms | 32.1% bf16 MFU | 123318 tok/s step 1046/19560 | loss 4.339036 (-1.57z)| norm 0.4004 (-0.72z)| lr 8.00e-04 | 4240.98 ms | 31.8% bf16 MFU | 123333 tok/s step 1047/19560 | loss 4.351702 (-1.26z)| norm 0.4139 (-0.66z)| lr 8.00e-04 | 4239.93 ms | 31.8% bf16 MFU | 123349 tok/s step 1048/19560 | loss 4.316717 (-2.02z)| norm 0.4115 (-0.67z)| lr 8.00e-04 | 4314.95 ms | 31.3% bf16 MFU | 123257 
tok/s step 1049/19560 | loss 4.345711 (-1.34z)| norm 0.4030 (-0.69z)| lr 8.00e-04 | 4158.03 ms | 32.5% bf16 MFU | 123398 tok/s step 1050/19560 | loss 4.345003 (-1.33z)| norm 0.4523 (-0.49z)| lr 8.00e-04 | 4176.15 ms | 32.3% bf16 MFU | 123506 tok/s step 1051/19560 | loss 4.279112 (-2.71z)| norm 0.4756 (-0.39z)| lr 8.00e-04 | 4195.87 ms | 32.2% bf16 MFU | 123578 tok/s step 1052/19560 | loss 4.288475 (-2.43z)| norm 0.4285 (-0.58z)| lr 8.00e-04 | 4182.46 ms | 32.3% bf16 MFU | 123667 tok/s step 1053/19560 | loss 4.333733 (-1.43z)| norm 0.3414 (-0.92z)| lr 8.00e-04 | 4181.72 ms | 32.3% bf16 MFU | 123752 tok/s step 1054/19560 | loss 4.209426 (-3.82z)| norm 0.5070 (-0.24z)| lr 8.00e-04 | 4164.01 ms | 32.4% bf16 MFU | 123860 tok/s step 1055/19560 | loss 4.349304 (-0.99z)| norm 0.3187 (-1.00z)| lr 8.00e-04 | 4234.01 ms | 31.9% bf16 MFU | 123859 tok/s step 1056/19560 | loss 4.253550 (-2.81z)| norm 0.3629 (-0.81z)| lr 8.00e-04 | 4170.19 ms | 32.4% bf16 MFU | 123952 tok/s step 1057/19560 | loss 4.275777 (-2.31z)| norm 0.3890 (-0.70z)| lr 8.00e-04 | 4163.56 ms | 32.4% bf16 MFU | 124050 tok/s step 1058/19560 | loss 4.423371 (+0.54z)| norm 0.4386 (-0.49z)| lr 8.00e-04 | 4167.59 ms | 32.4% bf16 MFU | 124138 tok/s step 1059/19560 | loss 4.321957 (-1.39z)| norm 0.5100 (-0.20z)| lr 8.00e-04 | 4338.80 ms | 31.1% bf16 MFU | 123973 tok/s step 1060/19560 | loss 4.310558 (-1.58z)| norm 0.5745 (+0.06z)| lr 8.00e-04 | 5173.45 ms | 26.1% bf16 MFU | 122841 tok/s step 1061/19560 | loss 4.301366 (-1.72z)| norm 0.5665 (+0.02z)| lr 8.00e-04 | 4331.35 ms | 31.2% bf16 MFU | 122751 tok/s step 1062/19560 | loss 4.324615 (-1.27z)| norm 0.6478 (+0.34z)| lr 8.00e-04 | 4215.82 ms | 32.0% bf16 MFU | 122832 tok/s step 1063/19560 | loss 4.295536 (-1.78z)| norm 0.6377 (+0.30z)| lr 8.00e-04 | 4301.15 ms | 31.4% bf16 MFU | 122785 tok/s step 1064/19560 | loss 4.328828 (-1.14z)| norm 0.6429 (+0.31z)| lr 8.00e-04 | 4257.85 ms | 31.7% bf16 MFU | 122803 tok/s step 1065/19560 | loss 4.341374 (-0.89z)| norm 0.4864 
(-0.32z)| lr 8.00e-04 | 4248.58 ms | 31.8% bf16 MFU | 122833 tok/s step 1066/19560 | loss 4.317457 (-1.31z)| norm 0.3803 (-0.74z)| lr 8.00e-04 | 4169.72 ms | 32.4% bf16 MFU | 122978 tok/s step 1067/19560 | loss 4.362181 (-0.47z)| norm 0.3995 (-0.66z)| lr 8.00e-04 | 4348.71 ms | 31.0% bf16 MFU | 122857 tok/s step 1068/19560 | loss 4.306365 (-1.48z)| norm 0.3405 (-0.89z)| lr 8.00e-04 | 4227.74 ms | 31.9% bf16 MFU | 122915 tok/s step 1069/19560 | loss 4.427187 (+0.76z)| norm 0.3377 (-0.88z)| lr 8.00e-04 | 4164.36 ms | 32.4% bf16 MFU | 123064 tok/s step 1070/19560 | loss 4.299241 (-1.59z)| norm 0.4295 (-0.51z)| lr 8.00e-04 | 4201.56 ms | 32.1% bf16 MFU | 123150 tok/s step 1071/19560 | loss 4.351629 (-0.62z)| norm 0.3964 (-0.63z)| lr 8.00e-04 | 4179.07 ms | 32.3% bf16 MFU | 123265 tok/s step 1072/19560 | loss 4.275220 (-1.97z)| norm 0.3688 (-0.73z)| lr 8.00e-04 | 4309.88 ms | 31.3% bf16 MFU | 123184 tok/s step 1073/19560 | loss 4.416855 (+0.61z)| norm 0.4509 (-0.40z)| lr 8.00e-04 | 4164.75 ms | 32.4% bf16 MFU | 123319 tok/s step 1074/19560 | loss 4.384098 (+0.02z)| norm 0.4571 (-0.37z)| lr 8.00e-04 | 4211.26 ms | 32.1% bf16 MFU | 123378 tok/s step 1075/19560 | loss 4.281689 (-1.82z)| norm 0.4685 (-0.32z)| lr 8.00e-04 | 4166.85 ms | 32.4% bf16 MFU | 123501 tok/s step 1076/19560 | loss 4.373288 (-0.14z)| norm 0.4098 (-0.54z)| lr 8.00e-04 | 4163.24 ms | 32.4% bf16 MFU | 123622 tok/s step 1077/19560 | loss 4.323480 (-1.04z)| norm 0.3815 (-0.65z)| lr 8.00e-04 | 4197.05 ms | 32.2% bf16 MFU | 123687 tok/s step 1078/19560 | loss 4.323171 (-1.04z)| norm 0.3984 (-0.58z)| lr 8.00e-04 | 4167.01 ms | 32.4% bf16 MFU | 123794 tok/s step 1079/19560 | loss 4.284929 (-1.73z)| norm 0.4099 (-0.53z)| lr 8.00e-04 | 4173.05 ms | 32.4% bf16 MFU | 123886 tok/s step 1080/19560 | loss 4.356582 (-0.40z)| norm 0.3867 (-0.61z)| lr 8.00e-04 | 4159.23 ms | 32.5% bf16 MFU | 123994 tok/s step 1081/19560 | loss 4.333667 (-0.81z)| norm 0.4098 (-0.52z)| lr 8.00e-04 | 4173.74 ms | 32.3% bf16 MFU | 124075 
tok/s step 1082/19560 | loss 4.335463 (-0.77z)| norm 0.4486 (-0.36z)| lr 8.00e-04 | 4215.73 ms | 32.0% bf16 MFU | 124090 tok/s step 1083/19560 | loss 4.387332 (+0.20z)| norm 0.5562 (+0.06z)| lr 8.00e-04 | 4159.04 ms | 32.5% bf16 MFU | 124188 tok/s step 1084/19560 | loss 4.330088 (-0.86z)| norm 0.4194 (-0.48z)| lr 8.00e-04 | 4171.85 ms | 32.4% bf16 MFU | 124262 tok/s step 1085/19560 | loss 4.331426 (-0.82z)| norm 0.4084 (-0.52z)| lr 8.00e-04 | 4172.48 ms | 32.4% bf16 MFU | 124332 tok/s step 1086/19560 | loss 4.303910 (-1.31z)| norm 0.3988 (-0.56z)| lr 8.00e-04 | 4249.05 ms | 31.8% bf16 MFU | 124285 tok/s step 1087/19560 | loss 4.310440 (-1.17z)| norm 0.4088 (-0.51z)| lr 8.00e-04 | 4252.62 ms | 31.7% bf16 MFU | 124235 tok/s step 1088/19560 | loss 4.263587 (-1.98z)| norm 0.4462 (-0.36z)| lr 8.00e-04 | 4199.91 ms | 32.1% bf16 MFU | 124265 tok/s step 1089/19560 | loss 4.339001 (-0.60z)| norm 0.5121 (-0.10z)| lr 8.00e-04 | 4181.73 ms | 32.3% bf16 MFU | 124320 tok/s step 1090/19560 | loss 4.325466 (-0.84z)| norm 0.5799 (+0.16z)| lr 8.00e-04 | 4213.93 ms | 32.0% bf16 MFU | 124325 tok/s step 1091/19560 | loss 4.375282 (+0.08z)| norm 0.5655 (+0.10z)| lr 8.00e-04 | 4169.69 ms | 32.4% bf16 MFU | 124396 tok/s step 1092/19560 | loss 4.305171 (-1.19z)| norm 0.5712 (+0.12z)| lr 8.00e-04 | 4238.83 ms | 31.9% bf16 MFU | 124360 tok/s step 1093/19560 | loss 4.382021 (+0.23z)| norm 0.4390 (-0.40z)| lr 8.00e-04 | 4170.11 ms | 32.4% bf16 MFU | 124429 tok/s step 1094/19560 | loss 4.303027 (-1.22z)| norm 0.3829 (-0.62z)| lr 8.00e-04 | 4180.96 ms | 32.3% bf16 MFU | 124477 tok/s step 1095/19560 | loss 4.330169 (-0.71z)| norm 0.3836 (-0.62z)| lr 8.00e-04 | 4179.84 ms | 32.3% bf16 MFU | 124525 tok/s step 1096/19560 | loss 4.324187 (-0.81z)| norm 0.4339 (-0.42z)| lr 8.00e-04 | 4183.96 ms | 32.3% bf16 MFU | 124564 tok/s step 1097/19560 | loss 4.310148 (-1.05z)| norm 0.4405 (-0.39z)| lr 8.00e-04 | 4179.08 ms | 32.3% bf16 MFU | 124609 tok/s step 1098/19560 | loss 4.221309 (-2.59z)| norm 0.4196 
(-0.47z)| lr 8.00e-04 | 4184.07 ms | 32.3% bf16 MFU | 124644 tok/s step 1099/19560 | loss 4.292018 (-1.30z)| norm 0.3807 (-0.62z)| lr 8.00e-04 | 4186.25 ms | 32.3% bf16 MFU | 124673 tok/s step 1100/19560 | loss 4.324409 (-0.73z)| norm 0.3784 (-0.63z)| lr 8.00e-04 | 4165.17 ms | 32.4% bf16 MFU | 124733 tok/s step 1101/19560 | loss 4.251965 (-1.97z)| norm 0.3837 (-0.60z)| lr 8.00e-04 | 4175.56 ms | 32.3% bf16 MFU | 124775 tok/s step 1102/19560 | loss 4.334247 (-0.52z)| norm 0.3678 (-0.66z)| lr 8.00e-04 | 4183.87 ms | 32.3% bf16 MFU | 124802 tok/s step 1103/19560 | loss 4.275638 (-1.52z)| norm 0.3578 (-0.69z)| lr 8.00e-04 | 4183.58 ms | 32.3% bf16 MFU | 124828 tok/s step 1104/19560 | loss 4.295272 (-1.16z)| norm 0.4144 (-0.46z)| lr 8.00e-04 | 4173.56 ms | 32.4% bf16 MFU | 124867 tok/s step 1105/19560 | loss 4.303271 (-1.01z)| norm 0.4386 (-0.37z)| lr 8.00e-04 | 4181.89 ms | 32.3% bf16 MFU | 124892 tok/s step 1106/19560 | loss 4.280529 (-1.38z)| norm 0.4344 (-0.38z)| lr 8.00e-04 | 4242.60 ms | 31.8% bf16 MFU | 124827 tok/s step 1107/19560 | loss 4.282955 (-1.32z)| norm 0.4581 (-0.29z)| lr 8.00e-04 | 4179.12 ms | 32.3% bf16 MFU | 124858 tok/s step 1108/19560 | loss 4.224988 (-2.25z)| norm 0.4065 (-0.49z)| lr 8.00e-04 | 4179.79 ms | 32.3% bf16 MFU | 124887 tok/s step 1109/19560 | loss 4.395872 (+0.64z)| norm 0.4254 (-0.41z)| lr 8.00e-04 | 4184.95 ms | 32.3% bf16 MFU | 124906 tok/s step 1110/19560 | loss 4.223157 (-2.23z)| norm 0.4379 (-0.35z)| lr 8.00e-04 | 4192.63 ms | 32.2% bf16 MFU | 124914 tok/s step 1111/19560 | loss 4.337910 (-0.31z)| norm 0.3995 (-0.50z)| lr 8.00e-04 | 4188.03 ms | 32.2% bf16 MFU | 124927 tok/s step 1112/19560 | loss 4.289674 (-1.10z)| norm 0.4434 (-0.32z)| lr 8.00e-04 | 4173.92 ms | 32.3% bf16 MFU | 124961 tok/s step 1113/19560 | loss 4.312870 (-0.71z)| norm 0.4185 (-0.42z)| lr 8.00e-04 | 4185.10 ms | 32.3% bf16 MFU | 124977 tok/s step 1114/19560 | loss 4.301633 (-0.89z)| norm 0.4042 (-0.46z)| lr 8.00e-04 | 4174.67 ms | 32.3% bf16 MFU | 125008 
tok/s step 1115/19560 | loss 4.229674 (-2.04z)| norm 0.4217 (-0.39z)| lr 8.00e-04 | 4194.74 ms | 32.2% bf16 MFU | 125007 tok/s step 1116/19560 | loss 4.266252 (-1.42z)| norm 0.3818 (-0.54z)| lr 8.00e-04 | 4179.59 ms | 32.3% bf16 MFU | 125028 tok/s step 1117/19560 | loss 4.250833 (-1.64z)| norm 0.3575 (-0.62z)| lr 8.00e-04 | 4175.31 ms | 32.3% bf16 MFU | 125055 tok/s step 1118/19560 | loss 4.284619 (-1.07z)| norm 0.3508 (-0.65z)| lr 8.00e-04 | 4218.86 ms | 32.0% bf16 MFU | 125016 tok/s step 1119/19560 | loss 4.290088 (-0.97z)| norm 0.3652 (-0.59z)| lr 8.00e-04 | 4275.56 ms | 31.6% bf16 MFU | 124897 tok/s step 1120/19560 | loss 4.294566 (-0.88z)| norm 0.3408 (-0.68z)| lr 8.00e-04 | 4205.31 ms | 32.1% bf16 MFU | 124885 tok/s step 1121/19560 | loss 4.243443 (-1.68z)| norm 0.3706 (-0.56z)| lr 8.00e-04 | 4254.19 ms | 31.7% bf16 MFU | 124803 tok/s step 1122/19560 | loss 4.300099 (-0.75z)| norm 0.4316 (-0.32z)| lr 8.00e-04 | 4201.51 ms | 32.1% bf16 MFU | 124802 tok/s step 1123/19560 | loss 4.291986 (-0.87z)| norm 0.5285 (+0.06z)| lr 8.00e-04 | 4721.66 ms | 28.6% bf16 MFU | 124114 tok/s step 1124/19560 | loss 4.281430 (-1.03z)| norm 0.5370 (+0.10z)| lr 8.00e-04 | 4227.38 ms | 31.9% bf16 MFU | 124110 tok/s step 1125/19560 | loss 4.226421 (-1.88z)| norm 0.4533 (-0.22z)| lr 8.00e-04 | 4614.62 ms | 29.3% bf16 MFU | 123585 tok/s step 1126/19560 | loss 4.257241 (-1.37z)| norm 0.4089 (-0.39z)| lr 8.00e-04 | 4173.33 ms | 32.4% bf16 MFU | 123687 tok/s step 1127/19560 | loss 4.256367 (-1.37z)| norm 0.4376 (-0.27z)| lr 8.00e-04 | 4210.68 ms | 32.1% bf16 MFU | 123728 tok/s step 1128/19560 | loss 4.316529 (-0.39z)| norm 0.3912 (-0.45z)| lr 8.00e-04 | 4176.96 ms | 32.3% bf16 MFU | 123818 tok/s step 1129/19560 | loss 4.237487 (-1.63z)| norm 0.3407 (-0.64z)| lr 8.00e-04 | 4194.57 ms | 32.2% bf16 MFU | 123877 tok/s step 1130/19560 | loss 4.345499 (+0.10z)| norm 0.3425 (-0.62z)| lr 8.00e-04 | 4184.16 ms | 32.3% bf16 MFU | 123948 tok/s step 1131/19560 | loss 4.325583 (-0.21z)| norm 0.4498 
(-0.20z)| lr 8.00e-04 | 4177.41 ms | 32.3% bf16 MFU | 124026 tok/s step 1132/19560 | loss 4.257616 (-1.28z)| norm 0.5684 (+0.27z)| lr 8.00e-04 | 4243.74 ms | 31.8% bf16 MFU | 124002 tok/s step 1133/19560 | loss 4.311854 (-0.40z)| norm 0.4742 (-0.10z)| lr 8.00e-04 | 4190.08 ms | 32.2% bf16 MFU | 124058 tok/s step 1134/19560 | loss 4.287608 (-0.78z)| norm 0.3696 (-0.50z)| lr 8.00e-04 | 4205.80 ms | 32.1% bf16 MFU | 124088 tok/s step 1135/19560 | loss 4.241671 (-1.51z)| norm 0.3407 (-0.60z)| lr 8.00e-04 | 4174.40 ms | 32.3% bf16 MFU | 124163 tok/s step 1136/19560 | loss 4.272146 (-1.00z)| norm 0.3380 (-0.60z)| lr 8.00e-04 | 4176.55 ms | 32.3% bf16 MFU | 124232 tok/s step 1137/19560 | loss 4.365473 (+0.52z)| norm 0.3623 (-0.50z)| lr 8.00e-04 | 4178.62 ms | 32.3% bf16 MFU | 124294 tok/s step 1138/19560 | loss 4.214641 (-1.90z)| norm 0.4004 (-0.34z)| lr 8.00e-04 | 4177.35 ms | 32.3% bf16 MFU | 124354 tok/s step 1139/19560 | loss 4.222409 (-1.74z)| norm 0.3572 (-0.50z)| lr 8.00e-04 | 4178.59 ms | 32.3% bf16 MFU | 124410 tok/s step 1140/19560 | loss 4.251235 (-1.26z)| norm 0.3310 (-0.60z)| lr 8.00e-04 | 4178.26 ms | 32.3% bf16 MFU | 124464 tok/s step 1141/19560 | loss 4.302524 (-0.42z)| norm 0.3420 (-0.55z)| lr 8.00e-04 | 4171.03 ms | 32.4% bf16 MFU | 124525 tok/s step 1142/19560 | loss 4.216874 (-1.77z)| norm 0.3577 (-0.47z)| lr 8.00e-04 | 4177.03 ms | 32.3% bf16 MFU | 124575 tok/s step 1143/19560 | loss 4.257027 (-1.11z)| norm 0.4108 (-0.25z)| lr 8.00e-04 | 4177.56 ms | 32.3% bf16 MFU | 124621 tok/s step 1144/19560 | loss 4.323726 (-0.02z)| norm 0.4676 (-0.01z)| lr 8.00e-04 | 4197.88 ms | 32.2% bf16 MFU | 124635 tok/s step 1145/19560 | loss 4.282283 (-0.68z)| norm 0.4823 (+0.27z)| lr 8.00e-04 | 4174.96 ms | 32.3% bf16 MFU | 124682 tok/s step 1146/19560 | loss 4.249920 (-1.24z)| norm 0.4856 (+0.34z)| lr 8.00e-04 | 4183.55 ms | 32.3% bf16 MFU | 124714 tok/s step 1147/19560 | loss 4.379261 (+1.03z)| norm 0.4863 (+0.40z)| lr 8.00e-04 | 4191.13 ms | 32.2% bf16 MFU | 124733 
tok/s step 1148/19560 | loss 4.259780 (-1.06z)| norm 0.5246 (+0.80z)| lr 8.00e-04 | 4183.52 ms | 32.3% bf16 MFU | 124762 tok/s step 1149/19560 | loss 4.278924 (-0.71z)| norm 0.4631 (+0.21z)| lr 8.00e-04 | 4208.65 ms | 32.1% bf16 MFU | 124753 tok/s step 1150/19560 | loss 4.225072 (-1.67z)| norm 0.3650 (-0.78z)| lr 8.00e-04 | 4163.50 ms | 32.4% bf16 MFU | 124812 tok/s step 1151/19560 | loss 4.346380 (+0.57z)| norm 0.3589 (-0.98z)| lr 8.00e-04 | 4208.25 ms | 32.1% bf16 MFU | 124800 tok/s step 1152/19560 | loss 4.212983 (-1.88z)| norm 0.3596 (-0.99z)| lr 8.00e-04 | 4254.94 ms | 31.7% bf16 MFU | 124721 tok/s step 1153/19560 | loss 4.217590 (-1.76z)| norm 0.3362 (-1.30z)| lr 8.00e-04 | 4193.63 ms | 32.2% bf16 MFU | 124736 tok/s step 1154/19560 | loss 4.243108 (-1.27z)| norm 0.3139 (-1.60z)| lr 8.00e-04 | 4276.50 ms | 31.6% bf16 MFU | 124629 tok/s step 1155/19560 | loss 4.198038 (-2.07z)| norm 0.3082 (-1.66z)| lr 8.00e-04 | 4177.32 ms | 32.3% bf16 MFU | 124673 tok/s step 1156/19560 | loss 4.241160 (-1.26z)| norm 0.3260 (-1.39z)| lr 8.00e-04 | 4385.17 ms | 30.8% bf16 MFU | 124418 tok/s step 1157/19560 | loss 4.254609 (-1.00z)| norm 0.3303 (-1.31z)| lr 8.00e-04 | 4177.11 ms | 32.3% bf16 MFU | 124472 tok/s step 1158/19560 | loss 4.244774 (-1.16z)| norm 0.3752 (-0.66z)| lr 8.00e-04 | 4173.28 ms | 32.4% bf16 MFU | 124530 tok/s step 1159/19560 | loss 4.260534 (-0.85z)| norm 0.4072 (-0.19z)| lr 8.00e-04 | 4178.59 ms | 32.3% bf16 MFU | 124577 tok/s step 1160/19560 | loss 4.170393 (-2.47z)| norm 0.3791 (-0.59z)| lr 8.00e-04 | 4293.39 ms | 31.4% bf16 MFU | 124454 tok/s step 1161/19560 | loss 4.215888 (-1.60z)| norm 0.3774 (-0.61z)| lr 8.00e-04 | 4174.15 ms | 32.3% bf16 MFU | 124512 tok/s step 1162/19560 | loss 4.192233 (-2.01z)| norm 0.4113 (-0.13z)| lr 8.00e-04 | 4181.35 ms | 32.3% bf16 MFU | 124555 tok/s step 1163/19560 | loss 4.239342 (-1.13z)| norm 0.3760 (-0.62z)| lr 8.00e-04 | 4193.51 ms | 32.2% bf16 MFU | 124579 tok/s step 1164/19560 | loss 4.197912 (-1.88z)| norm 0.4132 
(-0.09z)| lr 8.00e-04 | 4183.11 ms | 32.3% bf16 MFU | 124617 tok/s step 1165/19560 | loss 4.243619 (-1.01z)| norm 0.4099 (-0.15z)| lr 8.00e-04 | 4176.63 ms | 32.3% bf16 MFU | 124662 tok/s step 1166/19560 | loss 4.225112 (-1.33z)| norm 0.3839 (-0.52z)| lr 8.00e-04 | 4216.74 ms | 32.0% bf16 MFU | 124646 tok/s step 1167/19560 | loss 4.226384 (-1.30z)| norm 0.3419 (-1.12z)| lr 8.00e-04 | 4187.73 ms | 32.2% bf16 MFU | 124673 tok/s step 1168/19560 | loss 4.202873 (-1.71z)| norm 0.3431 (-1.09z)| lr 8.00e-04 | 4241.36 ms | 31.8% bf16 MFU | 124620 tok/s step 1169/19560 | loss 4.270224 (-0.44z)| norm 0.3764 (-0.62z)| lr 8.00e-04 | 4158.33 ms | 32.5% bf16 MFU | 124693 tok/s step 1170/19560 | loss 4.272802 (-0.38z)| norm 0.3838 (-0.50z)| lr 8.00e-04 | 4168.76 ms | 32.4% bf16 MFU | 124747 tok/s step 1171/19560 | loss 4.195296 (-1.81z)| norm 0.4010 (-0.25z)| lr 8.00e-04 | 4237.42 ms | 31.9% bf16 MFU | 124696 tok/s step 1172/19560 | loss 4.182362 (-2.01z)| norm 0.4171 (-0.02z)| lr 8.00e-04 | 4208.32 ms | 32.1% bf16 MFU | 124690 tok/s step 1173/19560 | loss 4.189590 (-1.84z)| norm 0.4594 (+0.58z)| lr 8.00e-04 | 4163.10 ms | 32.4% bf16 MFU | 124753 tok/s step 1174/19560 | loss 4.244050 (-0.83z)| norm 0.4472 (+0.40z)| lr 8.00e-04 | 4174.44 ms | 32.3% bf16 MFU | 124795 tok/s step 1175/19560 | loss 4.255573 (-0.60z)| norm 0.4234 (+0.06z)| lr 8.00e-04 | 4305.60 ms | 31.4% bf16 MFU | 124644 tok/s step 1176/19560 | loss 4.252986 (-0.64z)| norm 0.3979 (-0.30z)| lr 8.00e-04 | 4174.01 ms | 32.3% bf16 MFU | 124692 tok/s step 1177/19560 | loss 4.218166 (-1.26z)| norm 0.3817 (-0.53z)| lr 8.00e-04 | 4180.50 ms | 32.3% bf16 MFU | 124728 tok/s step 1178/19560 | loss 4.262199 (-0.44z)| norm 0.3973 (-0.31z)| lr 8.00e-04 | 4239.25 ms | 31.8% bf16 MFU | 124675 tok/s step 1179/19560 | loss 4.196988 (-1.62z)| norm 0.4036 (-0.21z)| lr 8.00e-04 | 4185.95 ms | 32.3% bf16 MFU | 124704 tok/s step 1180/19560 | loss 4.214650 (-1.28z)| norm 0.4173 (-0.01z)| lr 8.00e-04 | 4186.19 ms | 32.3% bf16 MFU | 124731 
tok/s step 1181/19560 | loss 4.285134 (+0.01z)| norm 0.3567 (-0.88z)| lr 8.00e-04 | 4165.90 ms | 32.4% bf16 MFU | 124787 tok/s step 1182/19560 | loss 4.247867 (-0.68z)| norm 0.3527 (-0.92z)| lr 8.00e-04 | 4188.17 ms | 32.2% bf16 MFU | 124807 tok/s step 1183/19560 | loss 4.246960 (-0.68z)| norm 0.3934 (-0.35z)| lr 8.00e-04 | 4175.06 ms | 32.3% bf16 MFU | 124845 tok/s step 1184/19560 | loss 4.171141 (-2.03z)| norm 0.3816 (-0.52z)| lr 8.00e-04 | 4161.58 ms | 32.4% bf16 MFU | 124902 tok/s step 1185/19560 | loss 4.224857 (-1.05z)| norm 0.4120 (-0.09z)| lr 8.00e-04 | 4181.19 ms | 32.3% bf16 MFU | 124927 tok/s step 1186/19560 | loss 4.314653 (+0.60z)| norm 0.3803 (-0.54z)| lr 8.00e-04 | 4244.03 ms | 31.8% bf16 MFU | 124857 tok/s step 1187/19560 | loss 4.208823 (-1.33z)| norm 0.4154 (-0.02z)| lr 8.00e-04 | 4221.44 ms | 32.0% bf16 MFU | 124824 tok/s step 1188/19560 | loss 4.259151 (-0.40z)| norm 0.4105 (-0.07z)| lr 8.00e-04 | 4169.69 ms | 32.4% bf16 MFU | 124870 tok/s step 1189/19560 | loss 4.226488 (-0.99z)| norm 0.4096 (-0.07z)| lr 8.00e-04 | 4170.67 ms | 32.4% bf16 MFU | 124912 tok/s step 1190/19560 | loss 4.192670 (-1.57z)| norm 0.4378 (+0.41z)| lr 8.00e-04 | 4177.93 ms | 32.3% bf16 MFU | 124941 tok/s step 1191/19560 | loss 4.138741 (-2.47z)| norm 0.3643 (-0.78z)| lr 8.00e-04 | 4192.25 ms | 32.2% bf16 MFU | 124947 tok/s step 1192/19560 | loss 4.271512 (-0.11z)| norm 0.3355 (-1.30z)| lr 8.00e-04 | 4170.15 ms | 32.4% bf16 MFU | 124985 tok/s step 1193/19560 | loss 4.211035 (-1.17z)| norm 0.3720 (-0.63z)| lr 8.00e-04 | 4162.22 ms | 32.4% bf16 MFU | 125034 tok/s step 1194/19560 | loss 4.176602 (-1.74z)| norm 0.4188 (+0.21z)| lr 8.00e-04 | 4170.65 ms | 32.4% bf16 MFU | 125068 tok/s step 1195/19560 | loss 4.197371 (-1.36z)| norm 0.4117 (+0.08z)| lr 8.00e-04 | 4182.93 ms | 32.3% bf16 MFU | 125082 tok/s step 1196/19560 | loss 4.178467 (-1.66z)| norm 0.4085 (+0.01z)| lr 8.00e-04 | 4179.91 ms | 32.3% bf16 MFU | 125099 tok/s step 1197/19560 | loss 4.248907 (-0.41z)| norm 0.3896 
(-0.34z)| lr 8.00e-04 | 4179.36 ms | 32.3% bf16 MFU | 125116 tok/s step 1198/19560 | loss 4.173872 (-1.73z)| norm 0.3333 (-1.35z)| lr 8.00e-04 | 4248.96 ms | 31.8% bf16 MFU | 125030 tok/s step 1199/19560 | loss 4.220258 (-0.89z)| norm 0.3096 (-1.75z)| lr 8.00e-04 | 4217.40 ms | 32.0% bf16 MFU | 124994 tok/s step 1200/19560 | loss 4.232314 (-0.66z)| norm 0.3036 (-1.83z)| lr 8.00e-04 | 4172.52 ms | 32.4% bf16 MFU | 125027 tok/s step 1201/19560 | loss 4.340474 (+1.31z)| norm 0.3084 (-1.70z)| lr 8.00e-04 | 4329.94 ms | 31.2% bf16 MFU | 124830 tok/s step 1202/19560 | loss 4.181269 (-1.58z)| norm 0.3183 (-1.50z)| lr 8.00e-04 | 4225.03 ms | 32.0% bf16 MFU | 124793 tok/s step 1203/19560 | loss 4.257155 (-0.18z)| norm 0.3487 (-0.96z)| lr 8.00e-04 | 4178.15 ms | 32.3% bf16 MFU | 124828 tok/s step 1204/19560 | loss 4.228341 (-0.70z)| norm 0.3595 (-0.76z)| lr 8.00e-04 | 4166.03 ms | 32.4% bf16 MFU | 124879 tok/s step 1205/19560 | loss 4.186160 (-1.46z)| norm 0.3971 (-0.10z)| lr 8.00e-04 | 4198.14 ms | 32.2% bf16 MFU | 124879 tok/s step 1206/19560 | loss 4.172120 (-1.69z)| norm 0.4079 (+0.08z)| lr 8.00e-04 | 4168.58 ms | 32.4% bf16 MFU | 124924 tok/s step 1207/19560 | loss 4.238551 (-0.46z)| norm 0.3830 (-0.35z)| lr 8.00e-04 | 4167.37 ms | 32.4% bf16 MFU | 124968 tok/s step 1208/19560 | loss 4.158473 (-1.90z)| norm 0.3866 (-0.29z)| lr 8.00e-04 | 4195.14 ms | 32.2% bf16 MFU | 124968 tok/s step 1209/19560 | loss 4.201203 (-1.10z)| norm 0.4111 (+0.14z)| lr 8.00e-04 | 4180.49 ms | 32.3% bf16 MFU | 124991 tok/s step 1210/19560 | loss 4.167830 (-1.68z)| norm 0.4603 (+1.00z)| lr 8.00e-04 | 4157.80 ms | 32.5% bf16 MFU | 125046 tok/s step 1211/19560 | loss 4.231963 (-0.49z)| norm 0.4790 (+1.37z)| lr 8.00e-04 | 4165.82 ms | 32.4% bf16 MFU | 125086 tok/s step 1212/19560 | loss 4.209494 (-0.90z)| norm 0.4258 (+0.42z)| lr 8.00e-04 | 4433.29 ms | 30.5% bf16 MFU | 124745 tok/s step 1213/19560 | loss 4.129832 (-2.34z)| norm 0.3731 (-0.52z)| lr 8.00e-04 | 4207.43 ms | 32.1% bf16 MFU | 124738 
tok/s step 1214/19560 | loss 4.243899 (-0.21z)| norm 0.3451 (-1.00z)| lr 8.00e-04 | 4204.27 ms | 32.1% bf16 MFU | 124737 tok/s step 1215/19560 | loss 4.192153 (-1.16z)| norm 0.3155 (-1.50z)| lr 8.00e-04 | 4176.98 ms | 32.3% bf16 MFU | 124776 tok/s step 1216/19560 | loss 4.275758 (+0.40z)| norm 0.3314 (-1.20z)| lr 8.00e-04 | 4168.33 ms | 32.4% bf16 MFU | 124826 tok/s step 1217/19560 | loss 4.175341 (-1.45z)| norm 0.3466 (-0.93z)| lr 8.00e-04 | 4182.03 ms | 32.3% bf16 MFU | 124853 tok/s step 1218/19560 | loss 4.214921 (-0.70z)| norm 0.3436 (-0.98z)| lr 8.00e-04 | 4181.06 ms | 32.3% bf16 MFU | 124880 tok/s step 1219/19560 | loss 4.192554 (-1.11z)| norm 0.3161 (-1.50z)| lr 8.00e-04 | 4173.57 ms | 32.4% bf16 MFU | 124917 tok/s step 1220/19560 | loss 4.109816 (-2.59z)| norm 0.3423 (-1.01z)| lr 8.00e-04 | 4250.65 ms | 31.8% bf16 MFU | 124838 tok/s step 1221/19560 | loss 4.137898 (-2.05z)| norm 0.3393 (-1.05z)| lr 8.00e-04 | 4299.41 ms | 31.4% bf16 MFU | 124694 tok/s step 1222/19560 | loss 4.332202 (+1.57z)| norm 0.3570 (-0.70z)| lr 8.00e-04 | 4195.61 ms | 32.2% bf16 MFU | 124707 tok/s step 1223/19560 | loss 4.193335 (-1.00z)| norm 0.4178 (+0.50z)| lr 8.00e-04 | 4174.15 ms | 32.3% bf16 MFU | 124752 tok/s step 1224/19560 | loss 4.269424 (+0.44z)| norm 0.4708 (+1.53z)| lr 8.00e-04 | 4179.83 ms | 32.3% bf16 MFU | 124786 tok/s step 1225/19560 | loss 4.201569 (-0.83z)| norm 0.4154 (+0.45z)| lr 8.00e-04 | 4201.75 ms | 32.1% bf16 MFU | 124786 tok/s step 1226/19560 | loss 4.175070 (-1.31z)| norm 0.4279 (+0.69z)| lr 8.00e-04 | 4170.32 ms | 32.4% bf16 MFU | 124832 tok/s step 1227/19560 | loss 4.174118 (-1.31z)| norm 0.3665 (-0.51z)| lr 8.00e-04 | 4202.13 ms | 32.1% bf16 MFU | 124829 tok/s step 1228/19560 | loss 4.210599 (-0.61z)| norm 0.3792 (-0.26z)| lr 8.00e-04 | 9321.86 ms | 14.5% bf16 MFU | 121400 tok/s step 1229/19560 | loss 4.169018 (-1.38z)| norm 0.4560 (+1.23z)| lr 8.00e-04 | 4164.68 ms | 32.4% bf16 MFU | 121624 tok/s step 1230/19560 | loss 4.232615 (-0.17z)| norm 0.4492 
(+1.08z)| lr 8.00e-04 | 4205.44 ms | 32.1% bf16 MFU | 121776 tok/s step 1231/19560 | loss 4.096223 (-2.66z)| norm 0.3805 (-0.26z)| lr 8.00e-04 | 4149.58 ms | 32.5% bf16 MFU | 122005 tok/s step 1232/19560 | loss 4.130832 (-1.97z)| norm 0.3248 (-1.33z)| lr 8.00e-04 | 4241.48 ms | 31.8% bf16 MFU | 122085 tok/s step 1233/19560 | loss 4.162564 (-1.37z)| norm 0.3304 (-1.20z)| lr 8.00e-04 | 4160.49 ms | 32.5% bf16 MFU | 122282 tok/s step 1234/19560 | loss 4.269853 (+0.58z)| norm 0.3310 (-1.17z)| lr 8.00e-04 | 4164.00 ms | 32.4% bf16 MFU | 122463 tok/s step 1235/19560 | loss 4.146358 (-1.63z)| norm 0.3238 (-1.29z)| lr 8.00e-04 | 4157.69 ms | 32.5% bf16 MFU | 122645 tok/s step 1236/19560 | loss 4.164692 (-1.29z)| norm 0.3460 (-0.85z)| lr 8.00e-04 | 4160.70 ms | 32.5% bf16 MFU | 122813 tok/s step 1237/19560 | loss 4.361505 (+2.28z)| norm 0.3459 (-0.84z)| lr 8.00e-04 | 4268.15 ms | 31.6% bf16 MFU | 122814 tok/s step 1238/19560 | loss 4.146337 (-1.60z)| norm 0.3660 (-0.44z)| lr 8.00e-04 | 4159.51 ms | 32.5% bf16 MFU | 122976 tok/s step 1239/19560 | loss 4.209693 (-0.45z)| norm 0.4451 (+1.07z)| lr 8.00e-04 | 4235.33 ms | 31.9% bf16 MFU | 123017 tok/s step 1240/19560 | loss 4.186865 (-0.85z)| norm 0.4426 (+1.02z)| lr 8.00e-04 | 4172.11 ms | 32.4% bf16 MFU | 123149 tok/s step 1241/19560 | loss 4.174300 (-1.07z)| norm 0.3914 (+0.04z)| lr 8.00e-04 | 4312.28 ms | 31.3% bf16 MFU | 123071 tok/s step 1242/19560 | loss 4.147813 (-1.53z)| norm 0.3552 (-0.64z)| lr 8.00e-04 | 4179.21 ms | 32.3% bf16 MFU | 123190 tok/s step 1243/19560 | loss 4.128199 (-1.85z)| norm 0.3172 (-1.35z)| lr 8.00e-04 | 4174.31 ms | 32.3% bf16 MFU | 123310 tok/s step 1244/19560 | loss 4.193850 (-0.65z)| norm 0.2949 (-1.74z)| lr 8.00e-04 | 4176.49 ms | 32.3% bf16 MFU | 123421 tok/s step 1245/19560 | loss 4.160767 (-1.23z)| norm 0.2749 (-2.07z)| lr 8.00e-04 | 4162.81 ms | 32.4% bf16 MFU | 123547 tok/s step 1246/19560 | loss 4.164358 (-1.15z)| norm 0.2692 (-2.13z)| lr 8.00e-04 | 4182.22 ms | 32.3% bf16 MFU | 123638 
tok/s step 1247/19560 | loss 4.161415 (-1.18z)| norm 0.2942 (-1.64z)| lr 8.00e-04 | 4176.76 ms | 32.3% bf16 MFU | 123732 tok/s step 1248/19560 | loss 4.119873 (-1.88z)| norm 0.3418 (-0.78z)| lr 8.00e-04 | 4234.83 ms | 31.9% bf16 MFU | 123736 tok/s step 1249/19560 | loss 4.235349 (+0.17z)| norm 0.4067 (+0.38z)| lr 8.00e-04 | 4178.95 ms | 32.3% bf16 MFU | 123822 tok/s step 1250/19560 | loss 4.158347 (-1.18z)| norm 0.4495 (+1.15z)| lr 8.00e-04 | 4649.94 ms | 29.0% bf16 MFU | 123269 tok/s val loss 4.159478
HellaSwag: 2595/10042 = 0.258415 step 1251/19560 | loss 4.163322 (-1.08z)| norm 0.4165 (+0.58z)| lr 8.00e-04 | 4702.16 ms | 28.7% bf16 MFU | 122680 tok/s step 1252/19560 | loss 4.133140 (-1.58z)| norm 0.4605 (+1.44z)| lr 8.00e-04 | 4434.36 ms | 30.4% bf16 MFU | 122458 tok/s step 1253/19560 | loss 4.122390 (-1.74z)| norm 0.4157 (+0.60z)| lr 8.00e-04 | 5059.09 ms | 26.7% bf16 MFU | 121517 tok/s step 1254/19560 | loss 4.204749 (-0.29z)| norm 0.3209 (-1.18z)| lr 8.00e-04 | 4363.82 ms | 30.9% bf16 MFU | 121448 tok/s step 1255/19560 | loss 4.119743 (-1.74z)| norm 0.3503 (-0.61z)| lr 8.00e-04 | 4339.90 ms | 31.1% bf16 MFU | 121416 tok/s step 1256/19560 | loss 4.095457 (-2.12z)| norm 0.3099 (-1.36z)| lr 8.00e-04 | 4282.36 ms | 31.5% bf16 MFU | 121467 tok/s step 1257/19560 | loss 4.104014 (-1.92z)| norm 0.2886 (-1.73z)| lr 8.00e-04 | 4201.72 ms | 32.1% bf16 MFU | 121632 tok/s step 1258/19560 | loss 4.090152 (-2.13z)| norm 0.2758 (-1.94z)| lr 8.00e-04 | 4221.12 ms | 32.0% bf16 MFU | 121761 tok/s step 1259/19560 | loss 4.054205 (-2.66z)| norm 0.3098 (-1.29z)| lr 8.00e-04 | 4306.05 ms | 31.4% bf16 MFU | 121761 tok/s step 1260/19560 | loss 4.119289 (-1.54z)| norm 0.3966 (+0.35z)| lr 8.00e-04 | 4309.68 ms | 31.3% bf16 MFU | 121755 tok/s step 1261/19560 | loss 4.091072 (-1.97z)| norm 0.4194 (+0.81z)| lr 8.00e-04 | 4160.98 ms | 32.4% bf16 MFU | 121968 tok/s step 1262/19560 | loss 4.199053 (-0.18z)| norm 0.4132 (+0.67z)| lr 8.00e-04 | 4221.51 ms | 32.0% bf16 MFU | 122079 tok/s step 1263/19560 | loss 4.122725 (-1.42z)| norm 0.3963 (+0.34z)| lr 8.00e-04 |
4252.72 ms | 31.7% bf16 MFU | 122139 tok/s step 1264/19560 | loss 4.117171 (-1.49z)| norm 0.3858 (+0.13z)| lr 8.00e-04 | 4169.37 ms | 32.4% bf16 MFU | 122320 tok/s step 1265/19560 | loss 4.132019 (-1.24z)| norm 0.3507 (-0.56z)| lr 8.00e-04 | 4276.73 ms | 31.6% bf16 MFU | 122333 tok/s step 1266/19560 | loss 4.088474 (-1.93z)| norm 0.3214 (-1.12z)| lr 8.00e-04 | 4167.83 ms | 32.4% bf16 MFU | 122506 tok/s step 1267/19560 | loss 4.119057 (-1.40z)| norm 0.3321 (-0.90z)| lr 8.00e-04 | 4168.91 ms | 32.4% bf16 MFU | 122669 tok/s step 1268/19560 | loss 4.144248 (-0.97z)| norm 0.3964 (+0.34z)| lr 8.00e-04 | 4227.44 ms | 31.9% bf16 MFU | 122737 tok/s step 1269/19560 | loss 4.090381 (-1.82z)| norm 0.4615 (+1.58z)| lr 8.00e-04 | 4449.20 ms | 30.3% bf16 MFU | 122492 tok/s step 1270/19560 | loss 4.134204 (-1.09z)| norm 0.4238 (+0.84z)| lr 8.00e-04 | 4166.64 ms | 32.4% bf16 MFU | 122659 tok/s step 1271/19560 | loss 4.113396 (-1.40z)| norm 0.3870 (+0.13z)| lr 8.00e-04 | 4243.90 ms | 31.8% bf16 MFU | 122703 tok/s step 1272/19560 | loss 4.068115 (-2.10z)| norm 0.4005 (+0.41z)| lr 8.00e-04 | 4197.52 ms | 32.2% bf16 MFU | 122813 tok/s step 1273/19560 | loss 4.186334 (-0.18z)| norm 0.4098 (+0.61z)| lr 8.00e-04 | 4166.73 ms | 32.4% bf16 MFU | 122963 tok/s step 1274/19560 | loss 4.171010 (-0.42z)| norm 0.4313 (+1.06z)| lr 8.00e-04 | 4172.25 ms | 32.4% bf16 MFU | 123098 tok/s step 1275/19560 | loss 4.092526 (-1.70z)| norm 0.3696 (-0.17z)| lr 8.00e-04 | 4210.85 ms | 32.1% bf16 MFU | 123169 tok/s step 1276/19560 | loss 4.191216 (-0.04z)| norm 0.3433 (-0.70z)| lr 8.00e-04 | 4178.91 ms | 32.3% bf16 MFU | 123283 tok/s step 1277/19560 | loss 4.102211 (-1.52z)| norm 0.3373 (-0.82z)| lr 8.00e-04 | 4180.19 ms | 32.3% bf16 MFU | 123390 tok/s step 1278/19560 | loss 4.137390 (-0.91z)| norm 0.3472 (-0.60z)| lr 8.00e-04 | 4166.20 ms | 32.4% bf16 MFU | 123513 tok/s step 1279/19560 | loss 4.127734 (-1.07z)| norm 0.3276 (-1.01z)| lr 8.00e-04 | 4165.29 ms | 32.4% bf16 MFU | 123631 tok/s step 1280/19560 | 
loss 4.065086 (-2.09z)| norm 0.3094 (-1.38z)| lr 8.00e-04 | 4179.07 ms | 32.3% bf16 MFU | 123722 tok/s step 1281/19560 | loss 4.134064 (-0.91z)| norm 0.2919 (-1.73z)| lr 8.00e-04 | 4174.68 ms | 32.3% bf16 MFU | 123815 tok/s step 1282/19560 | loss 4.202341 (+0.24z)| norm 0.6090 (+4.50z)| lr 8.00e-04 | 4173.71 ms | 32.3% bf16 MFU | 123905 tok/s step 1283/19560 | loss 4.160864 (-0.45z)| norm 0.4170 (+0.77z)| lr 8.00e-04 | 4180.75 ms | 32.3% bf16 MFU | 123980 tok/s step 1284/19560 | loss 4.105962 (-1.35z)| norm 0.4841 (+2.02z)| lr 8.00e-04 | 4181.11 ms | 32.3% bf16 MFU | 124051 tok/s step 1285/19560 | loss 4.126081 (-1.00z)| norm 0.4796 (+1.89z)| lr 8.00e-04 | 4168.01 ms | 32.4% bf16 MFU | 124138 tok/s step 1286/19560 | loss 4.117987 (-1.12z)| norm 0.4061 (+0.49z)| lr 8.00e-04 | 4174.20 ms | 32.3% bf16 MFU | 124211 tok/s step 1287/19560 | loss 4.081904 (-1.69z)| norm 0.4265 (+0.88z)| lr 8.00e-04 | 4159.89 ms | 32.5% bf16 MFU | 124302 tok/s step 1288/19560 | loss 4.091988 (-1.50z)| norm 0.3919 (+0.22z)| lr 8.00e-04 | 4162.49 ms | 32.4% bf16 MFU | 124385 tok/s step 1289/19560 | loss 4.117147 (-1.07z)| norm 0.3833 (+0.06z)| lr 8.00e-04 | 4227.92 ms | 31.9% bf16 MFU | 124366 tok/s step 1290/19560 | loss 4.166511 (-0.25z)| norm 0.4280 (+0.90z)| lr 8.00e-04 | 4273.16 ms | 31.6% bf16 MFU | 124282 tok/s step 1291/19560 | loss 4.125624 (-0.91z)| norm 0.4697 (+1.65z)| lr 8.00e-04 | 4163.94 ms | 32.4% bf16 MFU | 124364 tok/s step 1292/19560 | loss 4.188930 (+0.14z)| norm 0.4435 (+1.15z)| lr 8.00e-04 | 4162.26 ms | 32.4% bf16 MFU | 124444 tok/s step 1293/19560 | loss 4.078534 (-1.65z)| norm 0.4273 (+0.85z)| lr 8.00e-04 | 4242.34 ms | 31.8% bf16 MFU | 124401 tok/s step 1294/19560 | loss 4.159680 (-0.31z)| norm 0.4229 (+0.76z)| lr 8.00e-04 | 4185.20 ms | 32.3% bf16 MFU | 124444 tok/s step 1295/19560 | loss 4.140093 (-0.63z)| norm 0.3702 (-0.22z)| lr 8.00e-04 | 4169.56 ms | 32.4% bf16 MFU | 124509 tok/s step 1296/19560 | loss 4.141901 (-0.59z)| norm 0.3301 (-0.96z)| lr 8.00e-04 | 
4306.52 ms | 31.4% bf16 MFU | 124371 tok/s step 1297/19560 | loss 4.066523 (-1.79z)| norm 0.3280 (-0.99z)| lr 7.99e-04 | 4169.25 ms | 32.4% bf16 MFU | 124440 tok/s step 1298/19560 | loss 4.155439 (-0.33z)| norm 0.2827 (-1.79z)| lr 7.99e-04 | 4172.85 ms | 32.4% bf16 MFU | 124500 tok/s step 1299/19560 | loss 4.130477 (-0.73z)| norm 0.3265 (-0.98z)| lr 7.99e-04 | 4173.75 ms | 32.3% bf16 MFU | 124556 tok/s step 1300/19560 | loss 4.134436 (-0.66z)| norm 0.2731 (-1.90z)| lr 7.99e-04 | 4169.80 ms | 32.4% bf16 MFU | 124615 tok/s step 1301/19560 | loss 4.098567 (-1.23z)| norm 0.2773 (-1.80z)| lr 7.99e-04 | 4175.68 ms | 32.3% bf16 MFU | 124662 tok/s step 1302/19560 | loss 4.108863 (-1.04z)| norm 0.2895 (-1.55z)| lr 7.99e-04 | 4305.07 ms | 31.4% bf16 MFU | 124518 tok/s step 1303/19560 | loss 4.090091 (-1.33z)| norm 0.2888 (-1.53z)| lr 7.99e-04 | 4169.49 ms | 32.4% bf16 MFU | 124579 tok/s step 1304/19560 | loss 4.076824 (-1.52z)| norm 0.3025 (-1.27z)| lr 7.99e-04 | 4169.69 ms | 32.4% bf16 MFU | 124637 tok/s step 1305/19560 | loss 4.225271 (+0.91z)| norm 0.3036 (-1.23z)| lr 7.99e-04 | 4164.44 ms | 32.4% bf16 MFU | 124700 tok/s step 1306/19560 | loss 4.123162 (-0.75z)| norm 0.3495 (-0.42z)| lr 7.99e-04 | 4172.27 ms | 32.4% bf16 MFU | 124748 tok/s step 1307/19560 | loss 4.147937 (-0.34z)| norm 0.3174 (-0.97z)| lr 7.99e-04 | 4175.93 ms | 32.3% bf16 MFU | 124788 tok/s step 1308/19560 | loss 4.119172 (-0.80z)| norm 0.3180 (-0.95z)| lr 7.99e-04 | 4180.49 ms | 32.3% bf16 MFU | 124819 tok/s step 1309/19560 | loss 4.115176 (-0.85z)| norm 0.3519 (-0.35z)| lr 7.99e-04 | 4156.87 ms | 32.5% bf16 MFU | 124885 tok/s step 1310/19560 | loss 4.082164 (-1.38z)| norm 0.4105 (+0.66z)| lr 7.99e-04 | 4175.25 ms | 32.3% bf16 MFU | 124919 tok/s step 1311/19560 | loss 4.074443 (-1.49z)| norm 0.3923 (+0.34z)| lr 7.99e-04 | 4161.94 ms | 32.4% bf16 MFU | 124972 tok/s step 1312/19560 | loss 4.116062 (-0.78z)| norm 0.3798 (+0.12z)| lr 7.99e-04 | 4162.93 ms | 32.4% bf16 MFU | 125020 tok/s step 1313/19560 | 
loss 4.085643 (-1.27z)| norm 0.3473 (-0.43z)| lr 7.99e-04 | 4173.55 ms | 32.4% bf16 MFU | 125050 tok/s step 1314/19560 | loss 4.055845 (-1.75z)| norm 0.3765 (+0.08z)| lr 7.99e-04 | 4185.66 ms | 32.3% bf16 MFU | 125061 tok/s step 1315/19560 | loss 4.106004 (-0.90z)| norm 0.3353 (-0.63z)| lr 7.99e-04 | 4271.11 ms | 31.6% bf16 MFU | 124945 tok/s step 1316/19560 | loss 4.119328 (-0.66z)| norm 0.2900 (-1.40z)| lr 7.99e-04 | 4165.03 ms | 32.4% bf16 MFU | 124992 tok/s step 1317/19560 | loss 4.074534 (-1.40z)| norm 0.3271 (-0.74z)| lr 7.99e-04 | 4216.25 ms | 32.0% bf16 MFU | 124960 tok/s step 1318/19560 | loss 4.096195 (-1.01z)| norm 0.3707 (+0.02z)| lr 7.99e-04 | 4166.35 ms | 32.4% bf16 MFU | 125004 tok/s step 1319/19560 | loss 4.094936 (-1.03z)| norm 0.4541 (+1.45z)| lr 7.99e-04 | 4158.60 ms | 32.5% bf16 MFU | 125057 tok/s step 1320/19560 | loss 4.119869 (-0.60z)| norm 0.4362 (+1.12z)| lr 7.99e-04 | 4242.31 ms | 31.8% bf16 MFU | 124984 tok/s step 1321/19560 | loss 4.094947 (-1.00z)| norm 0.4062 (+0.60z)| lr 7.99e-04 | 4160.85 ms | 32.4% bf16 MFU | 125035 tok/s step 1322/19560 | loss 4.117206 (-0.62z)| norm 0.3280 (-0.73z)| lr 7.99e-04 | 4298.45 ms | 31.4% bf16 MFU | 124882 tok/s step 1323/19560 | loss 4.100642 (-0.88z)| norm 0.3293 (-0.70z)| lr 7.99e-04 | 4160.57 ms | 32.5% bf16 MFU | 124938 tok/s step 1324/19560 | loss 4.130460 (-0.37z)| norm 0.2865 (-1.41z)| lr 7.99e-04 | 4253.05 ms | 31.7% bf16 MFU | 124855 tok/s step 1325/19560 | loss 4.076741 (-1.27z)| norm 0.2869 (-1.38z)| lr 7.99e-04 | 4232.14 ms | 31.9% bf16 MFU | 124806 tok/s step 1326/19560 | loss 4.093752 (-0.96z)| norm 0.2899 (-1.31z)| lr 7.99e-04 | 4251.81 ms | 31.8% bf16 MFU | 124731 tok/s step 1327/19560 | loss 4.064318 (-1.44z)| norm 0.3013 (-1.12z)| lr 7.99e-04 | 4267.79 ms | 31.6% bf16 MFU | 124637 tok/s step 1328/19560 | loss 4.127846 (-0.35z)| norm 0.3323 (-0.60z)| lr 7.99e-04 | 4189.09 ms | 32.2% bf16 MFU | 124663 tok/s step 1329/19560 | loss 4.067159 (-1.40z)| norm 0.3691 (+0.01z)| lr 7.99e-04 | 
4247.50 ms | 31.8% bf16 MFU | 124602 tok/s step 1330/19560 | loss 4.070539 (-1.32z)| norm 0.3766 (+0.13z)| lr 7.99e-04 | 4185.01 ms | 32.3% bf16 MFU | 124636 tok/s step 1331/19560 | loss 4.105696 (-0.69z)| norm 0.3591 (-0.17z)| lr 7.99e-04 | 4204.62 ms | 32.1% bf16 MFU | 124638 tok/s step 1332/19560 | loss 4.156992 (+0.24z)| norm 0.3386 (-0.51z)| lr 7.99e-04 | 4158.89 ms | 32.5% bf16 MFU | 124710 tok/s step 1333/19560 | loss 4.087416 (-1.00z)| norm 0.3447 (-0.40z)| lr 7.99e-04 | 4159.68 ms | 32.5% bf16 MFU | 124776 tok/s step 1334/19560 | loss 4.063401 (-1.41z)| norm 0.3090 (-0.99z)| lr 7.99e-04 | 4157.19 ms | 32.5% bf16 MFU | 124843 tok/s step 1335/19560 | loss 4.096237 (-0.81z)| norm 0.2938 (-1.23z)| lr 7.99e-04 | 4163.36 ms | 32.4% bf16 MFU | 124898 tok/s step 1336/19560 | loss 4.071991 (-1.22z)| norm 0.2860 (-1.34z)| lr 7.99e-04 | 4158.67 ms | 32.5% bf16 MFU | 124956 tok/s step 1337/19560 | loss 4.080623 (-1.05z)| norm 0.3304 (-0.59z)| lr 7.99e-04 | 4195.36 ms | 32.2% bf16 MFU | 124957 tok/s step 1338/19560 | loss 4.094241 (-0.80z)| norm 0.3690 (+0.07z)| lr 7.99e-04 | 4190.71 ms | 32.2% bf16 MFU | 124964 tok/s step 1339/19560 | loss 4.043670 (-1.68z)| norm 0.3600 (-0.06z)| lr 7.99e-04 | 4216.84 ms | 32.0% bf16 MFU | 124933 tok/s step 1340/19560 | loss 4.061303 (-1.34z)| norm 0.3215 (-0.72z)| lr 7.99e-04 | 4159.97 ms | 32.5% bf16 MFU | 124988 tok/s step 1341/19560 | loss 4.043741 (-1.63z)| norm 0.3048 (-0.99z)| lr 7.99e-04 | 4185.37 ms | 32.3% bf16 MFU | 125002 tok/s step 1342/19560 | loss 4.069592 (-1.15z)| norm 0.3030 (-1.01z)| lr 7.99e-04 | 4160.08 ms | 32.5% bf16 MFU | 125053 tok/s step 1343/19560 | loss 4.074003 (-1.06z)| norm 0.3203 (-0.72z)| lr 7.99e-04 | 4160.12 ms | 32.5% bf16 MFU | 125102 tok/s step 1344/19560 | loss 4.024020 (-1.94z)| norm 0.3690 (+0.11z)| lr 7.99e-04 | 4160.89 ms | 32.4% bf16 MFU | 125147 tok/s step 1345/19560 | loss 4.022422 (-1.92z)| norm 0.3676 (+0.09z)| lr 7.99e-04 | 4160.71 ms | 32.5% bf16 MFU | 125190 tok/s step 1346/19560 | 
loss 4.031627 (-1.73z)| norm 0.3410 (-0.37z)| lr 7.99e-04 | 4165.07 ms | 32.4% bf16 MFU | 125224 tok/s step 1347/19560 | loss 4.094713 (-0.59z)| norm 0.3885 (+0.44z)| lr 7.99e-04 | 4174.99 ms | 32.3% bf16 MFU | 125242 tok/s step 1348/19560 | loss 3.992127 (-2.36z)| norm 0.3653 (+0.03z)| lr 7.99e-04 | 4162.38 ms | 32.4% bf16 MFU | 125278 tok/s step 1349/19560 | loss 4.075552 (-0.89z)| norm 0.4111 (+0.81z)| lr 7.99e-04 | 4159.07 ms | 32.5% bf16 MFU | 125317 tok/s step 1350/19560 | loss 4.014891 (-1.98z)| norm 0.3617 (-0.04z)| lr 7.99e-04 | 4166.15 ms | 32.4% bf16 MFU | 125343 tok/s step 1351/19560 | loss 4.143590 (+0.37z)| norm 0.3247 (-0.66z)| lr 7.99e-04 | 4157.85 ms | 32.5% bf16 MFU | 125381 tok/s step 1352/19560 | loss 4.105357 (-0.32z)| norm 0.3625 (+0.00z)| lr 7.99e-04 | 4180.73 ms | 32.3% bf16 MFU | 125382 tok/s step 1353/19560 | loss 4.096082 (-0.48z)| norm 0.3752 (+0.23z)| lr 7.99e-04 | 4157.76 ms | 32.5% bf16 MFU | 125418 tok/s step 1354/19560 | loss 4.053195 (-1.27z)| norm 0.3416 (-0.35z)| lr 7.99e-04 | 4161.21 ms | 32.4% bf16 MFU | 125447 tok/s step 1355/19560 | loss 4.072276 (-0.89z)| norm 0.3262 (-0.61z)| lr 7.99e-04 | 4177.75 ms | 32.3% bf16 MFU | 125449 tok/s step 1356/19560 | loss 4.073448 (-0.86z)| norm 0.3364 (-0.43z)| lr 7.99e-04 | 4219.70 ms | 32.0% bf16 MFU | 125389 tok/s step 1357/19560 | loss 4.109357 (-0.17z)| norm 0.3949 (+0.61z)| lr 7.99e-04 | 4155.45 ms | 32.5% bf16 MFU | 125428 tok/s step 1358/19560 | loss 4.061934 (-1.06z)| norm 0.3875 (+0.50z)| lr 7.99e-04 | 4175.21 ms | 32.3% bf16 MFU | 125435 tok/s step 1359/19560 | loss 4.122841 (+0.12z)| norm 0.3785 (+0.34z)| lr 7.99e-04 | 4187.67 ms | 32.2% bf16 MFU | 125423 tok/s step 1360/19560 | loss 4.070827 (-0.88z)| norm 0.3621 (+0.04z)| lr 7.99e-04 | 4259.09 ms | 31.7% bf16 MFU | 125307 tok/s step 1361/19560 | loss 4.093307 (-0.44z)| norm 0.3150 (-0.80z)| lr 7.99e-04 | 4155.99 ms | 32.5% bf16 MFU | 125349 tok/s step 1362/19560 | loss 4.007284 (-2.11z)| norm 0.3080 (-0.92z)| lr 7.99e-04 | 
4161.88 ms | 32.4% bf16 MFU | 125381 tok/s step 1363/19560 | loss 4.095233 (-0.36z)| norm 0.3565 (-0.06z)| lr 7.99e-04 | 4224.45 ms | 32.0% bf16 MFU | 125317 tok/s step 1364/19560 | loss 4.120975 (+0.16z)| norm 0.3803 (+0.36z)| lr 7.99e-04 | 4159.88 ms | 32.5% bf16 MFU | 125353 tok/s step 1365/19560 | loss 4.084638 (-0.58z)| norm 0.4081 (+0.84z)| lr 7.99e-04 | 4165.45 ms | 32.4% bf16 MFU | 125379 tok/s step 1366/19560 | loss 4.124260 (+0.30z)| norm 0.3653 (+0.08z)| lr 7.99e-04 | 4252.12 ms | 31.8% bf16 MFU | 125275 tok/s step 1367/19560 | loss 3.995092 (-2.51z)| norm 0.3382 (-0.39z)| lr 7.99e-04 | 4162.82 ms | 32.4% bf16 MFU | 125308 tok/s step 1368/19560 | loss 4.054926 (-1.18z)| norm 0.3158 (-0.78z)| lr 7.99e-04 | 4163.84 ms | 32.4% bf16 MFU | 125338 tok/s step 1369/19560 | loss 4.055530 (-1.15z)| norm 0.3293 (-0.53z)| lr 7.99e-04 | 4163.06 ms | 32.4% bf16 MFU | 125368 tok/s step 1370/19560 | loss 4.022779 (-1.84z)| norm 0.3307 (-0.50z)| lr 7.99e-04 | 4160.10 ms | 32.5% bf16 MFU | 125401 tok/s step 1371/19560 | loss 4.122024 (+0.35z)| norm 0.2991 (-1.06z)| lr 7.99e-04 | 4160.06 ms | 32.5% bf16 MFU | 125433 tok/s step 1372/19560 | loss 4.072485 (-0.73z)| norm 0.2813 (-1.38z)| lr 7.99e-04 | 4159.53 ms | 32.5% bf16 MFU | 125463 tok/s step 1373/19560 | loss 4.167024 (+1.38z)| norm 0.3038 (-0.98z)| lr 7.99e-04 | 4161.12 ms | 32.4% bf16 MFU | 125490 tok/s step 1374/19560 | loss 4.056011 (-1.08z)| norm 0.2996 (-1.07z)| lr 7.99e-04 | 4177.89 ms | 32.3% bf16 MFU | 125490 tok/s step 1375/19560 | loss 4.087368 (-0.37z)| norm 0.3132 (-0.83z)| lr 7.99e-04 | 4174.90 ms | 32.3% bf16 MFU | 125495 tok/s step 1376/19560 | loss 3.986487 (-2.55z)| norm 0.3043 (-0.98z)| lr 7.99e-04 | 4159.86 ms | 32.5% bf16 MFU | 125522 tok/s step 1377/19560 | loss 4.083648 (-0.41z)| norm 0.2745 (-1.49z)| lr 7.99e-04 | 4174.49 ms | 32.3% bf16 MFU | 125525 tok/s step 1378/19560 | loss 4.021753 (-1.77z)| norm 0.3101 (-0.84z)| lr 7.99e-04 | 4155.54 ms | 32.5% bf16 MFU | 125557 tok/s step 1379/19560 | 
loss 4.055870 (-0.99z)| norm 0.2884 (-1.21z)| lr 7.99e-04 | 4171.05 ms | 32.4% bf16 MFU | 125564 tok/s step 1380/19560 | loss 4.151251 (+1.16z)| norm 0.2551 (-1.79z)| lr 7.99e-04 | 4164.91 ms | 32.4% bf16 MFU | 125580 tok/s step 1381/19560 | loss 4.102789 (+0.07z)| norm 0.2765 (-1.38z)| lr 7.99e-04 | 4158.17 ms | 32.5% bf16 MFU | 125606 tok/s step 1382/19560 | loss 4.012736 (-1.94z)| norm 0.3277 (-0.45z)| lr 7.99e-04 | 4191.84 ms | 32.2% bf16 MFU | 125579 tok/s step 1383/19560 | loss 4.107438 (+0.21z)| norm 0.4387 (+1.54z)| lr 7.99e-04 | 4174.23 ms | 32.3% bf16 MFU | 125580 tok/s step 1384/19560 | loss 4.087233 (-0.24z)| norm 0.4299 (+1.36z)| lr 7.99e-04 | 4156.78 ms | 32.5% bf16 MFU | 125607 tok/s step 1385/19560 | loss 4.065032 (-0.74z)| norm 0.4006 (+0.82z)| lr 7.99e-04 | 4157.68 ms | 32.5% bf16 MFU | 125632 tok/s step 1386/19560 | loss 4.061369 (-0.82z)| norm 0.3978 (+0.76z)| lr 7.99e-04 | 4159.84 ms | 32.5% bf16 MFU | 125652 tok/s step 1387/19560 | loss 4.081815 (-0.36z)| norm 0.4269 (+1.26z)| lr 7.99e-04 | 4173.17 ms | 32.4% bf16 MFU | 125651 tok/s step 1388/19560 | loss 4.095471 (-0.05z)| norm 0.4675 (+1.96z)| lr 7.99e-04 | 4155.87 ms | 32.5% bf16 MFU | 125677 tok/s step 1389/19560 | loss 4.076349 (-0.48z)| norm 0.3892 (+0.57z)| lr 7.99e-04 | 4160.66 ms | 32.5% bf16 MFU | 125693 tok/s step 1390/19560 | loss 4.081277 (-0.35z)| norm 0.3523 (-0.08z)| lr 7.99e-04 | 4167.39 ms | 32.4% bf16 MFU | 125699 tok/s step 1391/19560 | loss 4.091886 (-0.10z)| norm 0.3344 (-0.39z)| lr 7.99e-04 | 4193.08 ms | 32.2% bf16 MFU | 125666 tok/s step 1392/19560 | loss 4.118309 (+0.51z)| norm 0.3127 (-0.77z)| lr 7.99e-04 | 4162.61 ms | 32.4% bf16 MFU | 125680 tok/s step 1393/19560 | loss 4.047612 (-1.11z)| norm 0.3283 (-0.49z)| lr 7.99e-04 | 4159.11 ms | 32.5% bf16 MFU | 125699 tok/s step 1394/19560 | loss 4.076756 (-0.43z)| norm 0.3181 (-0.67z)| lr 7.99e-04 | 4166.12 ms | 32.4% bf16 MFU | 125706 tok/s step 1395/19560 | loss 4.141691 (+1.06z)| norm 0.2980 (-1.02z)| lr 7.99e-04 | 
4171.00 ms | 32.4% bf16 MFU | 125706 tok/s step 1396/19560 | loss 4.075853 (-0.45z)| norm 0.3106 (-0.78z)| lr 7.99e-04 | 4157.85 ms | 32.5% bf16 MFU | 125725 tok/s step 1397/19560 | loss 4.044823 (-1.15z)| norm 0.3146 (-0.70z)| lr 7.99e-04 | 4172.11 ms | 32.4% bf16 MFU | 125722 tok/s step 1398/19560 | loss 4.023781 (-1.61z)| norm 0.3060 (-0.84z)| lr 7.99e-04 | 4156.46 ms | 32.5% bf16 MFU | 125743 tok/s step 1399/19560 | loss 4.087761 (-0.14z)| norm 0.3069 (-0.81z)| lr 7.99e-04 | 4171.27 ms | 32.4% bf16 MFU | 125741 tok/s step 1400/19560 | loss 4.084147 (-0.22z)| norm 0.2892 (-1.12z)| lr 7.99e-04 | 4169.31 ms | 32.4% bf16 MFU | 125741 tok/s step 1401/19560 | loss 4.058629 (-0.80z)| norm 0.2805 (-1.25z)| lr 7.99e-04 | 4157.71 ms | 32.5% bf16 MFU | 125759 tok/s step 1402/19560 | loss 4.065036 (-0.64z)| norm 0.2987 (-0.91z)| lr 7.99e-04 | 4160.01 ms | 32.5% bf16 MFU | 125773 tok/s step 1403/19560 | loss 4.046400 (-1.06z)| norm 0.2918 (-1.02z)| lr 7.99e-04 | 4166.70 ms | 32.4% bf16 MFU | 125775 tok/s step 1404/19560 | loss 4.029475 (-1.45z)| norm 0.2634 (-1.51z)| lr 7.99e-04 | 4156.23 ms | 32.5% bf16 MFU | 125794 tok/s step 1405/19560 | loss 4.064222 (-0.61z)| norm 0.2732 (-1.32z)| lr 7.99e-04 | 4158.97 ms | 32.5% bf16 MFU | 125807 tok/s step 1406/19560 | loss 4.017522 (-1.69z)| norm 0.2743 (-1.28z)| lr 7.99e-04 | 4179.54 ms | 32.3% bf16 MFU | 125789 tok/s step 1407/19560 | loss 4.064913 (-0.56z)| norm 0.3059 (-0.72z)| lr 7.99e-04 | 4166.54 ms | 32.4% bf16 MFU | 125791 tok/s step 1408/19560 | loss 4.035571 (-1.24z)| norm 0.3601 (+0.23z)| lr 7.99e-04 | 4167.02 ms | 32.4% bf16 MFU | 125792 tok/s step 1409/19560 | loss 4.100949 (+0.30z)| norm 0.4475 (+1.74z)| lr 7.99e-04 | 4175.89 ms | 32.3% bf16 MFU | 125780 tok/s step 1410/19560 | loss 4.049014 (-0.92z)| norm 0.4401 (+1.77z)| lr 7.99e-04 | 4151.11 ms | 32.5% bf16 MFU | 125806 tok/s step 1411/19560 | loss 4.040370 (-1.12z)| norm 0.4202 (+1.39z)| lr 7.99e-04 | 4166.70 ms | 32.4% bf16 MFU | 125808 tok/s step 1412/19560 | 
loss 4.073327 (-0.30z)| norm 0.4716 (+2.37z)| lr 7.99e-04 | 4165.98 ms | 32.4% bf16 MFU | 125810 tok/s step 1413/19560 | loss 4.127645 (+1.02z)| norm 0.4822 (+2.57z)| lr 7.99e-04 | 4151.55 ms | 32.5% bf16 MFU | 125834 tok/s step 1414/19560 | loss 4.062456 (-0.56z)| norm 0.4135 (+1.26z)| lr 7.99e-04 | 4167.72 ms | 32.4% bf16 MFU | 125832 tok/s step 1415/19560 | loss 4.127156 (+1.01z)| norm 0.3158 (-0.57z)| lr 7.99e-04 | 4157.75 ms | 32.5% bf16 MFU | 125845 tok/s step 1416/19560 | loss 4.037050 (-1.17z)| norm 0.2851 (-1.14z)| lr 7.99e-04 | 4167.75 ms | 32.4% bf16 MFU | 125843 tok/s step 1417/19560 | loss 3.997769 (-2.07z)| norm 0.2793 (-1.23z)| lr 7.99e-04 | 4157.58 ms | 32.5% bf16 MFU | 125856 tok/s step 1418/19560 | loss 4.057353 (-0.63z)| norm 0.2823 (-1.16z)| lr 7.99e-04 | 4151.97 ms | 32.5% bf16 MFU | 125877 tok/s step 1419/19560 | loss 3.997060 (-2.05z)| norm 0.3012 (-0.79z)| lr 7.99e-04 | 4224.24 ms | 32.0% bf16 MFU | 125789 tok/s step 1420/19560 | loss 4.039227 (-1.03z)| norm 0.2844 (-1.11z)| lr 7.99e-04 | 4158.46 ms | 32.5% bf16 MFU | 125803 tok/s step 1421/19560 | loss 4.063274 (-0.44z)| norm 0.2771 (-1.23z)| lr 7.99e-04 | 4160.34 ms | 32.5% bf16 MFU | 125814 tok/s step 1422/19560 | loss 3.991323 (-2.16z)| norm 0.2799 (-1.16z)| lr 7.99e-04 | 4154.90 ms | 32.5% bf16 MFU | 125832 tok/s step 1423/19560 | loss 4.042382 (-0.90z)| norm 0.3093 (-0.57z)| lr 7.99e-04 | 4168.05 ms | 32.4% bf16 MFU | 125830 tok/s step 1424/19560 | loss 4.027062 (-1.26z)| norm 0.3237 (-0.28z)| lr 7.99e-04 | 4152.77 ms | 32.5% bf16 MFU | 125851 tok/s step 1425/19560 | loss 4.013453 (-1.57z)| norm 0.3124 (-0.50z)| lr 7.99e-04 | 4305.69 ms | 31.4% bf16 MFU | 125647 tok/s step 1426/19560 | loss 4.012961 (-1.56z)| norm 0.3038 (-0.68z)| lr 7.99e-04 | 4179.13 ms | 32.3% bf16 MFU | 125637 tok/s step 1427/19560 | loss 4.069351 (-0.17z)| norm 0.2960 (-0.83z)| lr 7.99e-04 | 4195.43 ms | 32.2% bf16 MFU | 125604 tok/s step 1428/19560 | loss 4.032322 (-1.06z)| norm 0.2992 (-0.77z)| lr 7.99e-04 | 
4156.03 ms | 32.5% bf16 MFU | 125631 tok/s step 1429/19560 | loss 4.002869 (-1.75z)| norm 0.2982 (-0.80z)| lr 7.99e-04 | 4152.66 ms | 32.5% bf16 MFU | 125662 tok/s step 1430/19560 | loss 4.053920 (-0.50z)| norm 0.3558 (+0.35z)| lr 7.99e-04 | 4181.15 ms | 32.3% bf16 MFU | 125649 tok/s step 1431/19560 | loss 4.031053 (-1.04z)| norm 0.3771 (+0.76z)| lr 7.99e-04 | 4155.61 ms | 32.5% bf16 MFU | 125675 tok/s step 1432/19560 | loss 4.010285 (-1.52z)| norm 0.3585 (+0.38z)| lr 7.99e-04 | 4154.37 ms | 32.5% bf16 MFU | 125701 tok/s step 1433/19560 | loss 3.992064 (-2.00z)| norm 0.3381 (-0.04z)| lr 7.99e-04 | 4154.02 ms | 32.5% bf16 MFU | 125726 tok/s step 1434/19560 | loss 4.026991 (-1.10z)| norm 0.2661 (-1.47z)| lr 7.99e-04 | 4244.87 ms | 31.8% bf16 MFU | 125616 tok/s step 1435/19560 | loss 4.044415 (-0.65z)| norm 0.2848 (-1.08z)| lr 7.99e-04 | 4168.10 ms | 32.4% bf16 MFU | 125624 tok/s step 1436/19560 | loss 3.992420 (-1.94z)| norm 0.2821 (-1.13z)| lr 7.99e-04 | 4154.46 ms | 32.5% bf16 MFU | 125653 tok/s step 1437/19560 | loss 4.020552 (-1.21z)| norm 0.2998 (-0.76z)| lr 7.99e-04 | 4153.13 ms | 32.5% bf16 MFU | 125682 tok/s step 1438/19560 | loss 4.000962 (-1.67z)| norm 0.3236 (-0.28z)| lr 7.99e-04 | 4199.35 ms | 32.2% bf16 MFU | 125641 tok/s step 1439/19560 | loss 4.057458 (-0.25z)| norm 0.3842 (+0.94z)| lr 7.99e-04 | 4150.09 ms | 32.5% bf16 MFU | 125675 tok/s step 1440/19560 | loss 4.003949 (-1.56z)| norm 0.3787 (+0.83z)| lr 7.99e-04 | 4160.35 ms | 32.5% bf16 MFU | 125692 tok/s step 1441/19560 | loss 4.073742 (+0.18z)| norm 0.3323 (-0.10z)| lr 7.99e-04 | 4505.77 ms | 30.0% bf16 MFU | 125226 tok/s step 1442/19560 | loss 4.025428 (-1.01z)| norm 0.3495 (+0.25z)| lr 7.99e-04 | 4784.58 ms | 28.2% bf16 MFU | 124443 tok/s step 1443/19560 | loss 4.058004 (-0.19z)| norm 0.3621 (+0.50z)| lr 7.99e-04 | 4335.87 ms | 31.1% bf16 MFU | 124267 tok/s step 1444/19560 | loss 3.992226 (-1.80z)| norm 0.3622 (+0.49z)| lr 7.99e-04 | 4335.79 ms | 31.1% bf16 MFU | 124100 tok/s step 1445/19560 | 
loss 4.073140 (+0.21z)| norm 0.3248 (-0.26z)| lr 7.99e-04 | 4253.53 ms | 31.7% bf16 MFU | 124058 tok/s step 1446/19560 | loss 4.032723 (-0.78z)| norm 0.3307 (-0.14z)| lr 7.99e-04 | 4435.93 ms | 30.4% bf16 MFU | 123764 tok/s step 1447/19560 | loss 4.035619 (-0.70z)| norm 0.2976 (-0.80z)| lr 7.99e-04 | 4212.12 ms | 32.1% bf16 MFU | 123800 tok/s step 1448/19560 | loss 4.053455 (-0.25z)| norm 0.3083 (-0.57z)| lr 7.99e-04 | 4363.30 ms | 30.9% bf16 MFU | 123618 tok/s step 1449/19560 | loss 4.016829 (-1.14z)| norm 0.2842 (-1.05z)| lr 7.99e-04 | 4163.22 ms | 32.4% bf16 MFU | 123734 tok/s step 1450/19560 | loss 3.985045 (-1.90z)| norm 0.2745 (-1.24z)| lr 7.99e-04 | 4285.67 ms | 31.5% bf16 MFU | 123664 tok/s step 1451/19560 | loss 4.071977 (+0.26z)| norm 0.2925 (-0.86z)| lr 7.99e-04 | 4218.56 ms | 32.0% bf16 MFU | 123694 tok/s step 1452/19560 | loss 4.089956 (+0.72z)| norm 0.3290 (-0.11z)| lr 7.99e-04 | 4166.64 ms | 32.4% bf16 MFU | 123801 tok/s step 1453/19560 | loss 4.037048 (-0.60z)| norm 0.3887 (+1.12z)| lr 7.99e-04 | 4186.71 ms | 32.2% bf16 MFU | 123873 tok/s step 1454/19560 | loss 4.089741 (+0.73z)| norm 0.3662 (+0.64z)| lr 7.99e-04 | 4697.26 ms | 28.7% bf16 MFU | 123260 tok/s step 1455/19560 | loss 4.055682 (-0.13z)| norm 0.3499 (+0.29z)| lr 7.99e-04 | 4284.21 ms | 31.5% bf16 MFU | 123216 tok/s step 1456/19560 | loss 4.027428 (-0.82z)| norm 0.3772 (+0.85z)| lr 7.99e-04 | 4252.32 ms | 31.8% bf16 MFU | 123219 tok/s step 1457/19560 | loss 4.112599 (+1.32z)| norm 0.3817 (+0.95z)| lr 7.99e-04 | 4196.07 ms | 32.2% bf16 MFU | 123306 tok/s step 1458/19560 | loss 4.091842 (+0.79z)| norm 0.3445 (+0.18z)| lr 7.99e-04 | 4168.33 ms | 32.4% bf16 MFU | 123430 tok/s step 1459/19560 | loss 4.061668 (+0.04z)| norm 0.3434 (+0.16z)| lr 7.99e-04 | 4167.02 ms | 32.4% bf16 MFU | 123549 tok/s step 1460/19560 | loss 4.057124 (-0.06z)| norm 0.2985 (-0.77z)| lr 7.99e-04 | 4172.90 ms | 32.4% bf16 MFU | 123654 tok/s step 1461/19560 | loss 4.061155 (+0.05z)| norm 0.3115 (-0.50z)| lr 7.99e-04 | 
4176.43 ms | 32.3% bf16 MFU | 123748 tok/s step 1462/19560 | loss 4.073643 (+0.37z)| norm 0.3046 (-0.64z)| lr 7.99e-04 | 4187.53 ms | 32.2% bf16 MFU | 123820 tok/s step 1463/19560 | loss 4.081179 (+0.57z)| norm 0.2841 (-1.06z)| lr 7.99e-04 | 4158.14 ms | 32.5% bf16 MFU | 123934 tok/s step 1464/19560 | loss 4.009396 (-1.27z)| norm 0.2839 (-1.07z)| lr 7.99e-04 | 4204.07 ms | 32.1% bf16 MFU | 123973 tok/s step 1465/19560 | loss 4.022382 (-0.92z)| norm 0.2914 (-0.90z)| lr 7.99e-04 | 4158.75 ms | 32.5% bf16 MFU | 124077 tok/s step 1466/19560 | loss 4.034468 (-0.60z)| norm 0.2870 (-0.98z)| lr 7.99e-04 | 4171.00 ms | 32.4% bf16 MFU | 124158 tok/s step 1467/19560 | loss 4.028193 (-0.76z)| norm 0.2711 (-1.28z)| lr 7.99e-04 | 4179.32 ms | 32.3% bf16 MFU | 124223 tok/s step 1468/19560 | loss 4.033910 (-0.60z)| norm 0.2637 (-1.42z)| lr 7.99e-04 | 4161.37 ms | 32.4% bf16 MFU | 124311 tok/s step 1469/19560 | loss 4.052443 (-0.13z)| norm 0.2849 (-0.98z)| lr 7.99e-04 | 4179.33 ms | 32.3% bf16 MFU | 124368 tok/s step 1470/19560 | loss 4.088305 (+0.79z)| norm 0.2912 (-0.85z)| lr 7.99e-04 | 4176.47 ms | 32.3% bf16 MFU | 124426 tok/s step 1471/19560 | loss 3.996492 (-1.54z)| norm 0.2920 (-0.83z)| lr 7.99e-04 | 4245.47 ms | 31.8% bf16 MFU | 124380 tok/s step 1472/19560 | loss 4.072537 (+0.39z)| norm 0.2965 (-0.72z)| lr 7.99e-04 | 4183.75 ms | 32.3% bf16 MFU | 124426 tok/s step 1473/19560 | loss 3.996080 (-1.55z)| norm 0.3221 (-0.19z)| lr 7.99e-04 | 4164.45 ms | 32.4% bf16 MFU | 124500 tok/s step 1474/19560 | loss 4.042309 (-0.38z)| norm 0.3467 (+0.30z)| lr 7.99e-04 | 4241.58 ms | 31.8% bf16 MFU | 124455 tok/s step 1475/19560 | loss 3.981664 (-1.87z)| norm 0.3053 (-0.53z)| lr 7.99e-04 | 4160.42 ms | 32.5% bf16 MFU | 124533 tok/s step 1476/19560 | loss 4.023492 (-0.84z)| norm 0.3176 (-0.27z)| lr 7.99e-04 | 4177.46 ms | 32.3% bf16 MFU | 124582 tok/s step 1477/19560 | loss 4.006360 (-1.25z)| norm 0.3284 (-0.03z)| lr 7.99e-04 | 4164.24 ms | 32.4% bf16 MFU | 124648 tok/s step 1478/19560 | 
loss 4.017329 (-0.97z)| norm 0.3012 (-0.59z)| lr 7.99e-04 | 4185.28 ms | 32.3% bf16 MFU | 124679 tok/s step 1479/19560 | loss 3.944571 (-2.74z)| norm 0.3182 (-0.24z)| lr 7.99e-04 | 4167.07 ms | 32.4% bf16 MFU | 124736 tok/s step 1480/19560 | loss 4.078610 (+0.61z)| norm 0.3071 (-0.46z)| lr 7.99e-04 | 4176.81 ms | 32.3% bf16 MFU | 124775 tok/s step 1481/19560 | loss 4.016851 (-0.92z)| norm 0.3069 (-0.45z)| lr 7.99e-04 | 4179.56 ms | 32.3% bf16 MFU | 124809 tok/s step 1482/19560 | loss 4.017203 (-0.90z)| norm 0.3125 (-0.33z)| lr 7.99e-04 | 4190.00 ms | 32.2% bf16 MFU | 124825 tok/s step 1483/19560 | loss 4.076262 (+0.57z)| norm 0.3447 (+0.34z)| lr 7.99e-04 | 4180.89 ms | 32.3% bf16 MFU | 124853 tok/s step 1484/19560 | loss 3.977703 (-1.85z)| norm 0.3435 (+0.31z)| lr 7.99e-04 | 4164.90 ms | 32.4% bf16 MFU | 124905 tok/s step 1485/19560 | loss 4.086383 (+0.84z)| norm 0.3140 (-0.29z)| lr 7.99e-04 | 4214.05 ms | 32.0% bf16 MFU | 124880 tok/s step 1486/19560 | loss 4.014701 (-0.92z)| norm 0.3412 (+0.29z)| lr 7.99e-04 | 4182.52 ms | 32.3% bf16 MFU | 124904 tok/s step 1487/19560 | loss 4.066819 (+0.38z)| norm 0.3383 (+0.23z)| lr 7.99e-04 | 4226.65 ms | 31.9% bf16 MFU | 124861 tok/s step 1488/19560 | loss 3.974621 (-1.88z)| norm 0.3659 (+0.81z)| lr 7.99e-04 | 4200.02 ms | 32.1% bf16 MFU | 124859 tok/s step 1489/19560 | loss 4.038350 (-0.30z)| norm 0.3215 (-0.12z)| lr 7.99e-04 | 4167.81 ms | 32.4% bf16 MFU | 124906 tok/s step 1490/19560 | loss 4.046762 (-0.10z)| norm 0.3164 (-0.23z)| lr 7.99e-04 | 4179.45 ms | 32.3% bf16 MFU | 124933 tok/s step 1491/19560 | loss 4.055670 (+0.13z)| norm 0.3143 (-0.27z)| lr 7.99e-04 | 4230.50 ms | 31.9% bf16 MFU | 124883 tok/s step 1492/19560 | loss 4.010318 (-0.99z)| norm 0.3052 (-0.45z)| lr 7.99e-04 | 4173.17 ms | 32.4% bf16 MFU | 124920 tok/s step 1493/19560 | loss 3.993039 (-1.40z)| norm 0.2931 (-0.70z)| lr 7.99e-04 | 4220.10 ms | 32.0% bf16 MFU | 124886 tok/s step 1494/19560 | loss 4.079976 (+0.80z)| norm 0.2993 (-0.55z)| lr 7.99e-04 | 
4161.45 ms | 32.4% bf16 MFU | 124941 tok/s step 1495/19560 | loss 4.006802 (-1.06z)| norm 0.3163 (-0.18z)| lr 7.99e-04 | 4160.04 ms | 32.5% bf16 MFU | 124996 tok/s step 1496/19560 | loss 4.035160 (-0.34z)| norm 0.3679 (+0.91z)| lr 7.99e-04 | 4196.28 ms | 32.2% bf16 MFU | 124993 tok/s step 1497/19560 | loss 4.066152 (+0.45z)| norm 0.3420 (+0.35z)| lr 7.99e-04 | 4203.17 ms | 32.1% bf16 MFU | 124980 tok/s step 1498/19560 | loss 4.005123 (-1.10z)| norm 0.3564 (+0.66z)| lr 7.99e-04 | 4176.92 ms | 32.3% bf16 MFU | 125007 tok/s step 1499/19560 | loss 3.982961 (-1.64z)| norm 0.3645 (+0.82z)| lr 7.99e-04 | 4178.33 ms | 32.3% bf16 MFU | 125031 tok/s step 1500/19560 | loss 4.044853 (-0.06z)| norm 0.3491 (+0.48z)| lr 7.99e-04 | 4332.83 ms | 31.2% bf16 MFU | 124829 tok/s val loss 4.015504
HellaSwag: 2587/10042 = 0.257618 step 1501/19560 | loss 4.036165 (-0.26z)| norm 0.3450 (+0.39z)| lr 7.99e-04 | 4433.11 ms | 30.5% bf16 MFU | 124501 tok/s step 1502/19560 | loss 3.970650 (-1.95z)| norm 0.3363 (+0.19z)| lr 7.99e-04 | 4216.56 ms | 32.0% bf16 MFU | 124493 tok/s step 1503/19560 | loss 4.018176 (-0.70z)| norm 0.3304 (+0.06z)| lr 7.99e-04 | 4299.15 ms | 31.4% bf16 MFU | 124366 tok/s step 1504/19560 | loss 4.039565 (-0.15z)| norm 0.3289 (+0.03z)| lr 7.99e-04 | 4202.23 ms | 32.1% bf16 MFU | 124386 tok/s step 1505/19560 | loss 4.028884 (-0.42z)| norm 0.3281 (+0.00z)| lr 7.99e-04 | 4228.61 ms | 31.9% bf16 MFU | 124366 tok/s step 1506/19560 | loss 4.045392 (+0.01z)| norm 0.3054 (-0.49z)| lr 7.99e-04 | 4207.17 ms | 32.1% bf16 MFU | 124378 tok/s step 1507/19560 | loss 3.947285 (-2.50z)| norm 0.3083 (-0.43z)| lr 7.99e-04 | 4272.59 ms | 31.6% bf16 MFU | 124295 tok/s step 1508/19560 | loss 4.017006 (-0.70z)| norm 0.2905 (-0.82z)| lr 7.99e-04 | 4242.15 ms | 31.8% bf16 MFU | 124260 tok/s step 1509/19560 | loss 3.978379 (-1.69z)| norm 0.3064 (-0.48z)| lr 7.99e-04 | 4234.37 ms | 31.9% bf16 MFU | 124238 tok/s step 1510/19560 | loss 3.989697 (-1.38z)| norm 0.2893 (-0.85z)| lr 7.99e-04 | 4194.63 ms | 32.2% bf16 MFU | 124275 tok/s step 1511/19560 | loss 3.986808 (-1.44z)| 
norm 0.3078 (-0.43z)| lr 7.99e-04 | 4166.63 ms | 32.4% bf16 MFU | 124353 tok/s step 1512/19560 | loss 4.049347 (+0.23z)| norm 0.3266 (+0.00z)| lr 7.99e-04 | 4191.00 ms | 32.2% bf16 MFU | 124390 tok/s step 1513/19560 | loss 4.026424 (-0.38z)| norm 0.3407 (+0.34z)| lr 7.98e-04 | 4268.79 ms | 31.6% bf16 MFU | 124312 tok/s step 1514/19560 | loss 4.071344 (+0.82z)| norm 0.3238 (-0.04z)| lr 7.98e-04 | 4216.83 ms | 32.0% bf16 MFU | 124313 tok/s step 1515/19560 | loss 4.062141 (+0.58z)| norm 0.3074 (-0.41z)| lr 7.98e-04 | 4356.54 ms | 31.0% bf16 MFU | 124114 tok/s step 1516/19560 | loss 3.973867 (-1.75z)| norm 0.3031 (-0.50z)| lr 7.98e-04 | 4201.46 ms | 32.1% bf16 MFU | 124148 tok/s step 1517/19560 | loss 4.027152 (-0.32z)| norm 0.2929 (-0.74z)| lr 7.98e-04 | 4354.65 ms | 31.0% bf16 MFU | 123960 tok/s step 1518/19560 | loss 3.959351 (-2.08z)| norm 0.2961 (-0.65z)| lr 7.98e-04 | 4348.48 ms | 31.0% bf16 MFU | 123791 tok/s step 1519/19560 | loss 4.014198 (-0.62z)| norm 0.2811 (-1.01z)| lr 7.98e-04 | 4167.45 ms | 32.4% bf16 MFU | 123892 tok/s step 1520/19560 | loss 4.008257 (-0.77z)| norm 0.2826 (-0.97z)| lr 7.98e-04 | 4162.83 ms | 32.4% bf16 MFU | 123994 tok/s step 1521/19560 | loss 3.996446 (-1.07z)| norm 0.2814 (-0.98z)| lr 7.98e-04 | 4219.73 ms | 32.0% bf16 MFU | 124007 tok/s step 1522/19560 | loss 4.067472 (+0.84z)| norm 0.2934 (-0.68z)| lr 7.98e-04 | 4167.58 ms | 32.4% bf16 MFU | 124097 tok/s step 1523/19560 | loss 4.031099 (-0.12z)| norm 0.2968 (-0.59z)| lr 7.98e-04 | 4164.46 ms | 32.4% bf16 MFU | 124187 tok/s step 1524/19560 | loss 4.066408 (+0.87z)| norm 0.3048 (-0.40z)| lr 7.98e-04 | 4208.39 ms | 32.1% bf16 MFU | 124206 tok/s step 1525/19560 | loss 4.109551 (+2.03z)| norm 0.3377 (+0.41z)| lr 7.98e-04 | 4160.07 ms | 32.5% bf16 MFU | 124297 tok/s step 1526/19560 | loss 4.008693 (-0.74z)| norm 0.3343 (+0.32z)| lr 7.98e-04 | 4178.36 ms | 32.3% bf16 MFU | 124356 tok/s step 1527/19560 | loss 3.967245 (-1.84z)| norm 0.3198 (-0.04z)| lr 7.98e-04 | 4225.42 ms | 32.0% bf16 MFU 
| 124343 tok/s step 1528/19560 | loss 3.988703 (-1.24z)| norm 0.3434 (+0.54z)| lr 7.98e-04 | 4216.15 ms | 32.0% bf16 MFU | 124343 tok/s step 1529/19560 | loss 4.034891 (+0.03z)| norm 0.3567 (+0.85z)| lr 7.98e-04 | 4173.74 ms | 32.3% bf16 MFU | 124407 tok/s step 1530/19560 | loss 3.958252 (-2.02z)| norm 0.3123 (-0.25z)| lr 7.98e-04 | 4226.63 ms | 31.9% bf16 MFU | 124389 tok/s step 1531/19560 | loss 4.097017 (+1.70z)| norm 0.3030 (-0.49z)| lr 7.98e-04 | 4279.77 ms | 31.5% bf16 MFU | 124294 tok/s step 1532/19560 | loss 4.013464 (-0.53z)| norm 0.3402 (+0.43z)| lr 7.98e-04 | 4157.71 ms | 32.5% bf16 MFU | 124385 tok/s step 1533/19560 | loss 4.079620 (+1.23z)| norm 0.3209 (-0.07z)| lr 7.98e-04 | 4228.00 ms | 31.9% bf16 MFU | 124366 tok/s step 1534/19560 | loss 4.019424 (-0.37z)| norm 0.3036 (-0.51z)| lr 7.98e-04 | 4239.38 ms | 31.8% bf16 MFU | 124331 tok/s step 1535/19560 | loss 3.969907 (-1.65z)| norm 0.3252 (+0.03z)| lr 7.98e-04 | 4258.17 ms | 31.7% bf16 MFU | 124271 tok/s step 1536/19560 | loss 4.006427 (-0.68z)| norm 0.2974 (-0.66z)| lr 7.98e-04 | 4165.75 ms | 32.4% bf16 MFU | 124350 tok/s step 1537/19560 | loss 4.015361 (-0.44z)| norm 0.2897 (-0.86z)| lr 7.98e-04 | 4285.56 ms | 31.5% bf16 MFU | 124249 tok/s step 1538/19560 | loss 3.990358 (-1.09z)| norm 0.3087 (-0.34z)| lr 7.98e-04 | 4229.28 ms | 31.9% bf16 MFU | 124235 tok/s step 1539/19560 | loss 4.010583 (-0.54z)| norm 0.3115 (-0.25z)| lr 7.98e-04 | 4157.09 ms | 32.5% bf16 MFU | 124329 tok/s step 1540/19560 | loss 4.082360 (+1.36z)| norm 0.3375 (+0.56z)| lr 7.98e-04 | 4258.20 ms | 31.7% bf16 MFU | 124269 tok/s step 1541/19560 | loss 4.073571 (+1.16z)| norm 0.2833 (-1.16z)| lr 7.98e-04 | 4171.76 ms | 32.4% bf16 MFU | 124339 tok/s step 1542/19560 | loss 4.035318 (+0.13z)| norm 0.2821 (-1.21z)| lr 7.98e-04 | 4213.11 ms | 32.0% bf16 MFU | 124345 tok/s step 1543/19560 | loss 3.974159 (-1.52z)| norm 0.2686 (-1.66z)| lr 7.98e-04 | 4164.09 ms | 32.4% bf16 MFU | 124423 tok/s step 1544/19560 | loss 4.003042 (-0.72z)| norm 
0.2542 (-2.12z)| lr 7.98e-04 | 4177.91 ms | 32.3% bf16 MFU | 124476 tok/s step 1545/19560 | loss 4.093450 (+1.74z)| norm 0.2658 (-1.71z)| lr 7.98e-04 | 4207.82 ms | 32.1% bf16 MFU | 124482 tok/s step 1546/19560 | loss 3.973683 (-1.50z)| norm 0.2836 (-1.10z)| lr 7.98e-04 | 4181.39 ms | 32.3% bf16 MFU | 124527 tok/s step 1547/19560 | loss 4.014740 (-0.39z)| norm 0.3501 (+1.15z)| lr 7.98e-04 | 4193.48 ms | 32.2% bf16 MFU | 124552 tok/s step 1548/19560 | loss 3.987003 (-1.13z)| norm 0.3789 (+2.08z)| lr 7.98e-04 | 4170.69 ms | 32.4% bf16 MFU | 124610 tok/s step 1549/19560 | loss 4.004115 (-0.66z)| norm 0.3328 (+0.52z)| lr 7.98e-04 | 4172.67 ms | 32.4% bf16 MFU | 124662 tok/s step 1550/19560 | loss 3.994384 (-0.92z)| norm 0.3033 (-0.49z)| lr 7.98e-04 | 4175.56 ms | 32.3% bf16 MFU | 124707 tok/s step 1551/19560 | loss 4.002492 (-0.69z)| norm 0.4056 (+2.86z)| lr 7.98e-04 | 4174.28 ms | 32.3% bf16 MFU | 124751 tok/s step 1552/19560 | loss 4.102250 (+1.96z)| norm 0.3488 (+0.99z)| lr 7.98e-04 | 4163.32 ms | 32.4% bf16 MFU | 124810 tok/s step 1553/19560 | loss 4.069482 (+1.07z)| norm 0.3877 (+2.20z)| lr 7.98e-04 | 4173.51 ms | 32.4% bf16 MFU | 124851 tok/s step 1554/19560 | loss 4.001881 (-0.72z)| norm 0.3342 (+0.47z)| lr 7.98e-04 | 4174.38 ms | 32.3% bf16 MFU | 124888 tok/s step 1555/19560 | loss 4.025109 (-0.09z)| norm 0.3463 (+0.85z)| lr 7.98e-04 | 4165.22 ms | 32.4% bf16 MFU | 124938 tok/s step 1556/19560 | loss 4.016588 (-0.32z)| norm 0.3694 (+1.56z)| lr 7.98e-04 | 4171.21 ms | 32.4% bf16 MFU | 124975 tok/s step 1557/19560 | loss 4.008993 (-0.52z)| norm 0.3567 (+1.14z)| lr 7.98e-04 | 4175.46 ms | 32.3% bf16 MFU | 125005 tok/s step 1558/19560 | loss 3.999303 (-0.77z)| norm 0.3117 (-0.28z)| lr 7.98e-04 | 4177.75 ms | 32.3% bf16 MFU | 125029 tok/s step 1559/19560 | loss 3.993066 (-0.92z)| norm 0.3115 (-0.27z)| lr 7.98e-04 | 4254.39 ms | 31.7% bf16 MFU | 124940 tok/s step 1560/19560 | loss 4.039823 (+0.31z)| norm 0.3076 (-0.39z)| lr 7.98e-04 | 4178.88 ms | 32.3% bf16 MFU | 
124966 tok/s step 1561/19560 | loss 4.024588 (-0.10z)| norm 0.3065 (-0.42z)| lr 7.98e-04 | 4215.76 ms | 32.0% bf16 MFU | 124936 tok/s step 1562/19560 | loss 4.058476 (+0.79z)| norm 0.3043 (-0.50z)| lr 7.98e-04 | 4160.73 ms | 32.5% bf16 MFU | 124989 tok/s step 1563/19560 | loss 4.030045 (+0.04z)| norm 0.2773 (-1.39z)| lr 7.98e-04 | 4179.46 ms | 32.3% bf16 MFU | 125012 tok/s step 1564/19560 | loss 4.011981 (-0.44z)| norm 0.2977 (-0.72z)| lr 7.98e-04 | 4280.91 ms | 31.5% bf16 MFU | 124885 tok/s step 1565/19560 | loss 3.985708 (-1.13z)| norm 0.3211 (+0.04z)| lr 7.98e-04 | 4168.94 ms | 32.4% bf16 MFU | 124929 tok/s step 1566/19560 | loss 3.973429 (-1.44z)| norm 0.3339 (+0.46z)| lr 7.98e-04 | 4162.67 ms | 32.4% bf16 MFU | 124980 tok/s step 1567/19560 | loss 3.993423 (-0.90z)| norm 0.3091 (-0.34z)| lr 7.98e-04 | 4236.82 ms | 31.9% bf16 MFU | 124918 tok/s step 1568/19560 | loss 4.006369 (-0.56z)| norm 0.2838 (-1.18z)| lr 7.98e-04 | 4202.97 ms | 32.1% bf16 MFU | 124909 tok/s step 1569/19560 | loss 3.969593 (-1.51z)| norm 0.2828 (-1.19z)| lr 7.98e-04 | 4179.72 ms | 32.3% bf16 MFU | 124936 tok/s step 1570/19560 | loss 3.965986 (-1.57z)| norm 0.2736 (-1.48z)| lr 7.98e-04 | 4182.88 ms | 32.3% bf16 MFU | 124956 tok/s step 1571/19560 | loss 3.998007 (-0.73z)| norm 0.2881 (-0.98z)| lr 7.98e-04 | 4192.43 ms | 32.2% bf16 MFU | 124961 tok/s step 1572/19560 | loss 3.997921 (-0.73z)| norm 0.2920 (-0.83z)| lr 7.98e-04 | 4175.31 ms | 32.3% bf16 MFU | 124991 tok/s step 1573/19560 | loss 4.076382 (+1.31z)| norm 0.3164 (-0.00z)| lr 7.98e-04 | 4270.30 ms | 31.6% bf16 MFU | 124880 tok/s step 1574/19560 | loss 4.000532 (-0.66z)| norm 0.3213 (+0.17z)| lr 7.98e-04 | 4393.62 ms | 30.7% bf16 MFU | 124603 tok/s step 1575/19560 | loss 4.005636 (-0.52z)| norm 0.2817 (-1.17z)| lr 7.98e-04 | 4172.17 ms | 32.4% bf16 MFU | 124656 tok/s step 1576/19560 | loss 3.992217 (-0.85z)| norm 0.2835 (-1.10z)| lr 7.98e-04 | 4179.26 ms | 32.3% bf16 MFU | 124696 tok/s step 1577/19560 | loss 3.993285 (-0.82z)| norm 
0.2761 (-1.34z)| lr 7.98e-04 | 4169.89 ms | 32.4% bf16 MFU | 124747 tok/s step 1578/19560 | loss 3.972765 (-1.34z)| norm 0.2701 (-1.54z)| lr 7.98e-04 | 4177.98 ms | 32.3% bf16 MFU | 124784 tok/s step 1579/19560 | loss 3.949680 (-1.90z)| norm 0.2649 (-1.69z)| lr 7.98e-04 | 4168.19 ms | 32.4% bf16 MFU | 124834 tok/s step 1580/19560 | loss 4.065691 (+1.09z)| norm 0.2806 (-1.15z)| lr 7.98e-04 | 4178.65 ms | 32.3% bf16 MFU | 124866 tok/s step 1581/19560 | loss 3.982029 (-1.06z)| norm 0.3049 (-0.33z)| lr 7.98e-04 | 4182.63 ms | 32.3% bf16 MFU | 124890 tok/s step 1582/19560 | loss 4.032782 (+0.26z)| norm 0.3139 (-0.02z)| lr 7.98e-04 | 4171.11 ms | 32.4% bf16 MFU | 124930 tok/s step 1583/19560 | loss 4.032783 (+0.27z)| norm 0.3210 (+0.24z)| lr 7.98e-04 | 4180.28 ms | 32.3% bf16 MFU | 124955 tok/s step 1584/19560 | loss 4.035578 (+0.34z)| norm 0.3306 (+0.59z)| lr 7.98e-04 | 4198.17 ms | 32.2% bf16 MFU | 124951 tok/s step 1585/19560 | loss 4.065544 (+1.15z)| norm 0.3604 (+1.66z)| lr 7.98e-04 | 4162.23 ms | 32.4% bf16 MFU | 125002 tok/s step 1586/19560 | loss 4.008288 (-0.36z)| norm 0.3373 (+0.85z)| lr 7.98e-04 | 4179.22 ms | 32.3% bf16 MFU | 125024 tok/s step 1587/19560 | loss 4.025108 (+0.10z)| norm 0.3393 (+0.92z)| lr 7.98e-04 | 4165.38 ms | 32.4% bf16 MFU | 125067 tok/s step 1588/19560 | loss 3.985275 (-0.96z)| norm 0.3533 (+1.39z)| lr 7.98e-04 | 4174.59 ms | 32.3% bf16 MFU | 125093 tok/s step 1589/19560 | loss 3.952520 (-1.80z)| norm 0.3083 (-0.20z)| lr 7.98e-04 | 4173.59 ms | 32.4% bf16 MFU | 125119 tok/s step 1590/19560 | loss 3.994573 (-0.66z)| norm 0.3049 (-0.32z)| lr 7.98e-04 | 4163.99 ms | 32.4% bf16 MFU | 125159 tok/s step 1591/19560 | loss 3.982153 (-0.99z)| norm 0.3250 (+0.38z)| lr 7.98e-04 | 4193.39 ms | 32.2% bf16 MFU | 125152 tok/s step 1592/19560 | loss 4.058699 (+1.07z)| norm 0.3236 (+0.32z)| lr 7.98e-04 | 4170.53 ms | 32.4% bf16 MFU | 125180 tok/s step 1593/19560 | loss 3.974391 (-1.18z)| norm 0.3742 (+2.08z)| lr 7.98e-04 | 4171.16 ms | 32.4% bf16 MFU | 
125206 tok/s step 1594/19560 | loss 4.020996 (+0.07z)| norm 0.3077 (-0.27z)| lr 7.98e-04 | 4178.05 ms | 32.3% bf16 MFU | 125220 tok/s step 1595/19560 | loss 4.012808 (-0.15z)| norm 0.3193 (+0.13z)| lr 7.98e-04 | 4744.24 ms | 28.5% bf16 MFU | 124484 tok/s step 1596/19560 | loss 3.928265 (-2.35z)| norm 0.3613 (+1.60z)| lr 7.98e-04 | 4171.42 ms | 32.4% bf16 MFU | 124544 tok/s step 1597/19560 | loss 3.980853 (-0.95z)| norm 0.3190 (+0.08z)| lr 7.98e-04 | 4163.64 ms | 32.4% bf16 MFU | 124613 tok/s step 1598/19560 | loss 3.954683 (-1.61z)| norm 0.2944 (-0.80z)| lr 7.98e-04 | 4166.44 ms | 32.4% bf16 MFU | 124674 tok/s step 1599/19560 | loss 3.972151 (-1.14z)| norm 0.3666 (+1.76z)| lr 7.98e-04 | 4221.97 ms | 32.0% bf16 MFU | 124650 tok/s step 1600/19560 | loss 3.958064 (-1.49z)| norm 0.3429 (+0.90z)| lr 7.98e-04 | 4161.25 ms | 32.4% bf16 MFU | 124717 tok/s step 1601/19560 | loss 3.983739 (-0.81z)| norm 0.3328 (+0.53z)| lr 7.98e-04 | 4211.55 ms | 32.1% bf16 MFU | 124705 tok/s step 1602/19560 | loss 3.977355 (-0.96z)| norm 0.3423 (+0.88z)| lr 7.98e-04 | 4164.58 ms | 32.4% bf16 MFU | 124765 tok/s step 1603/19560 | loss 3.988119 (-0.68z)| norm 0.3079 (-0.35z)| lr 7.98e-04 | 4250.32 ms | 31.8% bf16 MFU | 124694 tok/s step 1604/19560 | loss 3.981018 (-0.86z)| norm 0.3003 (-0.62z)| lr 7.98e-04 | 4205.01 ms | 32.1% bf16 MFU | 124694 tok/s step 1605/19560 | loss 3.937206 (-1.96z)| norm 0.2787 (-1.36z)| lr 7.98e-04 | 4338.89 ms | 31.1% bf16 MFU | 124501 tok/s step 1606/19560 | loss 4.001044 (-0.31z)| norm 0.2709 (-1.61z)| lr 7.98e-04 | 4184.35 ms | 32.3% bf16 MFU | 124540 tok/s step 1607/19560 | loss 4.004314 (-0.24z)| norm 0.2906 (-0.91z)| lr 7.98e-04 | 4563.34 ms | 29.6% bf16 MFU | 124058 tok/s step 1608/19560 | loss 3.995368 (-0.47z)| norm 0.3349 (+0.62z)| lr 7.98e-04 | 4210.94 ms | 32.1% bf16 MFU | 124080 tok/s step 1609/19560 | loss 4.042480 (+0.77z)| norm 0.3297 (+0.44z)| lr 7.98e-04 | 4232.16 ms | 31.9% bf16 MFU | 124070 tok/s step 1610/19560 | loss 3.953025 (-1.55z)| norm 
0.2791 (-1.31z)| lr 7.98e-04 | 4216.65 ms | 32.0% bf16 MFU | 124084 tok/s step 1611/19560 | loss 3.982665 (-0.77z)| norm 0.3201 (+0.12z)| lr 7.98e-04 | 4157.23 ms | 32.5% bf16 MFU | 124185 tok/s step 1612/19560 | loss 3.958914 (-1.38z)| norm 0.3414 (+0.86z)| lr 7.98e-04 | 4167.50 ms | 32.4% bf16 MFU | 124266 tok/s step 1613/19560 | loss 3.994080 (-0.45z)| norm 0.3402 (+0.81z)| lr 7.98e-04 | 4209.54 ms | 32.1% bf16 MFU | 124280 tok/s step 1614/19560 | loss 3.986625 (-0.64z)| norm 0.3212 (+0.16z)| lr 7.98e-04 | 4221.35 ms | 32.0% bf16 MFU | 124276 tok/s step 1615/19560 | loss 3.974026 (-0.96z)| norm 0.3092 (-0.25z)| lr 7.98e-04 | 4156.93 ms | 32.5% bf16 MFU | 124369 tok/s step 1616/19560 | loss 3.948889 (-1.61z)| norm 0.2886 (-0.96z)| lr 7.98e-04 | 4224.95 ms | 32.0% bf16 MFU | 124355 tok/s step 1617/19560 | loss 3.981877 (-0.73z)| norm 0.3077 (-0.29z)| lr 7.98e-04 | 4227.93 ms | 31.9% bf16 MFU | 124337 tok/s step 1618/19560 | loss 3.977095 (-0.84z)| norm 0.2701 (-1.57z)| lr 7.98e-04 | 4261.85 ms | 31.7% bf16 MFU | 124272 tok/s step 1619/19560 | loss 3.972259 (-0.96z)| norm 0.2361 (-2.66z)| lr 7.98e-04 | 4155.67 ms | 32.5% bf16 MFU | 124366 tok/s step 1620/19560 | loss 3.974846 (-0.88z)| norm 0.2556 (-1.96z)| lr 7.98e-04 | 4186.16 ms | 32.3% bf16 MFU | 124410 tok/s step 1621/19560 | loss 3.954911 (-1.39z)| norm 0.2560 (-1.91z)| lr 7.98e-04 | 4166.06 ms | 32.4% bf16 MFU | 124482 tok/s step 1622/19560 | loss 3.951382 (-1.46z)| norm 0.2313 (-2.63z)| lr 7.98e-04 | 4278.72 ms | 31.6% bf16 MFU | 124384 tok/s step 1623/19560 | loss 3.955508 (-1.33z)| norm 0.2425 (-2.21z)| lr 7.98e-04 | 4296.33 ms | 31.4% bf16 MFU | 124267 tok/s step 1624/19560 | loss 3.945999 (-1.55z)| norm 0.2461 (-2.06z)| lr 7.98e-04 | 4195.75 ms | 32.2% bf16 MFU | 124301 tok/s step 1625/19560 | loss 3.961200 (-1.14z)| norm 0.2681 (-1.35z)| lr 7.98e-04 | 4205.64 ms | 32.1% bf16 MFU | 124319 tok/s step 1626/19560 | loss 3.942893 (-1.59z)| norm 0.2879 (-0.72z)| lr 7.98e-04 | 4164.67 ms | 32.4% bf16 MFU | 
124398 tok/s step 1627/19560 | loss 3.906284 (-2.47z)| norm 0.3034 (-0.22z)| lr 7.98e-04 | 4232.73 ms | 31.9% bf16 MFU | 124371 tok/s step 1628/19560 | loss 3.938867 (-1.61z)| norm 0.3055 (-0.15z)| lr 7.98e-04 | 4206.76 ms | 32.1% bf16 MFU | 124384 tok/s step 1629/19560 | loss 3.995821 (-0.17z)| norm 0.3169 (+0.22z)| lr 7.98e-04 | 4184.16 ms | 32.3% bf16 MFU | 124430 tok/s step 1630/19560 | loss 3.929821 (-1.81z)| norm 0.3143 (+0.14z)| lr 7.98e-04 | 4161.32 ms | 32.4% bf16 MFU | 124508 tok/s step 1631/19560 | loss 3.964344 (-0.93z)| norm 0.3357 (+0.83z)| lr 7.98e-04 | 4177.20 ms | 32.3% bf16 MFU | 124558 tok/s step 1632/19560 | loss 3.977715 (-0.59z)| norm 0.3336 (+0.76z)| lr 7.98e-04 | 4995.89 ms | 27.0% bf16 MFU | 123578 tok/s step 1633/19560 | loss 3.935534 (-1.61z)| norm 0.2745 (-1.11z)| lr 7.98e-04 | 4535.05 ms | 29.8% bf16 MFU | 123179 tok/s step 1634/19560 | loss 4.012840 (+0.31z)| norm 0.2988 (-0.34z)| lr 7.98e-04 | 4308.26 ms | 31.3% bf16 MFU | 123105 tok/s step 1635/19560 | loss 3.970777 (-0.74z)| norm 0.3017 (-0.24z)| lr 7.98e-04 | 4409.38 ms | 30.6% bf16 MFU | 122895 tok/s step 1636/19560 | loss 3.917670 (-2.02z)| norm 0.3616 (+1.63z)| lr 7.98e-04 | 4482.49 ms | 30.1% bf16 MFU | 122598 tok/s step 1637/19560 | loss 4.116191 (+2.75z)| norm 0.3097 (-0.01z)| lr 7.98e-04 | 4168.05 ms | 32.4% bf16 MFU | 122758 tok/s step 1638/19560 | loss 3.990536 (-0.25z)| norm 0.2924 (-0.55z)| lr 7.98e-04 | 4282.18 ms | 31.5% bf16 MFU | 122742 tok/s step 1639/19560 | loss 3.927336 (-1.72z)| norm 0.2525 (-1.78z)| lr 7.98e-04 | 4261.89 ms | 31.7% bf16 MFU | 122755 tok/s step 1640/19560 | loss 3.997896 (-0.05z)| norm 0.2810 (-0.88z)| lr 7.98e-04 | 4395.15 ms | 30.7% bf16 MFU | 122582 tok/s step 1641/19560 | loss 3.946368 (-1.25z)| norm 0.2831 (-0.80z)| lr 7.98e-04 | 4228.15 ms | 31.9% bf16 MFU | 122653 tok/s step 1642/19560 | loss 4.020286 (+0.51z)| norm 0.2845 (-0.74z)| lr 7.98e-04 | 4252.02 ms | 31.8% bf16 MFU | 122685 tok/s step 1643/19560 | loss 3.956773 (-0.99z)| norm 
0.3105 (+0.07z)| lr 7.98e-04 | 4333.70 ms | 31.2% bf16 MFU | 122600 tok/s step 1644/19560 | loss 3.933525 (-1.53z)| norm 0.3055 (-0.09z)| lr 7.98e-04 | 4170.11 ms | 32.4% bf16 MFU | 122756 tok/s step 1645/19560 | loss 3.977257 (-0.48z)| norm 0.2845 (-0.74z)| lr 7.98e-04 | 4234.36 ms | 31.9% bf16 MFU | 122809 tok/s step 1646/19560 | loss 4.041146 (+1.02z)| norm 0.2844 (-0.74z)| lr 7.98e-04 | 4206.80 ms | 32.1% bf16 MFU | 122900 tok/s step 1647/19560 | loss 3.965321 (-0.77z)| norm 0.2607 (-1.46z)| lr 7.98e-04 | 4281.17 ms | 31.5% bf16 MFU | 122878 tok/s step 1648/19560 | loss 3.954178 (-1.02z)| norm 0.2674 (-1.25z)| lr 7.98e-04 | 4264.29 ms | 31.7% bf16 MFU | 122882 tok/s step 1649/19560 | loss 3.982894 (-0.34z)| norm 0.2602 (-1.45z)| lr 7.98e-04 | 4264.85 ms | 31.7% bf16 MFU | 122884 tok/s step 1650/19560 | loss 4.023983 (+0.65z)| norm 0.2845 (-0.71z)| lr 7.98e-04 | 4222.47 ms | 32.0% bf16 MFU | 122949 tok/s step 1651/19560 | loss 3.975083 (-0.51z)| norm 0.2937 (-0.43z)| lr 7.98e-04 | 4189.95 ms | 32.2% bf16 MFU | 123058 tok/s step 1652/19560 | loss 4.019245 (+0.56z)| norm 0.3017 (-0.18z)| lr 7.98e-04 | 4298.01 ms | 31.4% bf16 MFU | 123004 tok/s step 1653/19560 | loss 3.969229 (-0.64z)| norm 0.3101 (+0.08z)| lr 7.98e-04 | 4234.07 ms | 31.9% bf16 MFU | 123045 tok/s step 1654/19560 | loss 3.963915 (-0.76z)| norm 0.3467 (+1.19z)| lr 7.98e-04 | 4163.67 ms | 32.4% bf16 MFU | 123189 tok/s step 1655/19560 | loss 3.979630 (-0.37z)| norm 0.3993 (+2.70z)| lr 7.98e-04 | 4163.45 ms | 32.4% bf16 MFU | 123326 tok/s step 1656/19560 | loss 3.990219 (-0.11z)| norm 0.4498 (+3.93z)| lr 7.98e-04 | 4264.96 ms | 31.7% bf16 MFU | 123306 tok/s step 1657/19560 | loss 4.002541 (+0.20z)| norm 0.4050 (+2.61z)| lr 7.98e-04 | 4168.17 ms | 32.4% bf16 MFU | 123430 tok/s step 1658/19560 | loss 3.978223 (-0.41z)| norm 0.3151 (+0.15z)| lr 7.98e-04 | 4168.63 ms | 32.4% bf16 MFU | 123547 tok/s step 1659/19560 | loss 3.953770 (-1.02z)| norm 0.2991 (-0.28z)| lr 7.98e-04 | 4175.09 ms | 32.3% bf16 MFU | 
123648 tok/s step 1660/19560 | loss 3.994579 (+0.03z)| norm 0.2650 (-1.19z)| lr 7.98e-04 | 4314.67 ms | 31.3% bf16 MFU | 123541 tok/s step 1661/19560 | loss 3.961427 (-0.81z)| norm 0.2777 (-0.84z)| lr 7.98e-04 | 4187.31 ms | 32.2% bf16 MFU | 123625 tok/s step 1662/19560 | loss 3.904869 (-2.21z)| norm 0.2822 (-0.71z)| lr 7.97e-04 | 4176.45 ms | 32.3% bf16 MFU | 123720 tok/s step 1663/19560 | loss 3.939978 (-1.30z)| norm 0.2840 (-0.65z)| lr 7.97e-04 | 4167.35 ms | 32.4% bf16 MFU | 123825 tok/s step 1664/19560 | loss 3.975836 (-0.39z)| norm 0.2828 (-0.68z)| lr 7.97e-04 | 4169.81 ms | 32.4% bf16 MFU | 123920 tok/s step 1665/19560 | loss 3.963961 (-0.68z)| norm 0.2821 (-0.70z)| lr 7.97e-04 | 4166.57 ms | 32.4% bf16 MFU | 124016 tok/s step 1666/19560 | loss 3.989215 (-0.04z)| norm 0.2901 (-0.48z)| lr 7.97e-04 | 4209.56 ms | 32.1% bf16 MFU | 124042 tok/s step 1667/19560 | loss 4.040451 (+1.25z)| norm 0.2882 (-0.52z)| lr 7.97e-04 | 4368.29 ms | 30.9% bf16 MFU | 123841 tok/s step 1668/19560 | loss 3.961805 (-0.73z)| norm 0.3668 (+1.58z)| lr 7.97e-04 | 4211.27 ms | 32.1% bf16 MFU | 123874 tok/s step 1669/19560 | loss 4.026440 (+0.96z)| norm 0.3191 (+0.30z)| lr 7.97e-04 | 4338.79 ms | 31.1% bf16 MFU | 123722 tok/s step 1670/19560 | loss 3.936250 (-1.37z)| norm 0.2930 (-0.41z)| lr 7.97e-04 | 4316.94 ms | 31.3% bf16 MFU | 123609 tok/s step 1671/19560 | loss 4.020048 (+0.80z)| norm 0.3271 (+0.50z)| lr 7.97e-04 | 4233.30 ms | 31.9% bf16 MFU | 123621 tok/s step 1672/19560 | loss 3.956852 (-0.83z)| norm 0.2701 (-1.04z)| lr 7.97e-04 | 4177.01 ms | 32.3% bf16 MFU | 123715 tok/s step 1673/19560 | loss 3.921260 (-1.75z)| norm 0.2692 (-1.07z)| lr 7.97e-04 | 4288.48 ms | 31.5% bf16 MFU | 123642 tok/s step 1674/19560 | loss 3.954937 (-0.85z)| norm 0.3062 (-0.07z)| lr 7.97e-04 | 4174.57 ms | 32.3% bf16 MFU | 123740 tok/s step 1675/19560 | loss 3.984479 (-0.07z)| norm 0.2864 (-0.60z)| lr 7.97e-04 | 4235.08 ms | 31.9% bf16 MFU | 123743 tok/s step 1676/19560 | loss 3.960577 (-0.69z)| norm 
0.2468 (-1.65z)| lr 7.97e-04 | 4171.50 ms | 32.4% bf16 MFU | 123840 tok/s step 1677/19560 | loss 4.017011 (+0.79z)| norm 0.2703 (-1.00z)| lr 7.97e-04 | 4287.58 ms | 31.5% bf16 MFU | 123762 tok/s step 1678/19560 | loss 3.908370 (-2.02z)| norm 0.2868 (-0.54z)| lr 7.97e-04 | 4242.35 ms | 31.8% bf16 MFU | 123753 tok/s step 1679/19560 | loss 3.974549 (-0.30z)| norm 0.3138 (+0.22z)| lr 7.97e-04 | 4168.48 ms | 32.4% bf16 MFU | 123854 tok/s step 1680/19560 | loss 4.030379 (+1.20z)| norm 0.3575 (+1.44z)| lr 7.97e-04 | 4207.03 ms | 32.1% bf16 MFU | 123892 tok/s step 1681/19560 | loss 3.961583 (-0.63z)| norm 0.3013 (-0.12z)| lr 7.97e-04 | 4173.21 ms | 32.4% bf16 MFU | 123979 tok/s step 1682/19560 | loss 3.989887 (+0.14z)| norm 0.3423 (+1.05z)| lr 7.97e-04 | 4169.03 ms | 32.4% bf16 MFU | 124068 tok/s step 1683/19560 | loss 3.955092 (-0.79z)| norm 0.3408 (+1.01z)| lr 7.97e-04 | 4223.95 ms | 32.0% bf16 MFU | 124071 tok/s step 1684/19560 | loss 3.950285 (-0.91z)| norm 0.2958 (-0.26z)| lr 7.97e-04 | 4239.50 ms | 31.8% bf16 MFU | 124051 tok/s step 1685/19560 | loss 3.958463 (-0.67z)| norm 0.2989 (-0.16z)| lr 7.97e-04 | 4173.10 ms | 32.4% bf16 MFU | 124130 tok/s step 1686/19560 | loss 3.955974 (-0.73z)| norm 0.3115 (+0.21z)| lr 7.97e-04 | 4184.36 ms | 32.3% bf16 MFU | 124188 tok/s step 1687/19560 | loss 3.988144 (+0.15z)| norm 0.3421 (+1.09z)| lr 7.97e-04 | 4176.80 ms | 32.3% bf16 MFU | 124255 tok/s step 1688/19560 | loss 4.011950 (+0.81z)| norm 0.3294 (+0.71z)| lr 7.97e-04 | 4166.43 ms | 32.4% bf16 MFU | 124334 tok/s step 1689/19560 | loss 3.970120 (-0.33z)| norm 0.2952 (-0.28z)| lr 7.97e-04 | 4169.07 ms | 32.4% bf16 MFU | 124405 tok/s step 1690/19560 | loss 3.960973 (-0.57z)| norm 0.2883 (-0.47z)| lr 7.97e-04 | 4227.18 ms | 31.9% bf16 MFU | 124386 tok/s step 1691/19560 | loss 3.906938 (-2.05z)| norm 0.2870 (-0.51z)| lr 7.97e-04 | 4180.53 ms | 32.3% bf16 MFU | 124438 tok/s step 1692/19560 | loss 3.869297 (-2.96z)| norm 0.2767 (-0.80z)| lr 7.97e-04 | 4170.20 ms | 32.4% bf16 MFU | 
124502 tok/s step 1693/19560 | loss 3.945910 (-0.89z)| norm 0.3220 (+0.51z)| lr 7.97e-04 | 4169.97 ms | 32.4% bf16 MFU | 124563 tok/s step 1694/19560 | loss 3.928792 (-1.33z)| norm 0.2960 (-0.24z)| lr 7.97e-04 | 4258.34 ms | 31.7% bf16 MFU | 124491 tok/s step 1695/19560 | loss 3.966681 (-0.31z)| norm 0.3168 (+0.36z)| lr 7.97e-04 | 4235.40 ms | 31.9% bf16 MFU | 124456 tok/s step 1696/19560 | loss 3.953250 (-0.66z)| norm 0.2655 (-1.12z)| lr 7.97e-04 | 4163.85 ms | 32.4% bf16 MFU | 124529 tok/s step 1697/19560 | loss 3.964539 (-0.36z)| norm 0.2560 (-1.38z)| lr 7.97e-04 | 4176.25 ms | 32.3% bf16 MFU | 124579 tok/s step 1698/19560 | loss 3.930169 (-1.26z)| norm 0.2709 (-0.95z)| lr 7.97e-04 | 4169.93 ms | 32.4% bf16 MFU | 124637 tok/s step 1699/19560 | loss 3.944069 (-0.88z)| norm 0.3197 (+0.45z)| lr 7.97e-04 | 4164.31 ms | 32.4% bf16 MFU | 124700 tok/s step 1700/19560 | loss 4.012949 (+0.94z)| norm 0.3337 (+0.84z)| lr 7.97e-04 | 4173.79 ms | 32.3% bf16 MFU | 124746 tok/s step 1701/19560 | loss 3.938056 (-1.03z)| norm 0.6138 (+6.93z)| lr 7.97e-04 | 4254.63 ms | 31.7% bf16 MFU | 124670 tok/s step 1702/19560 | loss 3.912633 (-1.69z)| norm 0.3379 (+0.70z)| lr 7.97e-04 | 4177.20 ms | 32.3% bf16 MFU | 124712 tok/s step 1703/19560 | loss 3.949818 (-0.68z)| norm 0.3308 (+0.53z)| lr 7.97e-04 | 4179.22 ms | 32.3% bf16 MFU | 124749 tok/s step 1704/19560 | loss 3.920933 (-1.43z)| norm 0.3530 (+1.02z)| lr 7.97e-04 | 4223.84 ms | 32.0% bf16 MFU | 124718 tok/s step 1705/19560 | loss 3.996030 (+0.57z)| norm 0.4265 (+2.58z)| lr 7.97e-04 | 4239.83 ms | 31.8% bf16 MFU | 124665 tok/s step 1706/19560 | loss 4.039388 (+1.69z)| norm 0.4984 (+3.88z)| lr 7.97e-04 | 4182.78 ms | 32.3% bf16 MFU | 124699 tok/s step 1707/19560 | loss 3.988270 (+0.34z)| norm 0.5550 (+4.58z)| lr 7.97e-04 | 4186.76 ms | 32.2% bf16 MFU | 124725 tok/s step 1708/19560 | loss 3.979377 (+0.12z)| norm 0.5411 (+4.01z)| lr 7.97e-04 | 4221.55 ms | 32.0% bf16 MFU | 124699 tok/s step 1709/19560 | loss 4.062407 (+2.30z)| norm 
0.4841 (+2.87z)| lr 7.97e-04 | 4164.09 ms | 32.4% bf16 MFU | 124759 tok/s step 1710/19560 | loss 3.956162 (-0.50z)| norm 0.3720 (+0.94z)| lr 7.97e-04 | 4163.14 ms | 32.4% bf16 MFU | 124818 tok/s step 1711/19560 | loss 3.963122 (-0.30z)| norm 0.3238 (+0.12z)| lr 7.97e-04 | 4173.98 ms | 32.3% bf16 MFU | 124857 tok/s step 1712/19560 | loss 4.015069 (+1.11z)| norm 0.3175 (+0.01z)| lr 7.97e-04 | 4172.21 ms | 32.4% bf16 MFU | 124898 tok/s step 1713/19560 | loss 3.954004 (-0.53z)| norm 0.3187 (+0.04z)| lr 7.97e-04 | 4180.76 ms | 32.3% bf16 MFU | 124923 tok/s step 1714/19560 | loss 3.984908 (+0.33z)| norm 0.3223 (+0.10z)| lr 7.97e-04 | 4180.73 ms | 32.3% bf16 MFU | 124947 tok/s step 1715/19560 | loss 3.992719 (+0.56z)| norm 0.3011 (-0.26z)| lr 7.97e-04 | 4179.80 ms | 32.3% bf16 MFU | 124971 tok/s step 1716/19560 | loss 3.955011 (-0.49z)| norm 0.2616 (-0.92z)| lr 7.97e-04 | 4165.12 ms | 32.4% bf16 MFU | 125017 tok/s step 1717/19560 | loss 3.938101 (-0.96z)| norm 0.2928 (-0.39z)| lr 7.97e-04 | 4179.15 ms | 32.3% bf16 MFU | 125038 tok/s step 1718/19560 | loss 3.963530 (-0.24z)| norm 0.2629 (-0.89z)| lr 7.97e-04 | 4175.98 ms | 32.3% bf16 MFU | 125064 tok/s step 1719/19560 | loss 3.968780 (-0.09z)| norm 0.2397 (-1.26z)| lr 7.97e-04 | 4179.94 ms | 32.3% bf16 MFU | 125082 tok/s step 1720/19560 | loss 3.944197 (-0.77z)| norm 0.2438 (-1.18z)| lr 7.97e-04 | 4523.64 ms | 29.8% bf16 MFU | 124623 tok/s step 1721/19560 | loss 3.949869 (-0.60z)| norm 0.2458 (-1.13z)| lr 7.97e-04 | 4177.80 ms | 32.3% bf16 MFU | 124667 tok/s step 1722/19560 | loss 3.962893 (-0.22z)| norm 0.2585 (-0.90z)| lr 7.97e-04 | 4181.24 ms | 32.3% bf16 MFU | 124703 tok/s step 1723/19560 | loss 3.917548 (-1.49z)| norm 0.2505 (-1.03z)| lr 7.97e-04 | 4191.27 ms | 32.2% bf16 MFU | 124722 tok/s step 1724/19560 | loss 4.023209 (+1.50z)| norm 0.2655 (-0.76z)| lr 7.97e-04 | 4183.41 ms | 32.3% bf16 MFU | 124752 tok/s step 1725/19560 | loss 3.944153 (-0.74z)| norm 0.2935 (-0.29z)| lr 7.97e-04 | 4163.22 ms | 32.4% bf16 MFU | 
124811 tok/s step 1726/19560 | loss 4.007087 (+1.03z)| norm 0.3022 (-0.15z)| lr 7.97e-04 | 4170.71 ms | 32.4% bf16 MFU | 124856 tok/s step 1727/19560 | loss 3.949582 (-0.59z)| norm 0.2946 (-0.26z)| lr 7.97e-04 | 4173.00 ms | 32.4% bf16 MFU | 124895 tok/s step 1728/19560 | loss 3.990211 (+0.55z)| norm 0.2547 (-0.92z)| lr 7.97e-04 | 4159.70 ms | 32.5% bf16 MFU | 124953 tok/s step 1729/19560 | loss 3.899448 (-1.97z)| norm 0.2677 (-0.69z)| lr 7.97e-04 | 4173.87 ms | 32.3% bf16 MFU | 124986 tok/s step 1730/19560 | loss 3.874204 (-2.58z)| norm 0.2810 (-0.46z)| lr 7.97e-04 | 4167.59 ms | 32.4% bf16 MFU | 125026 tok/s step 1731/19560 | loss 3.908819 (-1.61z)| norm 0.2598 (-0.81z)| lr 7.97e-04 | 4165.15 ms | 32.4% bf16 MFU | 125069 tok/s step 1732/19560 | loss 3.969841 (+0.03z)| norm 0.2926 (-0.26z)| lr 7.97e-04 | 4172.81 ms | 32.4% bf16 MFU | 125097 tok/s step 1733/19560 | loss 3.918683 (-1.33z)| norm 0.2676 (-0.68z)| lr 7.97e-04 | 4167.62 ms | 32.4% bf16 MFU | 125133 tok/s step 1734/19560 | loss 3.947678 (-0.55z)| norm 0.2692 (-0.65z)| lr 7.97e-04 | 4362.90 ms | 30.9% bf16 MFU | 124884 tok/s step 1735/19560 | loss 3.959389 (-0.22z)| norm 0.2948 (-0.22z)| lr 7.97e-04 | 4177.47 ms | 32.3% bf16 MFU | 124915 tok/s step 1736/19560 | loss 3.952466 (-0.40z)| norm 0.2748 (-0.55z)| lr 7.97e-04 | 4161.45 ms | 32.4% bf16 MFU | 124969 tok/s step 1737/19560 | loss 3.985739 (+0.52z)| norm 0.2625 (-0.74z)| lr 7.97e-04 | 4169.30 ms | 32.4% bf16 MFU | 125008 tok/s step 1738/19560 | loss 3.898229 (-1.84z)| norm 0.2534 (-0.89z)| lr 7.97e-04 | 4162.46 ms | 32.4% bf16 MFU | 125055 tok/s step 1739/19560 | loss 3.925231 (-1.10z)| norm 0.2687 (-0.63z)| lr 7.97e-04 | 4172.58 ms | 32.4% bf16 MFU | 125085 tok/s step 1740/19560 | loss 3.913033 (-1.41z)| norm 0.2634 (-0.70z)| lr 7.97e-04 | 4238.79 ms | 31.9% bf16 MFU | 125015 tok/s step 1741/19560 | loss 3.933964 (-0.83z)| norm 0.2572 (-0.80z)| lr 7.97e-04 | 4165.15 ms | 32.4% bf16 MFU | 125058 tok/s step 1742/19560 | loss 3.925114 (-1.05z)| norm 
0.2728 (-0.53z)| lr 7.97e-04 | 4261.82 ms | 31.7% bf16 MFU | 124956 tok/s step 1743/19560 | loss 3.935124 (-0.78z)| norm 0.2711 (-0.55z)| lr 7.97e-04 | 4185.12 ms | 32.3% bf16 MFU | 124972 tok/s step 1744/19560 | loss 3.978724 (+0.37z)| norm 0.3082 (+0.06z)| lr 7.97e-04 | 4168.68 ms | 32.4% bf16 MFU | 125012 tok/s step 1745/19560 | loss 4.007586 (+1.13z)| norm 0.2940 (-0.18z)| lr 7.97e-04 | 4181.18 ms | 32.3% bf16 MFU | 125031 tok/s step 1746/19560 | loss 4.005314 (+1.06z)| norm 0.3196 (+0.24z)| lr 7.97e-04 | 4215.78 ms | 32.0% bf16 MFU | 124998 tok/s step 1747/19560 | loss 3.993396 (+0.74z)| norm 0.3119 (+0.10z)| lr 7.97e-04 | 4259.44 ms | 31.7% bf16 MFU | 124902 tok/s step 1748/19560 | loss 3.952996 (-0.32z)| norm 0.2944 (-0.19z)| lr 7.97e-04 | 4177.67 ms | 32.3% bf16 MFU | 124932 tok/s step 1749/19560 | loss 4.017467 (+1.36z)| norm 0.3466 (+0.67z)| lr 7.97e-04 | 4173.12 ms | 32.4% bf16 MFU | 124967 tok/s step 1750/19560 | loss 3.932611 (-0.85z)| norm 0.3149 (+0.13z)| lr 7.97e-04 | 4176.42 ms | 32.3% bf16 MFU | 124996 tok/s val loss 3.926908
HellaSwag: 2645/10042 = 0.263394 step 1751/19560 | loss 3.883038 (-2.09z)| norm 0.2994 (-0.14z)| lr 7.97e-04 | 4183.25 ms | 32.3% bf16 MFU | 125012 tok/s step 1752/19560 | loss 3.883247 (-2.04z)| norm 0.2948 (-0.23z)| lr 7.97e-04 | 4288.13 ms | 31.5% bf16 MFU | 124875 tok/s step 1753/19560 | loss 3.940136 (-0.60z)| norm 0.2741 (-0.58z)| lr 7.97e-04 | 4172.57 ms | 32.4% bf16 MFU | 124914 tok/s step 1754/19560 | loss 3.936274 (-0.70z)| norm 0.2615 (-0.78z)| lr 7.97e-04 | 4176.78 ms | 32.3% bf16 MFU | 124944 tok/s step 1755/19560 | loss 3.916560 (-1.20z)| norm 0.3274 (+0.32z)| lr 7.97e-04 | 4172.52 ms | 32.4% bf16 MFU | 124980 tok/s step 1756/19560 | loss 3.953312 (-0.28z)| norm 0.3129 (+0.08z)| lr 7.97e-04 | 4190.73 ms | 32.2% bf16 MFU | 124986 tok/s step 1757/19560 | loss 3.975837 (+0.30z)| norm 0.3038 (-0.07z)| lr 7.97e-04 | 4258.89 ms | 31.7% bf16 MFU | 124892 tok/s step 1758/19560 | loss 3.985531 (+0.53z)| norm 0.3048 (-0.05z)| lr 7.97e-04 | 4300.74 ms | 31.4% bf16 MFU | 124743 tok/s 
step 1759/19560 | loss 3.922798 (-1.05z)| norm 0.3474 (+0.66z)| lr 7.97e-04 | 4229.72 ms | 31.9% bf16 MFU | 124703 tok/s step 1760/19560 | loss 3.947456 (-0.42z)| norm 0.3131 (+0.09z)| lr 7.97e-04 | 4191.95 ms | 32.2% bf16 MFU | 124722 tok/s step 1761/19560 | loss 3.921727 (-1.06z)| norm 0.2885 (-0.33z)| lr 7.97e-04 | 4259.59 ms | 31.7% bf16 MFU | 124640 tok/s step 1762/19560 | loss 3.984749 (+0.53z)| norm 0.3235 (+0.26z)| lr 7.97e-04 | 4208.27 ms | 32.1% bf16 MFU | 124637 tok/s step 1763/19560 | loss 3.988269 (+0.62z)| norm 0.3360 (+0.46z)| lr 7.97e-04 | 4331.09 ms | 31.2% bf16 MFU | 124458 tok/s step 1764/19560 | loss 4.002851 (+0.97z)| norm 0.2838 (-0.41z)| lr 7.97e-04 | 4269.78 ms | 31.6% bf16 MFU | 124374 tok/s step 1765/19560 | loss 3.972813 (+0.25z)| norm 0.3082 (+0.00z)| lr 7.97e-04 | 4167.85 ms | 32.4% bf16 MFU | 124445 tok/s step 1766/19560 | loss 3.920510 (-1.14z)| norm 0.2948 (-0.22z)| lr 7.97e-04 | 4172.99 ms | 32.4% bf16 MFU | 124505 tok/s step 1767/19560 | loss 3.947432 (-0.42z)| norm 0.2537 (-0.92z)| lr 7.97e-04 | 4290.15 ms | 31.5% bf16 MFU | 124390 tok/s step 1768/19560 | loss 3.949425 (-0.36z)| norm 0.2619 (-0.78z)| lr 7.97e-04 | 4184.34 ms | 32.3% bf16 MFU | 124435 tok/s step 1769/19560 | loss 3.939893 (-0.61z)| norm 0.2542 (-0.90z)| lr 7.97e-04 | 4322.43 ms | 31.2% bf16 MFU | 124278 tok/s step 1770/19560 | loss 3.923817 (-1.03z)| norm 0.2686 (-0.65z)| lr 7.97e-04 | 4162.56 ms | 32.4% bf16 MFU | 124362 tok/s step 1771/19560 | loss 3.875282 (-2.28z)| norm 0.2925 (-0.25z)| lr 7.97e-04 | 4172.76 ms | 32.4% bf16 MFU | 124426 tok/s step 1772/19560 | loss 3.896361 (-1.70z)| norm 0.3037 (-0.06z)| lr 7.97e-04 | 4249.91 ms | 31.8% bf16 MFU | 124373 tok/s step 1773/19560 | loss 3.932835 (-0.73z)| norm 0.3087 (+0.02z)| lr 7.97e-04 | 4164.57 ms | 32.4% bf16 MFU | 124449 tok/s step 1774/19560 | loss 3.931850 (-0.74z)| norm 0.2759 (-0.53z)| lr 7.97e-04 | 4159.43 ms | 32.5% bf16 MFU | 124529 tok/s step 1775/19560 | loss 3.968403 (+0.23z)| norm 0.3006 (-0.12z)| 
lr 7.97e-04 | 4272.08 ms | 31.6% bf16 MFU | 124439 tok/s step 1776/19560 | loss 3.912167 (-1.25z)| norm 0.3192 (+0.19z)| lr 7.97e-04 | 4168.35 ms | 32.4% bf16 MFU | 124506 tok/s step 1777/19560 | loss 3.957009 (-0.06z)| norm 0.3104 (+0.03z)| lr 7.97e-04 | 4179.12 ms | 32.3% bf16 MFU | 124553 tok/s step 1778/19560 | loss 3.936744 (-0.58z)| norm 0.2806 (-0.47z)| lr 7.97e-04 | 4168.78 ms | 32.4% bf16 MFU | 124614 tok/s step 1779/19560 | loss 3.909295 (-1.30z)| norm 0.2656 (-0.72z)| lr 7.97e-04 | 4170.17 ms | 32.4% bf16 MFU | 124669 tok/s step 1780/19560 | loss 3.986010 (+0.76z)| norm 0.2990 (-0.16z)| lr 7.97e-04 | 4200.03 ms | 32.1% bf16 MFU | 124677 tok/s step 1781/19560 | loss 3.918353 (-1.04z)| norm 0.2455 (-1.05z)| lr 7.97e-04 | 4192.93 ms | 32.2% bf16 MFU | 124696 tok/s step 1782/19560 | loss 3.930474 (-0.71z)| norm 0.2415 (-1.10z)| lr 7.97e-04 | 4180.27 ms | 32.3% bf16 MFU | 124732 tok/s step 1783/19560 | loss 3.946752 (-0.27z)| norm 0.2565 (-0.83z)| lr 7.97e-04 | 4188.99 ms | 32.2% bf16 MFU | 124753 tok/s step 1784/19560 | loss 3.906870 (-1.31z)| norm 0.2623 (-0.73z)| lr 7.96e-04 | 4160.13 ms | 32.5% bf16 MFU | 124817 tok/s step 1785/19560 | loss 3.962751 (+0.19z)| norm 0.2755 (-0.49z)| lr 7.96e-04 | 4178.37 ms | 32.3% bf16 MFU | 124850 tok/s step 1786/19560 | loss 3.983375 (+0.74z)| norm 0.3314 (+0.49z)| lr 7.96e-04 | 4172.06 ms | 32.4% bf16 MFU | 124891 tok/s step 1787/19560 | loss 3.906502 (-1.31z)| norm 0.3170 (+0.23z)| lr 7.96e-04 | 4218.94 ms | 32.0% bf16 MFU | 124860 tok/s step 1788/19560 | loss 3.928508 (-0.71z)| norm 0.3200 (+0.28z)| lr 7.96e-04 | 4183.25 ms | 32.3% bf16 MFU | 124883 tok/s step 1789/19560 | loss 3.878547 (-1.99z)| norm 0.3094 (+0.09z)| lr 7.96e-04 | 4179.12 ms | 32.3% bf16 MFU | 124912 tok/s step 1790/19560 | loss 3.881761 (-1.89z)| norm 0.2881 (-0.28z)| lr 7.96e-04 | 4171.89 ms | 32.4% bf16 MFU | 124950 tok/s step 1791/19560 | loss 3.941253 (-0.34z)| norm 0.2916 (-0.22z)| lr 7.96e-04 | 4166.24 ms | 32.4% bf16 MFU | 124994 tok/s step 
1792/19560 | loss 3.896421 (-1.48z)| norm 0.3127 (+0.14z)| lr 7.96e-04 | 4172.72 ms | 32.4% bf16 MFU | 125027 tok/s step 1793/19560 | loss 3.889647 (-1.62z)| norm 0.3344 (+0.51z)| lr 7.96e-04 | 4168.47 ms | 32.4% bf16 MFU | 125064 tok/s step 1794/19560 | loss 3.902066 (-1.28z)| norm 0.3313 (+0.45z)| lr 7.96e-04 | 4182.02 ms | 32.3% bf16 MFU | 125079 tok/s step 1795/19560 | loss 3.927783 (-0.62z)| norm 0.3695 (+1.11z)| lr 7.96e-04 | 4173.46 ms | 32.4% bf16 MFU | 125107 tok/s step 1796/19560 | loss 3.969651 (+0.47z)| norm 0.3581 (+0.91z)| lr 7.96e-04 | 4177.22 ms | 32.3% bf16 MFU | 125127 tok/s step 1797/19560 | loss 3.958773 (+0.21z)| norm 0.2742 (-0.55z)| lr 7.96e-04 | 4179.16 ms | 32.3% bf16 MFU | 125143 tok/s step 1798/19560 | loss 3.948954 (-0.06z)| norm 0.2722 (-0.58z)| lr 7.96e-04 | 4185.61 ms | 32.3% bf16 MFU | 125149 tok/s step 1799/19560 | loss 3.919991 (-0.81z)| norm 0.3129 (+0.13z)| lr 7.96e-04 | 4179.19 ms | 32.3% bf16 MFU | 125164 tok/s step 1800/19560 | loss 3.967030 (+0.45z)| norm 0.3133 (+0.13z)| lr 7.96e-04 | 4177.85 ms | 32.3% bf16 MFU | 125181 tok/s step 1801/19560 | loss 3.940668 (-0.26z)| norm 0.2969 (-0.16z)| lr 7.96e-04 | 4171.49 ms | 32.4% bf16 MFU | 125206 tok/s step 1802/19560 | loss 3.964418 (+0.37z)| norm 0.2591 (-0.81z)| lr 7.96e-04 | 4167.87 ms | 32.4% bf16 MFU | 125235 tok/s step 1803/19560 | loss 3.907689 (-1.13z)| norm 0.2892 (-0.28z)| lr 7.96e-04 | 4172.12 ms | 32.4% bf16 MFU | 125257 tok/s step 1804/19560 | loss 3.898466 (-1.35z)| norm 0.2950 (-0.19z)| lr 7.96e-04 | 4171.06 ms | 32.4% bf16 MFU | 125279 tok/s step 1805/19560 | loss 3.911909 (-0.98z)| norm 0.2928 (-0.23z)| lr 7.96e-04 | 4164.54 ms | 32.4% bf16 MFU | 125309 tok/s step 1806/19560 | loss 3.968859 (+0.53z)| norm 0.2690 (-0.65z)| lr 7.96e-04 | 4175.79 ms | 32.3% bf16 MFU | 125322 tok/s step 1807/19560 | loss 3.900897 (-1.27z)| norm 0.2894 (-0.29z)| lr 7.96e-04 | 4173.04 ms | 32.4% bf16 MFU | 125337 tok/s step 1808/19560 | loss 3.931365 (-0.45z)| norm 0.2967 (-0.15z)| lr 
7.96e-04 | 4180.51 ms | 32.3% bf16 MFU | 125341 tok/s step 1809/19560 | loss 3.956760 (+0.24z)| norm 0.3197 (+0.25z)| lr 7.96e-04 | 4164.53 ms | 32.4% bf16 MFU | 125369 tok/s step 1810/19560 | loss 3.891271 (-1.51z)| norm 0.3060 (+0.02z)| lr 7.96e-04 | 4164.57 ms | 32.4% bf16 MFU | 125395 tok/s step 1811/19560 | loss 3.994049 (+1.26z)| norm 0.3112 (+0.11z)| lr 7.96e-04 | 4170.87 ms | 32.4% bf16 MFU | 125410 tok/s step 1812/19560 | loss 3.938792 (-0.23z)| norm 0.2809 (-0.42z)| lr 7.96e-04 | 4178.10 ms | 32.3% bf16 MFU | 125414 tok/s step 1813/19560 | loss 4.000597 (+1.42z)| norm 0.4990 (+3.25z)| lr 7.96e-04 | 4188.24 ms | 32.2% bf16 MFU | 125402 tok/s step 1814/19560 | loss 3.925018 (-0.59z)| norm 0.3473 (+0.68z)| lr 7.96e-04 | 4165.26 ms | 32.4% bf16 MFU | 125426 tok/s step 1815/19560 | loss 3.954411 (+0.20z)| norm 0.3155 (+0.15z)| lr 7.96e-04 | 4285.47 ms | 31.5% bf16 MFU | 125272 tok/s step 1816/19560 | loss 3.947060 (+0.01z)| norm 0.3345 (+0.47z)| lr 7.96e-04 | 4305.57 ms | 31.4% bf16 MFU | 125096 tok/s step 1817/19560 | loss 3.876522 (-1.86z)| norm 0.3402 (+0.56z)| lr 7.96e-04 | 4158.39 ms | 32.5% bf16 MFU | 125146 tok/s step 1818/19560 | loss 3.904267 (-1.10z)| norm 0.3666 (+0.99z)| lr 7.96e-04 | 4170.70 ms | 32.4% bf16 MFU | 125174 tok/s step 1819/19560 | loss 3.936502 (-0.24z)| norm 0.3164 (+0.15z)| lr 7.96e-04 | 4157.31 ms | 32.5% bf16 MFU | 125221 tok/s step 1820/19560 | loss 3.900285 (-1.23z)| norm 0.3283 (+0.34z)| lr 7.96e-04 | 4179.61 ms | 32.3% bf16 MFU | 125232 tok/s step 1821/19560 | loss 4.065482 (+3.09z)| norm 0.3197 (+0.20z)| lr 7.96e-04 | 4165.67 ms | 32.4% bf16 MFU | 125263 tok/s step 1822/19560 | loss 3.908923 (-0.98z)| norm 0.2746 (-0.56z)| lr 7.96e-04 | 4678.06 ms | 28.9% bf16 MFU | 124603 tok/s step 1823/19560 | loss 3.863988 (-2.09z)| norm 0.2746 (-0.55z)| lr 7.96e-04 | 4654.64 ms | 29.0% bf16 MFU | 124005 tok/s step 1824/19560 | loss 3.845912 (-2.47z)| norm 0.2778 (-0.50z)| lr 7.96e-04 | 4576.08 ms | 29.5% bf16 MFU | 123533 tok/s step 
1825/19560 | loss 3.812581 (-3.15z)| norm 0.2586 (-0.82z)| lr 7.96e-04 | 4620.54 ms | 29.2% bf16 MFU | 123030 tok/s step 1826/19560 | loss 3.793259 (-3.42z)| norm 0.2792 (-0.48z)| lr 7.96e-04 | 4687.54 ms | 28.8% bf16 MFU | 122471 tok/s step 1827/19560 | loss 3.869513 (-1.64z)| norm 0.2978 (-0.16z)| lr 7.96e-04 | 4536.59 ms | 29.8% bf16 MFU | 122126 tok/s step 1828/19560 | loss 3.875547 (-1.48z)| norm 0.3331 (+0.43z)| lr 7.96e-04 | 4266.00 ms | 31.6% bf16 MFU | 122165 tok/s step 1829/19560 | loss 3.888981 (-1.16z)| norm 0.3656 (+1.12z)| lr 7.96e-04 | 4444.95 ms | 30.4% bf16 MFU | 121954 tok/s step 1830/19560 | loss 3.962625 (+0.49z)| norm 0.3155 (+0.19z)| lr 7.96e-04 | 4316.66 ms | 31.3% bf16 MFU | 121929 tok/s step 1831/19560 | loss 3.835910 (-2.30z)| norm 0.3148 (+0.18z)| lr 7.96e-04 | 4211.38 ms | 32.1% bf16 MFU | 122057 tok/s step 1832/19560 | loss 3.803455 (-2.90z)| norm 0.2736 (-0.58z)| lr 7.96e-04 | 4325.00 ms | 31.2% bf16 MFU | 122016 tok/s step 1833/19560 | loss 3.921932 (-0.36z)| norm 0.2796 (-0.46z)| lr 7.96e-04 | 4440.30 ms | 30.4% bf16 MFU | 121819 tok/s step 1834/19560 | loss 3.784238 (-3.20z)| norm 0.2688 (-0.67z)| lr 7.96e-04 | 4263.44 ms | 31.7% bf16 MFU | 121876 tok/s step 1835/19560 | loss 3.894197 (-0.88z)| norm 0.2867 (-0.29z)| lr 7.96e-04 | 4181.48 ms | 32.3% bf16 MFU | 122052 tok/s step 1836/19560 | loss 3.953839 (+0.38z)| norm 0.2718 (-0.67z)| lr 7.96e-04 | 4183.08 ms | 32.3% bf16 MFU | 122216 tok/s step 1837/19560 | loss 3.972405 (+0.81z)| norm 0.2730 (-0.66z)| lr 7.96e-04 | 4233.32 ms | 31.9% bf16 MFU | 122297 tok/s step 1838/19560 | loss 3.826827 (-2.27z)| norm 0.2693 (-0.76z)| lr 7.96e-04 | 4244.43 ms | 31.8% bf16 MFU | 122359 tok/s step 1839/19560 | loss 3.875150 (-1.23z)| norm 0.2955 (+0.02z)| lr 7.96e-04 | 4381.17 ms | 30.8% bf16 MFU | 122224 tok/s step 1840/19560 | loss 3.912335 (-0.43z)| norm 0.2448 (-1.45z)| lr 7.96e-04 | 4173.43 ms | 32.4% bf16 MFU | 122394 tok/s step 1841/19560 | loss 3.844389 (-1.84z)| norm 0.2676 (-0.77z)| lr 
7.96e-04 | 4193.82 ms | 32.2% bf16 MFU | 122525 tok/s step 1842/19560 | loss 3.870443 (-1.27z)| norm 0.2683 (-0.74z)| lr 7.96e-04 | 4180.93 ms | 32.3% bf16 MFU | 122669 tok/s step 1843/19560 | loss 3.917687 (-0.26z)| norm 0.2966 (+0.09z)| lr 7.96e-04 | 4187.62 ms | 32.2% bf16 MFU | 122796 tok/s step 1844/19560 | loss 3.875671 (-1.14z)| norm 0.2672 (-0.77z)| lr 7.96e-04 | 4202.85 ms | 32.1% bf16 MFU | 122893 tok/s step 1845/19560 | loss 3.885781 (-0.91z)| norm 0.2709 (-0.66z)| lr 7.96e-04 | 4167.67 ms | 32.4% bf16 MFU | 123038 tok/s step 1846/19560 | loss 3.891276 (-0.78z)| norm 0.2976 (+0.12z)| lr 7.96e-04 | 4196.59 ms | 32.2% bf16 MFU | 123133 tok/s step 1847/19560 | loss 3.878683 (-1.03z)| norm 0.2836 (-0.31z)| lr 7.96e-04 | 4202.91 ms | 32.1% bf16 MFU | 123214 tok/s step 1848/19560 | loss 3.898259 (-0.61z)| norm 0.2799 (-0.43z)| lr 7.96e-04 | 4259.47 ms | 31.7% bf16 MFU | 123207 tok/s step 1849/19560 | loss 3.862540 (-1.34z)| norm 0.2828 (-0.35z)| lr 7.96e-04 | 4192.74 ms | 32.2% bf16 MFU | 123299 tok/s step 1850/19560 | loss 3.863751 (-1.29z)| norm 0.2841 (-0.32z)| lr 7.96e-04 | 4259.47 ms | 31.7% bf16 MFU | 123289 tok/s step 1851/19560 | loss 3.918480 (-0.16z)| norm 0.3000 (+0.15z)| lr 7.96e-04 | 4328.59 ms | 31.2% bf16 MFU | 123180 tok/s step 1852/19560 | loss 3.985061 (+1.24z)| norm 0.3091 (+0.42z)| lr 7.96e-04 | 4171.07 ms | 32.4% bf16 MFU | 123306 tok/s step 1853/19560 | loss 3.829578 (-1.97z)| norm 0.2955 (+0.00z)| lr 7.96e-04 | 4171.67 ms | 32.4% bf16 MFU | 123425 tok/s step 1854/19560 | loss 3.920078 (-0.09z)| norm 0.3119 (+0.50z)| lr 7.96e-04 | 4196.37 ms | 32.2% bf16 MFU | 123500 tok/s step 1855/19560 | loss 3.954282 (+0.63z)| norm 0.2904 (-0.15z)| lr 7.96e-04 | 4216.67 ms | 32.0% bf16 MFU | 123542 tok/s step 1856/19560 | loss 3.833303 (-1.86z)| norm 0.2800 (-0.48z)| lr 7.96e-04 | 4281.79 ms | 31.5% bf16 MFU | 123487 tok/s step 1857/19560 | loss 3.934878 (+0.24z)| norm 0.2942 (-0.05z)| lr 7.96e-04 | 4187.99 ms | 32.2% bf16 MFU | 123572 tok/s step 
1858/19560 | loss 3.929305 (+0.12z)| norm 0.3006 (+0.14z)| lr 7.96e-04 | 4199.60 ms | 32.2% bf16 MFU | 123636 tok/s step 1859/19560 | loss 3.859539 (-1.32z)| norm 0.3027 (+0.20z)| lr 7.96e-04 | 4173.80 ms | 32.3% bf16 MFU | 123735 tok/s step 1860/19560 | loss 3.883039 (-0.82z)| norm 0.3128 (+0.50z)| lr 7.96e-04 | 4183.30 ms | 32.3% bf16 MFU | 123815 tok/s step 1861/19560 | loss 3.897728 (-0.51z)| norm 0.3405 (+1.34z)| lr 7.96e-04 | 4195.20 ms | 32.2% bf16 MFU | 123872 tok/s step 1862/19560 | loss 3.853005 (-1.41z)| norm 0.3905 (+2.77z)| lr 7.96e-04 | 4175.27 ms | 32.3% bf16 MFU | 123957 tok/s step 1863/19560 | loss 3.805871 (-2.31z)| norm 0.3468 (+1.44z)| lr 7.96e-04 | 4188.05 ms | 32.2% bf16 MFU | 124019 tok/s step 1864/19560 | loss 3.921319 (+0.02z)| norm 0.2971 (-0.05z)| lr 7.96e-04 | 4210.16 ms | 32.1% bf16 MFU | 124044 tok/s step 1865/19560 | loss 3.893439 (-0.53z)| norm 0.2584 (-1.19z)| lr 7.96e-04 | 4176.56 ms | 32.3% bf16 MFU | 124119 tok/s step 1866/19560 | loss 3.908709 (-0.22z)| norm 0.2665 (-0.96z)| lr 7.96e-04 | 4207.28 ms | 32.1% bf16 MFU | 124143 tok/s step 1867/19560 | loss 4.020951 (+2.01z)| norm 0.2732 (-0.76z)| lr 7.96e-04 | 4180.66 ms | 32.3% bf16 MFU | 124207 tok/s step 1868/19560 | loss 3.832806 (-1.72z)| norm 0.2964 (-0.08z)| lr 7.96e-04 | 4179.76 ms | 32.3% bf16 MFU | 124268 tok/s step 1869/19560 | loss 3.822199 (-1.89z)| norm 0.3029 (+0.11z)| lr 7.96e-04 | 4187.75 ms | 32.2% bf16 MFU | 124314 tok/s step 1870/19560 | loss 3.861270 (-1.11z)| norm 0.3548 (+1.64z)| lr 7.96e-04 | 4190.52 ms | 32.2% bf16 MFU | 124354 tok/s step 1871/19560 | loss 3.908649 (-0.19z)| norm 0.3463 (+1.36z)| lr 7.96e-04 | 4184.66 ms | 32.3% bf16 MFU | 124401 tok/s step 1872/19560 | loss 3.848080 (-1.34z)| norm 0.3096 (+0.27z)| lr 7.96e-04 | 4258.73 ms | 31.7% bf16 MFU | 124336 tok/s step 1873/19560 | loss 3.859423 (-1.11z)| norm 0.2475 (-1.56z)| lr 7.96e-04 | 4203.84 ms | 32.1% bf16 MFU | 124355 tok/s step 1874/19560 | loss 3.910051 (-0.10z)| norm 0.2399 (-1.74z)| lr 
7.96e-04 | 4199.86 ms | 32.1% bf16 MFU | 124379 tok/s step 1875/19560 | loss 3.871737 (-0.85z)| norm 0.2662 (-0.96z)| lr 7.96e-04 | 4179.85 ms | 32.3% bf16 MFU | 124432 tok/s step 1876/19560 | loss 3.841414 (-1.43z)| norm 0.2530 (-1.33z)| lr 7.96e-04 | 4206.89 ms | 32.1% bf16 MFU | 124442 tok/s step 1877/19560 | loss 3.860693 (-1.03z)| norm 0.2574 (-1.18z)| lr 7.96e-04 | 4187.85 ms | 32.2% bf16 MFU | 124479 tok/s step 1878/19560 | loss 3.824953 (-1.71z)| norm 0.2560 (-1.20z)| lr 7.96e-04 | 4193.98 ms | 32.2% bf16 MFU | 124506 tok/s step 1879/19560 | loss 3.802472 (-2.11z)| norm 0.2405 (-1.62z)| lr 7.96e-04 | 4197.57 ms | 32.2% bf16 MFU | 124526 tok/s step 1880/19560 | loss 4.013439 (+1.95z)| norm 0.2559 (-1.16z)| lr 7.96e-04 | 4194.86 ms | 32.2% bf16 MFU | 124549 tok/s step 1881/19560 | loss 3.850268 (-1.16z)| norm 0.2609 (-1.02z)| lr 7.96e-04 | 4200.88 ms | 32.1% bf16 MFU | 124561 tok/s step 1882/19560 | loss 3.856183 (-1.04z)| norm 0.2420 (-1.54z)| lr 7.96e-04 | 4196.36 ms | 32.2% bf16 MFU | 124580 tok/s step 1883/19560 | loss 3.810034 (-1.87z)| norm 0.2559 (-1.13z)| lr 7.96e-04 | 4209.25 ms | 32.1% bf16 MFU | 124579 tok/s step 1884/19560 | loss 4.008335 (+1.83z)| norm 0.2888 (-0.20z)| lr 7.96e-04 | 4197.09 ms | 32.2% bf16 MFU | 124596 tok/s step 1885/19560 | loss 3.904304 (-0.10z)| norm 0.3144 (+0.52z)| lr 7.96e-04 | 4184.25 ms | 32.3% bf16 MFU | 124631 tok/s step 1886/19560 | loss 3.822801 (-1.60z)| norm 0.2839 (-0.34z)| lr 7.96e-04 | 4203.79 ms | 32.1% bf16 MFU | 124635 tok/s step 1887/19560 | loss 3.860204 (-0.89z)| norm 0.2680 (-0.77z)| lr 7.96e-04 | 4175.64 ms | 32.3% bf16 MFU | 124682 tok/s step 1888/19560 | loss 3.854889 (-0.97z)| norm 0.3115 (+0.46z)| lr 7.96e-04 | 4215.08 ms | 32.0% bf16 MFU | 124667 tok/s step 1889/19560 | loss 3.926440 (+0.36z)| norm 0.2764 (-0.53z)| lr 7.95e-04 | 4188.59 ms | 32.2% bf16 MFU | 124692 tok/s step 1890/19560 | loss 3.932635 (+0.49z)| norm 0.2613 (-0.94z)| lr 7.95e-04 | 4170.99 ms | 32.4% bf16 MFU | 124742 tok/s step 
1891/19560 | loss 3.862937 (-0.81z)| norm 0.2489 (-1.27z)| lr 7.95e-04 | 5890.79 ms | 22.9% bf16 MFU | 122955 tok/s step 1892/19560 | loss 3.885096 (-0.38z)| norm 0.2656 (-0.80z)| lr 7.95e-04 | 4186.19 ms | 32.3% bf16 MFU | 123070 tok/s step 1893/19560 | loss 3.907516 (+0.06z)| norm 0.3019 (+0.23z)| lr 7.95e-04 | 4219.36 ms | 32.0% bf16 MFU | 123129 tok/s step 1894/19560 | loss 3.886682 (-0.33z)| norm 0.3210 (+0.76z)| lr 7.95e-04 | 4173.80 ms | 32.3% bf16 MFU | 123253 tok/s step 1895/19560 | loss 3.879469 (-0.46z)| norm 0.3243 (+0.84z)| lr 7.95e-04 | 4193.60 ms | 32.2% bf16 MFU | 123342 tok/s step 1896/19560 | loss 3.888696 (-0.28z)| norm 0.3419 (+1.31z)| lr 7.95e-04 | 4280.01 ms | 31.5% bf16 MFU | 123299 tok/s step 1897/19560 | loss 3.831166 (-1.36z)| norm 0.3999 (+2.82z)| lr 7.95e-04 | 4195.65 ms | 32.2% bf16 MFU | 123382 tok/s step 1898/19560 | loss 3.830893 (-1.35z)| norm 0.3869 (+2.39z)| lr 7.95e-04 | 4197.94 ms | 32.2% bf16 MFU | 123458 tok/s step 1899/19560 | loss 3.914338 (+0.24z)| norm 0.3594 (+1.63z)| lr 7.95e-04 | 4223.69 ms | 32.0% bf16 MFU | 123492 tok/s step 1900/19560 | loss 3.878192 (-0.45z)| norm 0.2897 (-0.21z)| lr 7.95e-04 | 4185.01 ms | 32.3% bf16 MFU | 123581 tok/s step 1901/19560 | loss 3.880752 (-0.39z)| norm 0.2788 (-0.49z)| lr 7.95e-04 | 4174.15 ms | 32.3% bf16 MFU | 123682 tok/s step 1902/19560 | loss 3.898402 (-0.05z)| norm 0.2758 (-0.57z)| lr 7.95e-04 | 4179.66 ms | 32.3% bf16 MFU | 123770 tok/s step 1903/19560 | loss 3.898142 (-0.04z)| norm 0.2694 (-0.73z)| lr 7.95e-04 | 4211.93 ms | 32.1% bf16 MFU | 123805 tok/s step 1904/19560 | loss 3.847147 (-1.01z)| norm 0.2750 (-0.58z)| lr 7.95e-04 | 4193.51 ms | 32.2% bf16 MFU | 123866 tok/s step 1905/19560 | loss 3.883690 (-0.30z)| norm 0.2658 (-0.81z)| lr 7.95e-04 | 4230.75 ms | 31.9% bf16 MFU | 123869 tok/s step 1906/19560 | loss 3.908600 (+0.18z)| norm 0.2281 (-1.77z)| lr 7.95e-04 | 4201.43 ms | 32.1% bf16 MFU | 123915 tok/s step 1907/19560 | loss 3.777928 (-2.27z)| norm 0.2433 (-1.36z)| lr 
7.95e-04 | 4212.01 ms | 32.1% bf16 MFU | 123943 tok/s step 1908/19560 | loss 3.905197 (+0.15z)| norm 0.2596 (-0.93z)| lr 7.95e-04 | 4221.95 ms | 32.0% bf16 MFU | 123955 tok/s step 1909/19560 | loss 3.824802 (-1.36z)| norm 0.2777 (-0.47z)| lr 7.95e-04 | 4194.94 ms | 32.2% bf16 MFU | 124006 tok/s step 1910/19560 | loss 3.938508 (+0.79z)| norm 0.2869 (-0.24z)| lr 7.95e-04 | 4202.75 ms | 32.1% bf16 MFU | 124043 tok/s step 1911/19560 | loss 3.867785 (-0.54z)| norm 0.3114 (+0.39z)| lr 7.95e-04 | 4216.04 ms | 32.0% bf16 MFU | 124059 tok/s step 1912/19560 | loss 3.868296 (-0.52z)| norm 0.2931 (-0.10z)| lr 7.95e-04 | 4250.47 ms | 31.8% bf16 MFU | 124023 tok/s step 1913/19560 | loss 3.842178 (-1.00z)| norm 0.2715 (-0.67z)| lr 7.95e-04 | 4185.79 ms | 32.3% bf16 MFU | 124085 tok/s step 1914/19560 | loss 3.848887 (-0.86z)| norm 0.2755 (-0.55z)| lr 7.95e-04 | 4215.50 ms | 32.0% bf16 MFU | 124099 tok/s step 1915/19560 | loss 3.859932 (-0.64z)| norm 0.2733 (-0.60z)| lr 7.95e-04 | 4189.97 ms | 32.2% bf16 MFU | 124151 tok/s step 1916/19560 | loss 3.894129 (+0.02z)| norm 0.2636 (-0.84z)| lr 7.95e-04 | 4274.87 ms | 31.6% bf16 MFU | 124075 tok/s step 1917/19560 | loss 3.895699 (+0.05z)| norm 0.2570 (-1.00z)| lr 7.95e-04 | 4185.83 ms | 32.3% bf16 MFU | 124134 tok/s step 1918/19560 | loss 3.944977 (+0.98z)| norm 0.2873 (-0.21z)| lr 7.95e-04 | 4210.14 ms | 32.1% bf16 MFU | 124154 tok/s step 1919/19560 | loss 3.879448 (-0.27z)| norm 0.2999 (+0.12z)| lr 7.95e-04 | 4180.00 ms | 32.3% bf16 MFU | 124218 tok/s step 1920/19560 | loss 3.857019 (-0.69z)| norm 0.3175 (+0.58z)| lr 7.95e-04 | 4223.09 ms | 32.0% bf16 MFU | 124214 tok/s step 1921/19560 | loss 3.943001 (+0.95z)| norm 0.2895 (-0.15z)| lr 7.95e-04 | 4255.14 ms | 31.7% bf16 MFU | 124164 tok/s step 1922/19560 | loss 3.870165 (-0.44z)| norm 0.2713 (-0.61z)| lr 7.95e-04 | 4230.96 ms | 31.9% bf16 MFU | 124152 tok/s step 1923/19560 | loss 3.864610 (-0.54z)| norm 0.2726 (-0.57z)| lr 7.95e-04 | 4276.11 ms | 31.6% bf16 MFU | 124075 tok/s step 
1924/19560 | loss 3.871669 (-0.39z)| norm 0.4259 (+3.40z)| lr 7.95e-04 | 4185.85 ms | 32.3% bf16 MFU | 124133 tok/s step 1925/19560 | loss 3.846959 (-0.85z)| norm 0.3009 (+0.17z)| lr 7.95e-04 | 4243.79 ms | 31.8% bf16 MFU | 124104 tok/s step 1926/19560 | loss 3.840972 (-0.95z)| norm 0.2825 (-0.31z)| lr 7.95e-04 | 4187.98 ms | 32.2% bf16 MFU | 124158 tok/s step 1927/19560 | loss 3.926911 (+0.71z)| norm 0.3258 (+0.80z)| lr 7.95e-04 | 4183.65 ms | 32.3% bf16 MFU | 124216 tok/s step 1928/19560 | loss 3.907235 (+0.34z)| norm 0.3413 (+1.19z)| lr 7.95e-04 | 4191.95 ms | 32.2% bf16 MFU | 124259 tok/s step 1929/19560 | loss 3.883154 (-0.12z)| norm 0.3193 (+0.62z)| lr 7.95e-04 | 4222.73 ms | 32.0% bf16 MFU | 124254 tok/s step 1930/19560 | loss 3.851868 (-0.72z)| norm 0.2790 (-0.42z)| lr 7.95e-04 | 4183.89 ms | 32.3% bf16 MFU | 124307 tok/s step 1931/19560 | loss 3.898504 (+0.20z)| norm 0.2455 (-1.26z)| lr 7.95e-04 | 4226.79 ms | 31.9% bf16 MFU | 124293 tok/s step 1932/19560 | loss 3.852284 (-0.70z)| norm 0.2634 (-0.80z)| lr 7.95e-04 | 4209.36 ms | 32.1% bf16 MFU | 124306 tok/s step 1933/19560 | loss 3.831805 (-1.09z)| norm 0.2586 (-0.91z)| lr 7.95e-04 | 4191.50 ms | 32.2% bf16 MFU | 124345 tok/s step 1934/19560 | loss 3.878065 (-0.17z)| norm 0.3284 (+0.85z)| lr 7.95e-04 | 4331.16 ms | 31.2% bf16 MFU | 124180 tok/s step 1935/19560 | loss 3.780429 (-2.05z)| norm 0.3449 (+1.25z)| lr 7.95e-04 | 4190.10 ms | 32.2% bf16 MFU | 124228 tok/s step 1936/19560 | loss 3.862886 (-0.44z)| norm 0.2925 (-0.07z)| lr 7.95e-04 | 4218.03 ms | 32.0% bf16 MFU | 124231 tok/s step 1937/19560 | loss 3.900837 (+0.32z)| norm 0.2827 (-0.31z)| lr 7.95e-04 | 4238.07 ms | 31.9% bf16 MFU | 124205 tok/s step 1938/19560 | loss 3.857145 (-0.54z)| norm 0.2902 (-0.12z)| lr 7.95e-04 | 4186.76 ms | 32.2% bf16 MFU | 124256 tok/s step 1939/19560 | loss 3.835035 (-0.96z)| norm 0.2782 (-0.41z)| lr 7.95e-04 | 4193.77 ms | 32.2% bf16 MFU | 124294 tok/s step 1940/19560 | loss 3.916857 (+0.67z)| norm 0.2749 (-0.50z)| lr 
7.95e-04 | 4194.95 ms | 32.2% bf16 MFU | 124328 tok/s step 1941/19560 | loss 3.859036 (-0.47z)| norm 0.2800 (-0.36z)| lr 7.95e-04 | 4349.32 ms | 31.0% bf16 MFU | 124139 tok/s step 1942/19560 | loss 3.894135 (+0.25z)| norm 0.2763 (-0.46z)| lr 7.95e-04 | 4181.94 ms | 32.3% bf16 MFU | 124201 tok/s step 1943/19560 | loss 3.896852 (+0.32z)| norm 0.2609 (-0.89z)| lr 7.95e-04 | 4176.89 ms | 32.3% bf16 MFU | 124267 tok/s step 1944/19560 | loss 3.896999 (+0.33z)| norm 0.3094 (+0.51z)| lr 7.95e-04 | 4171.28 ms | 32.4% bf16 MFU | 124338 tok/s step 1945/19560 | loss 3.786460 (-1.92z)| norm 0.3439 (+1.50z)| lr 7.95e-04 | 4193.58 ms | 32.2% bf16 MFU | 124372 tok/s step 1946/19560 | loss 3.852876 (-0.55z)| norm 0.3265 (+1.02z)| lr 7.95e-04 | 4179.29 ms | 32.3% bf16 MFU | 124426 tok/s step 1947/19560 | loss 3.864880 (-0.30z)| norm 0.3053 (+0.41z)| lr 7.95e-04 | 4180.35 ms | 32.3% bf16 MFU | 124475 tok/s step 1948/19560 | loss 3.869988 (-0.19z)| norm 0.2933 (+0.07z)| lr 7.95e-04 | 4216.78 ms | 32.0% bf16 MFU | 124468 tok/s step 1949/19560 | loss 3.858798 (-0.41z)| norm 0.2690 (-0.63z)| lr 7.95e-04 | 4242.06 ms | 31.8% bf16 MFU | 124425 tok/s step 1950/19560 | loss 3.984090 (+2.28z)| norm 0.2806 (-0.29z)| lr 7.95e-04 | 4339.26 ms | 31.1% bf16 MFU | 124245 tok/s step 1951/19560 | loss 3.873325 (-0.10z)| norm 0.2990 (+0.24z)| lr 7.95e-04 | 4275.99 ms | 31.6% bf16 MFU | 124163 tok/s step 1952/19560 | loss 3.884893 (+0.14z)| norm 0.2788 (-0.35z)| lr 7.95e-04 | 4192.67 ms | 32.2% bf16 MFU | 124207 tok/s step 1953/19560 | loss 3.801183 (-1.66z)| norm 0.2548 (-1.06z)| lr 7.95e-04 | 4261.44 ms | 31.7% bf16 MFU | 124148 tok/s step 1954/19560 | loss 3.897715 (+0.40z)| norm 0.2557 (-1.02z)| lr 7.95e-04 | 4206.75 ms | 32.1% bf16 MFU | 124173 tok/s step 1955/19560 | loss 3.935403 (+1.21z)| norm 0.2946 (+0.12z)| lr 7.95e-04 | 4187.33 ms | 32.2% bf16 MFU | 124224 tok/s step 1956/19560 | loss 3.864153 (-0.33z)| norm 0.2788 (-0.33z)| lr 7.95e-04 | 4234.38 ms | 31.9% bf16 MFU | 124204 tok/s step 
1957/19560 | loss 3.872243 (-0.15z)| norm 0.2408 (-1.44z)| lr 7.95e-04 | 4213.82 ms | 32.0% bf16 MFU | 124215 tok/s step 1958/19560 | loss 3.811872 (-1.44z)| norm 0.2711 (-0.53z)| lr 7.95e-04 | 4188.76 ms | 32.2% bf16 MFU | 124262 tok/s step 1959/19560 | loss 3.878301 (-0.01z)| norm 0.2718 (-0.50z)| lr 7.95e-04 | 4236.26 ms | 31.9% bf16 MFU | 124237 tok/s step 1960/19560 | loss 3.858354 (-0.45z)| norm 0.2831 (-0.16z)| lr 7.95e-04 | 4189.10 ms | 32.2% bf16 MFU | 124283 tok/s step 1961/19560 | loss 3.871731 (-0.15z)| norm 0.2761 (-0.37z)| lr 7.95e-04 | 4204.02 ms | 32.1% bf16 MFU | 124305 tok/s step 1962/19560 | loss 3.892482 (+0.29z)| norm 0.2861 (-0.08z)| lr 7.95e-04 | 4176.96 ms | 32.3% bf16 MFU | 124365 tok/s step 1963/19560 | loss 3.803568 (-1.67z)| norm 0.2709 (-0.53z)| lr 7.95e-04 | 4184.35 ms | 32.3% bf16 MFU | 124412 tok/s step 1964/19560 | loss 3.787731 (-1.98z)| norm 0.2759 (-0.38z)| lr 7.95e-04 | 4187.88 ms | 32.2% bf16 MFU | 124451 tok/s step 1965/19560 | loss 3.856817 (-0.44z)| norm 0.2715 (-0.51z)| lr 7.95e-04 | 4302.94 ms | 31.4% bf16 MFU | 124321 tok/s step 1966/19560 | loss 3.821683 (-1.23z)| norm 0.2667 (-0.65z)| lr 7.95e-04 | 4188.65 ms | 32.2% bf16 MFU | 124363 tok/s step 1967/19560 | loss 3.883229 (+0.15z)| norm 0.3272 (+1.14z)| lr 7.95e-04 | 4199.99 ms | 32.1% bf16 MFU | 124386 tok/s step 1968/19560 | loss 3.892759 (+0.37z)| norm 0.3353 (+1.36z)| lr 7.95e-04 | 4178.62 ms | 32.3% bf16 MFU | 124440 tok/s step 1969/19560 | loss 3.811249 (-1.45z)| norm 0.2493 (-1.18z)| lr 7.95e-04 | 4188.41 ms | 32.2% bf16 MFU | 124477 tok/s step 1970/19560 | loss 3.808013 (-1.50z)| norm 0.2825 (-0.21z)| lr 7.95e-04 | 4185.90 ms | 32.3% bf16 MFU | 124516 tok/s step 1971/19560 | loss 3.845096 (-0.67z)| norm 0.3622 (+2.10z)| lr 7.95e-04 | 4205.56 ms | 32.1% bf16 MFU | 124523 tok/s step 1972/19560 | loss 3.879143 (+0.09z)| norm 0.3450 (+1.57z)| lr 7.95e-04 | 4206.71 ms | 32.1% bf16 MFU | 124529 tok/s step 1973/19560 | loss 3.827833 (-1.04z)| norm 0.2687 (-0.63z)| lr 
7.95e-04 | 4200.18 ms | 32.1% bf16 MFU | 124544 tok/s step 1974/19560 | loss 3.847726 (-0.59z)| norm 0.2950 (+0.13z)| lr 7.95e-04 | 4192.93 ms | 32.2% bf16 MFU | 124568 tok/s step 1975/19560 | loss 3.853660 (-0.45z)| norm 0.2863 (-0.13z)| lr 7.95e-04 | 4225.64 ms | 32.0% bf16 MFU | 124544 tok/s step 1976/19560 | loss 3.901422 (+0.60z)| norm 0.2631 (-0.79z)| lr 7.95e-04 | 4211.74 ms | 32.1% bf16 MFU | 124541 tok/s step 1977/19560 | loss 3.881140 (+0.15z)| norm 0.3097 (+0.55z)| lr 7.95e-04 | 4322.67 ms | 31.2% bf16 MFU | 124378 tok/s step 1978/19560 | loss 3.822835 (-1.12z)| norm 0.2878 (-0.08z)| lr 7.95e-04 | 4192.22 ms | 32.2% bf16 MFU | 124412 tok/s step 1979/19560 | loss 3.948046 (+1.61z)| norm 0.2636 (-0.77z)| lr 7.95e-04 | 4182.25 ms | 32.3% bf16 MFU | 124460 tok/s step 1980/19560 | loss 3.899368 (+0.58z)| norm 0.2483 (-1.19z)| lr 7.95e-04 | 4208.17 ms | 32.1% bf16 MFU | 124466 tok/s step 1981/19560 | loss 3.840612 (-0.74z)| norm 0.2784 (-0.33z)| lr 7.95e-04 | 4184.76 ms | 32.3% bf16 MFU | 124507 tok/s step 1982/19560 | loss 3.872255 (-0.02z)| norm 0.2716 (-0.51z)| lr 7.94e-04 | 4227.95 ms | 31.9% bf16 MFU | 124482 tok/s step 1983/19560 | loss 3.843524 (-0.66z)| norm 0.2718 (-0.50z)| lr 7.94e-04 | 4215.20 ms | 32.0% bf16 MFU | 124477 tok/s step 1984/19560 | loss 3.845810 (-0.61z)| norm 0.2715 (-0.51z)| lr 7.94e-04 | 4269.45 ms | 31.6% bf16 MFU | 124393 tok/s step 1985/19560 | loss 3.881221 (+0.21z)| norm 0.2547 (-0.97z)| lr 7.94e-04 | 4210.18 ms | 32.1% bf16 MFU | 124400 tok/s step 1986/19560 | loss 3.904484 (+0.75z)| norm 0.3064 (+0.49z)| lr 7.94e-04 | 4185.86 ms | 32.3% bf16 MFU | 124442 tok/s step 1987/19560 | loss 3.900183 (+0.64z)| norm 0.2859 (-0.09z)| lr 7.94e-04 | 4210.10 ms | 32.1% bf16 MFU | 124447 tok/s step 1988/19560 | loss 3.849453 (-0.52z)| norm 0.2872 (-0.04z)| lr 7.94e-04 | 4191.97 ms | 32.2% bf16 MFU | 124478 tok/s step 1989/19560 | loss 3.895802 (+0.55z)| norm 0.2826 (-0.16z)| lr 7.94e-04 | 4191.84 ms | 32.2% bf16 MFU | 124508 tok/s step 
1990/19560 | loss 3.892006 (+0.46z)| norm 0.2953 (+0.23z)| lr 7.94e-04 | 4369.58 ms | 30.9% bf16 MFU | 124282 tok/s step 1991/19560 | loss 3.845532 (-0.63z)| norm 0.2921 (+0.15z)| lr 7.94e-04 | 4190.20 ms | 32.2% bf16 MFU | 124324 tok/s step 1992/19560 | loss 3.796432 (-1.73z)| norm 0.2725 (-0.43z)| lr 7.94e-04 | 4184.44 ms | 32.3% bf16 MFU | 124372 tok/s step 1993/19560 | loss 3.836863 (-0.79z)| norm 0.2962 (+0.27z)| lr 7.94e-04 | 4201.11 ms | 32.1% bf16 MFU | 124393 tok/s step 1994/19560 | loss 3.826731 (-1.00z)| norm 0.2747 (-0.38z)| lr 7.94e-04 | 4530.86 ms | 29.8% bf16 MFU | 123960 tok/s step 1995/19560 | loss 3.878356 (+0.22z)| norm 0.2856 (-0.05z)| lr 7.94e-04 | 4184.71 ms | 32.3% bf16 MFU | 124026 tok/s step 1996/19560 | loss 3.861865 (-0.19z)| norm 0.2943 (+0.21z)| lr 7.94e-04 | 4197.31 ms | 32.2% bf16 MFU | 124070 tok/s step 1997/19560 | loss 3.953548 (+1.99z)| norm 0.2670 (-0.60z)| lr 7.94e-04 | 4184.12 ms | 32.3% bf16 MFU | 124132 tok/s step 1998/19560 | loss 3.846655 (-0.57z)| norm 0.2568 (-0.90z)| lr 7.94e-04 | 4188.71 ms | 32.2% bf16 MFU | 124184 tok/s step 1999/19560 | loss 3.838487 (-0.75z)| norm 0.2292 (-1.73z)| lr 7.94e-04 | 4293.92 ms | 31.4% bf16 MFU | 124079 tok/s step 2000/19560 | loss 3.843350 (-0.64z)| norm 0.2482 (-1.12z)| lr 7.94e-04 | 4191.08 ms | 32.2% bf16 MFU | 124130 tok/s val loss 3.861422
HellaSwag: 2630/10042 = 0.261900 step 2001/19560 | loss 3.872746 (+0.07z)| norm 0.2921 (+0.21z)| lr 7.94e-04 | 4294.63 ms | 31.4% bf16 MFU | 124028 tok/s step 2002/19560 | loss 3.796634 (-1.72z)| norm 0.3237 (+1.17z)| lr 7.94e-04 | 4470.50 ms | 30.2% bf16 MFU | 123690 tok/s step 2003/19560 | loss 3.812424 (-1.33z)| norm 0.3295 (+1.32z)| lr 7.94e-04 | 4251.38 ms | 31.8% bf16 MFU | 123672 tok/s step 2004/19560 | loss 3.856457 (-0.29z)| norm 0.2996 (+0.40z)| lr 7.94e-04 | 4328.11 ms | 31.2% bf16 MFU | 123545 tok/s step 2005/19560 | loss 3.771003 (-2.25z)| norm 0.2656 (-0.65z)| lr 7.94e-04 | 4455.05 ms | 30.3% bf16 MFU | 123252 tok/s step 2006/19560 | loss 3.818822 (-1.14z)| norm 0.2412 (-1.40z)| lr 7.94e-04 | 4261.29 
ms | 31.7% bf16 MFU | 123241 tok/s step 2007/19560 | loss 3.869451 (+0.02z)| norm 0.2776 (-0.29z)| lr 7.94e-04 | 4177.85 ms | 32.3% bf16 MFU | 123354 tok/s step 2008/19560 | loss 3.853216 (-0.34z)| norm 0.3105 (+0.72z)| lr 7.94e-04 | 4305.72 ms | 31.4% bf16 MFU | 123274 tok/s step 2009/19560 | loss 3.878131 (+0.26z)| norm 0.3072 (+0.60z)| lr 7.94e-04 | 4230.16 ms | 31.9% bf16 MFU | 123308 tok/s step 2010/19560 | loss 3.849599 (-0.44z)| norm 0.2637 (-0.76z)| lr 7.94e-04 | 4218.55 ms | 32.0% bf16 MFU | 123356 tok/s step 2011/19560 | loss 3.782621 (-2.05z)| norm 0.2811 (-0.22z)| lr 7.94e-04 | 4193.37 ms | 32.2% bf16 MFU | 123440 tok/s step 2012/19560 | loss 3.840975 (-0.63z)| norm 0.2407 (-1.46z)| lr 7.94e-04 | 4183.41 ms | 32.3% bf16 MFU | 123534 tok/s step 2013/19560 | loss 3.867378 (+0.04z)| norm 0.2674 (-0.62z)| lr 7.94e-04 | 4728.73 ms | 28.6% bf16 MFU | 122901 tok/s step 2014/19560 | loss 3.816330 (-1.25z)| norm 0.2944 (+0.22z)| lr 7.94e-04 | 4540.58 ms | 29.7% bf16 MFU | 122529 tok/s step 2015/19560 | loss 3.781065 (-2.10z)| norm 0.2775 (-0.31z)| lr 7.94e-04 | 4652.13 ms | 29.0% bf16 MFU | 122038 tok/s step 2016/19560 | loss 3.830340 (-0.86z)| norm 0.2541 (-1.03z)| lr 7.94e-04 | 4574.41 ms | 29.5% bf16 MFU | 121667 tok/s step 2017/19560 | loss 3.844752 (-0.49z)| norm 0.2603 (-0.83z)| lr 7.94e-04 | 4227.87 ms | 31.9% bf16 MFU | 121784 tok/s step 2018/19560 | loss 3.812774 (-1.27z)| norm 0.2670 (-0.62z)| lr 7.94e-04 | 4279.72 ms | 31.5% bf16 MFU | 121820 tok/s step 2019/19560 | loss 3.779852 (-2.05z)| norm 0.2563 (-0.96z)| lr 7.94e-04 | 4290.77 ms | 31.5% bf16 MFU | 121838 tok/s step 2020/19560 | loss 3.817626 (-1.10z)| norm 0.2532 (-1.05z)| lr 7.94e-04 | 4278.37 ms | 31.6% bf16 MFU | 121873 tok/s step 2021/19560 | loss 3.800720 (-1.49z)| norm 0.2750 (-0.36z)| lr 7.94e-04 | 4253.59 ms | 31.7% bf16 MFU | 121943 tok/s step 2022/19560 | loss 3.853593 (-0.18z)| norm 0.2752 (-0.35z)| lr 7.94e-04 | 4302.46 ms | 31.4% bf16 MFU | 121938 tok/s step 2023/19560 | loss 
3.873874 (+0.32z)| norm 0.2362 (-1.54z)| lr 7.94e-04 | 4227.07 ms | 31.9% bf16 MFU | 122043 tok/s step 2024/19560 | loss 3.820575 (-0.98z)| norm 0.2802 (-0.16z)| lr 7.94e-04 | 4195.37 ms | 32.2% bf16 MFU | 122189 tok/s step 2025/19560 | loss 3.879729 (+0.47z)| norm 0.3089 (+0.81z)| lr 7.94e-04 | 4207.29 ms | 32.1% bf16 MFU | 122311 tok/s step 2026/19560 | loss 3.905994 (+1.10z)| norm 0.3357 (+1.76z)| lr 7.94e-04 | 4184.68 ms | 32.3% bf16 MFU | 122459 tok/s step 2027/19560 | loss 3.760356 (-2.42z)| norm 0.3343 (+1.75z)| lr 7.94e-04 | 4167.78 ms | 32.4% bf16 MFU | 122626 tok/s step 2028/19560 | loss 3.811413 (-1.16z)| norm 0.2898 (+0.20z)| lr 7.94e-04 | 4190.36 ms | 32.2% bf16 MFU | 122751 tok/s step 2029/19560 | loss 3.808346 (-1.22z)| norm 0.3072 (+0.80z)| lr 7.94e-04 | 4209.56 ms | 32.1% bf16 MFU | 122841 tok/s step 2030/19560 | loss 3.834090 (-0.59z)| norm 0.2969 (+0.43z)| lr 7.94e-04 | 4202.78 ms | 32.1% bf16 MFU | 122936 tok/s step 2031/19560 | loss 3.822792 (-0.85z)| norm 0.2625 (-0.75z)| lr 7.94e-04 | 4170.13 ms | 32.4% bf16 MFU | 123075 tok/s step 2032/19560 | loss 3.881145 (+0.55z)| norm 0.2603 (-0.82z)| lr 7.94e-04 | 4272.80 ms | 31.6% bf16 MFU | 123057 tok/s step 2033/19560 | loss 3.815435 (-1.01z)| norm 0.2865 (+0.08z)| lr 7.94e-04 | 4221.65 ms | 32.0% bf16 MFU | 123113 tok/s step 2034/19560 | loss 3.854277 (-0.07z)| norm 0.2840 (-0.03z)| lr 7.94e-04 | 4222.86 ms | 32.0% bf16 MFU | 123166 tok/s step 2035/19560 | loss 3.804022 (-1.30z)| norm 0.2819 (-0.11z)| lr 7.94e-04 | 4180.03 ms | 32.3% bf16 MFU | 123279 tok/s step 2036/19560 | loss 3.804388 (-1.27z)| norm 0.2516 (-1.18z)| lr 7.94e-04 | 4182.24 ms | 32.3% bf16 MFU | 123383 tok/s step 2037/19560 | loss 3.888170 (+0.75z)| norm 0.2571 (-0.98z)| lr 7.94e-04 | 4182.54 ms | 32.3% bf16 MFU | 123481 tok/s step 2038/19560 | loss 3.832662 (-0.58z)| norm 0.2625 (-0.78z)| lr 7.94e-04 | 4180.96 ms | 32.3% bf16 MFU | 123577 tok/s step 2039/19560 | loss 3.864038 (+0.19z)| norm 0.2194 (-2.23z)| lr 7.94e-04 | 4178.11 
ms | 32.3% bf16 MFU | 123672 tok/s step 2040/19560 | loss 3.843652 (-0.31z)| norm 0.2448 (-1.33z)| lr 7.94e-04 | 4178.02 ms | 32.3% bf16 MFU | 123763 tok/s step 2041/19560 | loss 3.865871 (+0.24z)| norm 0.2801 (-0.12z)| lr 7.94e-04 | 4178.46 ms | 32.3% bf16 MFU | 123849 tok/s step 2042/19560 | loss 3.758428 (-2.34z)| norm 0.2895 (+0.20z)| lr 7.94e-04 | 4173.65 ms | 32.3% bf16 MFU | 123937 tok/s step 2043/19560 | loss 3.854714 (-0.02z)| norm 0.2854 (+0.06z)| lr 7.94e-04 | 4181.50 ms | 32.3% bf16 MFU | 124009 tok/s step 2044/19560 | loss 3.845342 (-0.24z)| norm 0.3186 (+1.18z)| lr 7.94e-04 | 4181.24 ms | 32.3% bf16 MFU | 124079 tok/s step 2045/19560 | loss 3.856698 (+0.04z)| norm 0.3407 (+1.89z)| lr 7.94e-04 | 4184.38 ms | 32.3% bf16 MFU | 124139 tok/s step 2046/19560 | loss 3.822118 (-0.78z)| norm 0.3382 (+1.77z)| lr 7.94e-04 | 4206.54 ms | 32.1% bf16 MFU | 124164 tok/s step 2047/19560 | loss 3.837289 (-0.40z)| norm 0.2866 (+0.05z)| lr 7.94e-04 | 4178.58 ms | 32.3% bf16 MFU | 124230 tok/s step 2048/19560 | loss 3.876609 (+0.56z)| norm 0.2757 (-0.31z)| lr 7.94e-04 | 4186.98 ms | 32.2% bf16 MFU | 124279 tok/s step 2049/19560 | loss 3.824307 (-0.72z)| norm 0.2671 (-0.59z)| lr 7.94e-04 | 4182.03 ms | 32.3% bf16 MFU | 124333 tok/s step 2050/19560 | loss 3.795963 (-1.40z)| norm 0.2647 (-0.67z)| lr 7.94e-04 | 4175.28 ms | 32.3% bf16 MFU | 124395 tok/s step 2051/19560 | loss 3.804185 (-1.18z)| norm 0.2564 (-0.94z)| lr 7.94e-04 | 4275.05 ms | 31.6% bf16 MFU | 124307 tok/s step 2052/19560 | loss 3.808115 (-1.07z)| norm 0.2826 (-0.03z)| lr 7.94e-04 | 4198.38 ms | 32.2% bf16 MFU | 124336 tok/s step 2053/19560 | loss 3.798298 (-1.29z)| norm 0.2918 (+0.31z)| lr 7.94e-04 | 4193.24 ms | 32.2% bf16 MFU | 124371 tok/s step 2054/19560 | loss 3.860705 (+0.24z)| norm 0.2786 (-0.17z)| lr 7.94e-04 | 4184.41 ms | 32.3% bf16 MFU | 124417 tok/s step 2055/19560 | loss 3.762680 (-2.13z)| norm 0.2581 (-0.91z)| lr 7.94e-04 | 4185.12 ms | 32.3% bf16 MFU | 124460 tok/s step 2056/19560 | loss 
3.827358 (-0.54z)| norm 0.2332 (-1.82z)| lr 7.94e-04 | 4186.62 ms | 32.2% bf16 MFU | 124498 tok/s step 2057/19560 | loss 3.854596 (+0.14z)| norm 0.2336 (-1.77z)| lr 7.94e-04 | 4176.56 ms | 32.3% bf16 MFU | 124550 tok/s step 2058/19560 | loss 3.804991 (-1.07z)| norm 0.2840 (+0.10z)| lr 7.94e-04 | 4175.20 ms | 32.3% bf16 MFU | 124601 tok/s step 2059/19560 | loss 3.837015 (-0.27z)| norm 0.2764 (-0.19z)| lr 7.94e-04 | 4200.88 ms | 32.1% bf16 MFU | 124611 tok/s step 2060/19560 | loss 3.840550 (-0.18z)| norm 0.2973 (+0.58z)| lr 7.94e-04 | 4180.58 ms | 32.3% bf16 MFU | 124651 tok/s step 2061/19560 | loss 3.788138 (-1.46z)| norm 0.3268 (+1.65z)| lr 7.94e-04 | 4175.33 ms | 32.3% bf16 MFU | 124697 tok/s step 2062/19560 | loss 3.860610 (+0.32z)| norm 0.2899 (+0.29z)| lr 7.94e-04 | 4185.21 ms | 32.3% bf16 MFU | 124726 tok/s step 2063/19560 | loss 3.774351 (-1.79z)| norm 0.2828 (+0.05z)| lr 7.94e-04 | 4180.53 ms | 32.3% bf16 MFU | 124760 tok/s step 2064/19560 | loss 3.825380 (-0.53z)| norm 0.2663 (-0.58z)| lr 7.94e-04 | 4183.10 ms | 32.3% bf16 MFU | 124789 tok/s step 2065/19560 | loss 3.810702 (-0.88z)| norm 0.3046 (+0.88z)| lr 7.94e-04 | 4179.58 ms | 32.3% bf16 MFU | 124821 tok/s step 2066/19560 | loss 3.816534 (-0.73z)| norm 0.2909 (+0.36z)| lr 7.94e-04 | 4189.06 ms | 32.2% bf16 MFU | 124838 tok/s step 2067/19560 | loss 3.843651 (-0.06z)| norm 0.2748 (-0.26z)| lr 7.94e-04 | 4175.58 ms | 32.3% bf16 MFU | 124874 tok/s step 2068/19560 | loss 3.814387 (-0.77z)| norm 0.2505 (-1.17z)| lr 7.93e-04 | 4188.76 ms | 32.2% bf16 MFU | 124889 tok/s step 2069/19560 | loss 3.855761 (+0.26z)| norm 0.2650 (-0.61z)| lr 7.93e-04 | 4180.69 ms | 32.3% bf16 MFU | 124915 tok/s step 2070/19560 | loss 3.795187 (-1.22z)| norm 0.2607 (-0.77z)| lr 7.93e-04 | 4177.98 ms | 32.3% bf16 MFU | 124943 tok/s step 2071/19560 | loss 3.759017 (-2.07z)| norm 0.2602 (-0.79z)| lr 7.93e-04 | 4178.14 ms | 32.3% bf16 MFU | 124970 tok/s step 2072/19560 | loss 3.824362 (-0.46z)| norm 0.2767 (-0.16z)| lr 7.93e-04 | 4237.87 
ms | 31.9% bf16 MFU | 124908 tok/s step 2073/19560 | loss 3.845294 (+0.05z)| norm 0.2672 (-0.50z)| lr 7.93e-04 | 4167.30 ms | 32.4% bf16 MFU | 124953 tok/s step 2074/19560 | loss 3.832353 (-0.27z)| norm 0.2648 (-0.59z)| lr 7.93e-04 | 4189.95 ms | 32.2% bf16 MFU | 124962 tok/s step 2075/19560 | loss 3.790370 (-1.29z)| norm 0.2765 (-0.12z)| lr 7.93e-04 | 4180.95 ms | 32.3% bf16 MFU | 124983 tok/s step 2076/19560 | loss 3.816563 (-0.64z)| norm 0.2683 (-0.43z)| lr 7.93e-04 | 4233.39 ms | 31.9% bf16 MFU | 124927 tok/s step 2077/19560 | loss 3.806701 (-0.87z)| norm 0.2570 (-0.88z)| lr 7.93e-04 | 4170.23 ms | 32.4% bf16 MFU | 124966 tok/s step 2078/19560 | loss 3.877690 (+0.95z)| norm 0.2369 (-1.64z)| lr 7.93e-04 | 4182.08 ms | 32.3% bf16 MFU | 124986 tok/s step 2079/19560 | loss 3.788463 (-1.34z)| norm 0.2313 (-1.82z)| lr 7.93e-04 | 4171.63 ms | 32.4% bf16 MFU | 125021 tok/s step 2080/19560 | loss 3.800303 (-1.02z)| norm 0.2680 (-0.40z)| lr 7.93e-04 | 4175.20 ms | 32.3% bf16 MFU | 125049 tok/s step 2081/19560 | loss 3.833004 (-0.18z)| norm 0.3122 (+1.29z)| lr 7.93e-04 | 4190.29 ms | 32.2% bf16 MFU | 125052 tok/s step 2082/19560 | loss 3.766498 (-1.87z)| norm 0.3359 (+2.14z)| lr 7.93e-04 | 4189.71 ms | 32.2% bf16 MFU | 125056 tok/s step 2083/19560 | loss 3.815999 (-0.58z)| norm 0.2989 (+0.74z)| lr 7.93e-04 | 4186.58 ms | 32.3% bf16 MFU | 125065 tok/s step 2084/19560 | loss 3.854297 (+0.43z)| norm 0.3051 (+0.96z)| lr 7.93e-04 | 4187.12 ms | 32.2% bf16 MFU | 125072 tok/s step 2085/19560 | loss 3.833539 (-0.11z)| norm 0.3024 (+0.85z)| lr 7.93e-04 | 4177.08 ms | 32.3% bf16 MFU | 125095 tok/s step 2086/19560 | loss 3.815762 (-0.58z)| norm 0.2663 (-0.52z)| lr 7.93e-04 | 4185.80 ms | 32.3% bf16 MFU | 125103 tok/s step 2087/19560 | loss 3.762524 (-1.95z)| norm 0.2403 (-1.48z)| lr 7.93e-04 | 4175.75 ms | 32.3% bf16 MFU | 125125 tok/s step 2088/19560 | loss 3.828597 (-0.21z)| norm 0.2567 (-0.85z)| lr 7.93e-04 | 4173.92 ms | 32.3% bf16 MFU | 125149 tok/s step 2089/19560 | loss 
3.817171 (-0.50z)| norm 0.2571 (-0.83z)| lr 7.93e-04 | 4183.86 ms | 32.3% bf16 MFU | 125158 tok/s step 2090/19560 | loss 3.792615 (-1.13z)| norm 0.2758 (-0.13z)| lr 7.93e-04 | 4184.31 ms | 32.3% bf16 MFU | 125165 tok/s step 2091/19560 | loss 3.795474 (-1.05z)| norm 0.2560 (-0.87z)| lr 7.93e-04 | 4174.71 ms | 32.3% bf16 MFU | 125186 tok/s step 2092/19560 | loss 4.000441 (+4.05z)| norm 0.2540 (-0.93z)| lr 7.93e-04 | 4191.62 ms | 32.2% bf16 MFU | 125180 tok/s step 2093/19560 | loss 3.761685 (-1.82z)| norm 0.3068 (+1.02z)| lr 7.93e-04 | 4175.25 ms | 32.3% bf16 MFU | 125200 tok/s step 2094/19560 | loss 3.839400 (+0.08z)| norm 0.3689 (+3.16z)| lr 7.93e-04 | 4183.35 ms | 32.3% bf16 MFU | 125206 tok/s step 2095/19560 | loss 3.787613 (-1.17z)| norm 0.3455 (+2.30z)| lr 7.93e-04 | 4211.73 ms | 32.1% bf16 MFU | 125170 tok/s step 2096/19560 | loss 3.896265 (+1.48z)| norm 0.3052 (+0.90z)| lr 7.93e-04 | 4173.07 ms | 32.4% bf16 MFU | 125193 tok/s step 2097/19560 | loss 3.782639 (-1.28z)| norm 0.2824 (+0.08z)| lr 7.93e-04 | 4176.58 ms | 32.3% bf16 MFU | 125210 tok/s step 2098/19560 | loss 3.870806 (+0.85z)| norm 0.2625 (-0.63z)| lr 7.93e-04 | 4179.19 ms | 32.3% bf16 MFU | 125222 tok/s step 2099/19560 | loss 3.765556 (-1.67z)| norm 0.2696 (-0.36z)| lr 7.93e-04 | 4182.18 ms | 32.3% bf16 MFU | 125229 tok/s step 2100/19560 | loss 3.831657 (-0.08z)| norm 0.2534 (-0.95z)| lr 7.93e-04 | 4195.19 ms | 32.2% bf16 MFU | 125217 tok/s step 2101/19560 | loss 3.855039 (+0.48z)| norm 0.3029 (+0.90z)| lr 7.93e-04 | 4182.45 ms | 32.3% bf16 MFU | 125223 tok/s step 2102/19560 | loss 3.766764 (-1.62z)| norm 0.2776 (-0.05z)| lr 7.93e-04 | 4177.78 ms | 32.3% bf16 MFU | 125237 tok/s step 2103/19560 | loss 3.775747 (-1.38z)| norm 0.2473 (-1.17z)| lr 7.93e-04 | 4190.24 ms | 32.2% bf16 MFU | 125231 tok/s step 2104/19560 | loss 3.787930 (-1.07z)| norm 0.2713 (-0.27z)| lr 7.93e-04 | 4175.60 ms | 32.3% bf16 MFU | 125248 tok/s step 2105/19560 | loss 3.797141 (-0.84z)| norm 0.2541 (-0.90z)| lr 7.93e-04 | 4167.86 
ms | 32.4% bf16 MFU | 125275 tok/s step 2106/19560 | loss 3.894848 (+1.47z)| norm 0.2416 (-1.35z)| lr 7.93e-04 | 4186.66 ms | 32.2% bf16 MFU | 125273 tok/s step 2107/19560 | loss 3.845331 (+0.32z)| norm 0.2587 (-0.71z)| lr 7.93e-04 | 4176.35 ms | 32.3% bf16 MFU | 125286 tok/s step 2108/19560 | loss 3.816528 (-0.37z)| norm 0.2722 (-0.21z)| lr 7.93e-04 | 4167.41 ms | 32.4% bf16 MFU | 125312 tok/s step 2109/19560 | loss 3.816849 (-0.35z)| norm 0.2880 (+0.37z)| lr 7.93e-04 | 4187.16 ms | 32.2% bf16 MFU | 125307 tok/s step 2110/19560 | loss 3.821152 (-0.24z)| norm 0.3087 (+1.13z)| lr 7.93e-04 | 4180.54 ms | 32.3% bf16 MFU | 125312 tok/s step 2111/19560 | loss 3.806353 (-0.60z)| norm 0.2883 (+0.37z)| lr 7.93e-04 | 4184.21 ms | 32.3% bf16 MFU | 125312 tok/s step 2112/19560 | loss 3.985685 (+3.61z)| norm 0.2909 (+0.46z)| lr 7.93e-04 | 4228.05 ms | 31.9% bf16 MFU | 125246 tok/s step 2113/19560 | loss 3.831544 (+0.01z)| norm 0.2903 (+0.43z)| lr 7.93e-04 | 4234.66 ms | 31.9% bf16 MFU | 125174 tok/s step 2114/19560 | loss 3.801157 (-0.70z)| norm 0.3298 (+1.88z)| lr 7.93e-04 | 4241.75 ms | 31.8% bf16 MFU | 125096 tok/s step 2115/19560 | loss 3.733681 (-2.25z)| norm 0.3600 (+2.88z)| lr 7.93e-04 | 4185.81 ms | 32.3% bf16 MFU | 125104 tok/s step 2116/19560 | loss 3.897657 (+1.59z)| norm 0.3297 (+1.76z)| lr 7.93e-04 | 4174.86 ms | 32.3% bf16 MFU | 125127 tok/s step 2117/19560 | loss 3.815278 (-0.32z)| norm 0.3209 (+1.43z)| lr 7.93e-04 | 4173.12 ms | 32.4% bf16 MFU | 125153 tok/s step 2118/19560 | loss 3.849837 (+0.51z)| norm 0.3187 (+1.34z)| lr 7.93e-04 | 4173.59 ms | 32.4% bf16 MFU | 125176 tok/s step 2119/19560 | loss 3.860085 (+0.75z)| norm 0.3081 (+0.96z)| lr 7.93e-04 | 4168.10 ms | 32.4% bf16 MFU | 125207 tok/s step 2120/19560 | loss 3.814425 (-0.34z)| norm 0.2816 (+0.03z)| lr 7.93e-04 | 4173.55 ms | 32.4% bf16 MFU | 125227 tok/s step 2121/19560 | loss 3.836463 (+0.18z)| norm 0.2887 (+0.28z)| lr 7.93e-04 | 4176.62 ms | 32.3% bf16 MFU | 125243 tok/s step 2122/19560 | loss 
3.813657 (-0.36z)| norm 0.2991 (+0.64z)| lr 7.93e-04 | 4183.70 ms | 32.3% bf16 MFU | 125246 tok/s step 2123/19560 | loss 3.732496 (-2.23z)| norm 0.2973 (+0.57z)| lr 7.93e-04 | 4174.38 ms | 32.3% bf16 MFU | 125264 tok/s step 2124/19560 | loss 3.726816 (-2.29z)| norm 0.2532 (-0.95z)| lr 7.93e-04 | 4185.10 ms | 32.3% bf16 MFU | 125264 tok/s step 2125/19560 | loss 3.799250 (-0.62z)| norm 0.2708 (-0.34z)| lr 7.93e-04 | 4180.45 ms | 32.3% bf16 MFU | 125272 tok/s step 2126/19560 | loss 3.879709 (+1.28z)| norm 0.2545 (-0.90z)| lr 7.93e-04 | 4180.33 ms | 32.3% bf16 MFU | 125279 tok/s step 2127/19560 | loss 3.748464 (-1.78z)| norm 0.2912 (+0.36z)| lr 7.93e-04 | 4181.71 ms | 32.3% bf16 MFU | 125284 tok/s step 2128/19560 | loss 3.797407 (-0.63z)| norm 0.2455 (-1.24z)| lr 7.93e-04 | 4166.48 ms | 32.4% bf16 MFU | 125312 tok/s step 2129/19560 | loss 3.740808 (-1.91z)| norm 0.2346 (-1.59z)| lr 7.93e-04 | 4189.63 ms | 32.2% bf16 MFU | 125303 tok/s step 2130/19560 | loss 3.770184 (-1.22z)| norm 0.2444 (-1.23z)| lr 7.93e-04 | 4178.51 ms | 32.3% bf16 MFU | 125311 tok/s step 2131/19560 | loss 3.788960 (-0.78z)| norm 0.2342 (-1.57z)| lr 7.93e-04 | 4237.90 ms | 31.9% bf16 MFU | 125232 tok/s step 2132/19560 | loss 3.807131 (-0.36z)| norm 0.2352 (-1.50z)| lr 7.93e-04 | 4183.03 ms | 32.3% bf16 MFU | 125237 tok/s step 2133/19560 | loss 3.830389 (+0.17z)| norm 0.2418 (-1.26z)| lr 7.93e-04 | 4182.09 ms | 32.3% bf16 MFU | 125243 tok/s step 2134/19560 | loss 3.791808 (-0.72z)| norm 0.2434 (-1.21z)| lr 7.93e-04 | 4175.63 ms | 32.3% bf16 MFU | 125259 tok/s step 2135/19560 | loss 3.850286 (+0.64z)| norm 0.2525 (-0.88z)| lr 7.93e-04 | 4172.51 ms | 32.4% bf16 MFU | 125279 tok/s step 2136/19560 | loss 3.791224 (-0.72z)| norm 0.2531 (-0.85z)| lr 7.93e-04 | 4174.91 ms | 32.3% bf16 MFU | 125294 tok/s step 2137/19560 | loss 3.833205 (+0.26z)| norm 0.2436 (-1.16z)| lr 7.93e-04 | 4176.59 ms | 32.3% bf16 MFU | 125306 tok/s step 2138/19560 | loss 3.821191 (-0.01z)| norm 0.2457 (-1.08z)| lr 7.93e-04 | 4174.18 
ms | 32.3% bf16 MFU | 125320 tok/s step 2139/19560 | loss 3.801751 (-0.47z)| norm 0.2817 (+0.15z)| lr 7.93e-04 | 4185.40 ms | 32.3% bf16 MFU | 125318 tok/s step 2140/19560 | loss 3.792699 (-0.67z)| norm 0.2725 (-0.17z)| lr 7.93e-04 | 4170.50 ms | 32.4% bf16 MFU | 125338 tok/s step 2141/19560 | loss 3.865975 (+1.04z)| norm 0.2814 (+0.13z)| lr 7.93e-04 | 4187.33 ms | 32.2% bf16 MFU | 125331 tok/s step 2142/19560 | loss 3.822543 (+0.02z)| norm 0.2550 (-0.76z)| lr 7.93e-04 | 4187.24 ms | 32.2% bf16 MFU | 125325 tok/s step 2143/19560 | loss 3.782649 (-0.91z)| norm 0.2513 (-0.88z)| lr 7.93e-04 | 4171.24 ms | 32.4% bf16 MFU | 125343 tok/s step 2144/19560 | loss 3.769615 (-1.20z)| norm 0.2401 (-1.26z)| lr 7.93e-04 | 4208.17 ms | 32.1% bf16 MFU | 125306 tok/s step 2145/19560 | loss 3.748470 (-1.65z)| norm 0.2496 (-0.93z)| lr 7.93e-04 | 4176.64 ms | 32.3% bf16 MFU | 125317 tok/s step 2146/19560 | loss 3.728308 (-2.07z)| norm 0.2963 (+0.65z)| lr 7.93e-04 | 4162.48 ms | 32.4% bf16 MFU | 125349 tok/s step 2147/19560 | loss 3.818785 (-0.03z)| norm 0.3380 (+2.02z)| lr 7.92e-04 | 4168.60 ms | 32.4% bf16 MFU | 125370 tok/s step 2148/19560 | loss 3.810366 (-0.22z)| norm 0.3328 (+1.81z)| lr 7.92e-04 | 4171.39 ms | 32.4% bf16 MFU | 125386 tok/s step 2149/19560 | loss 3.757051 (-1.41z)| norm 0.3089 (+1.00z)| lr 7.92e-04 | 4163.12 ms | 32.4% bf16 MFU | 125413 tok/s step 2150/19560 | loss 3.716516 (-2.26z)| norm 0.2938 (+0.50z)| lr 7.92e-04 | 4167.98 ms | 32.4% bf16 MFU | 125432 tok/s step 2151/19560 | loss 3.804048 (-0.31z)| norm 0.2755 (-0.12z)| lr 7.92e-04 | 4172.73 ms | 32.4% bf16 MFU | 125443 tok/s step 2152/19560 | loss 3.827411 (+0.21z)| norm 0.2754 (-0.12z)| lr 7.92e-04 | 4180.40 ms | 32.3% bf16 MFU | 125441 tok/s step 2153/19560 | loss 3.786688 (-0.68z)| norm 0.2851 (+0.21z)| lr 7.92e-04 | 4271.44 ms | 31.6% bf16 MFU | 125306 tok/s step 2154/19560 | loss 3.711889 (-2.31z)| norm 0.3146 (+1.21z)| lr 7.92e-04 | 4176.72 ms | 32.3% bf16 MFU | 125317 tok/s step 2155/19560 | loss 
3.793771 (-0.50z)| norm 0.2817 (+0.12z)| lr 7.92e-04 | 4170.23 ms | 32.4% bf16 MFU | 125338 tok/s step 2156/19560 | loss 3.768886 (-1.04z)| norm 0.2710 (-0.24z)| lr 7.92e-04 | 4174.68 ms | 32.3% bf16 MFU | 125350 tok/s step 2157/19560 | loss 3.839776 (+0.53z)| norm 0.3027 (+0.84z)| lr 7.92e-04 | 4176.85 ms | 32.3% bf16 MFU | 125359 tok/s step 2158/19560 | loss 3.900973 (+1.86z)| norm 0.3543 (+2.53z)| lr 7.92e-04 | 4172.47 ms | 32.4% bf16 MFU | 125373 tok/s step 2159/19560 | loss 3.886407 (+1.51z)| norm 0.3488 (+2.28z)| lr 7.92e-04 | 4165.33 ms | 32.4% bf16 MFU | 125398 tok/s step 2160/19560 | loss 3.867026 (+1.10z)| norm 0.3190 (+1.28z)| lr 7.92e-04 | 4175.41 ms | 32.3% bf16 MFU | 125407 tok/s step 2161/19560 | loss 3.822755 (+0.13z)| norm 0.2921 (+0.40z)| lr 7.92e-04 | 4180.88 ms | 32.3% bf16 MFU | 125406 tok/s step 2162/19560 | loss 3.835368 (+0.41z)| norm 0.2932 (+0.44z)| lr 7.92e-04 | 4164.59 ms | 32.4% bf16 MFU | 125431 tok/s step 2163/19560 | loss 3.815469 (-0.03z)| norm 0.2755 (-0.14z)| lr 7.92e-04 | 4166.37 ms | 32.4% bf16 MFU | 125451 tok/s step 2164/19560 | loss 3.793267 (-0.51z)| norm 0.2838 (+0.12z)| lr 7.92e-04 | 4176.03 ms | 32.3% bf16 MFU | 125456 tok/s step 2165/19560 | loss 3.899119 (+1.79z)| norm 0.2360 (-1.42z)| lr 7.92e-04 | 4175.01 ms | 32.3% bf16 MFU | 125462 tok/s step 2166/19560 | loss 3.803748 (-0.28z)| norm 0.2537 (-0.84z)| lr 7.92e-04 | 4168.36 ms | 32.4% bf16 MFU | 125478 tok/s step 2167/19560 | loss 3.777641 (-0.84z)| norm 0.2671 (-0.43z)| lr 7.92e-04 | 4185.04 ms | 32.3% bf16 MFU | 125468 tok/s step 2168/19560 | loss 3.884010 (+1.47z)| norm 0.3017 (+0.70z)| lr 7.92e-04 | 4177.71 ms | 32.3% bf16 MFU | 125469 tok/s step 2169/19560 | loss 3.804879 (-0.24z)| norm 0.2829 (+0.07z)| lr 7.92e-04 | 4177.09 ms | 32.3% bf16 MFU | 125471 tok/s step 2170/19560 | loss 3.730888 (-1.83z)| norm 0.2750 (-0.18z)| lr 7.92e-04 | 4157.02 ms | 32.5% bf16 MFU | 125504 tok/s step 2171/19560 | loss 3.795969 (-0.41z)| norm 0.2825 (+0.07z)| lr 7.92e-04 | 4208.56 
ms | 32.1% bf16 MFU | 125458 tok/s step 2172/19560 | loss 3.821440 (+0.14z)| norm 0.3355 (+1.80z)| lr 7.92e-04 | 4202.71 ms | 32.1% bf16 MFU | 125422 tok/s step 2173/19560 | loss 3.812767 (-0.04z)| norm 0.3434 (+2.06z)| lr 7.92e-04 | 4173.19 ms | 32.4% bf16 MFU | 125433 tok/s step 2174/19560 | loss 3.795278 (-0.42z)| norm 0.3377 (+1.87z)| lr 7.92e-04 | 4171.67 ms | 32.4% bf16 MFU | 125445 tok/s step 2175/19560 | loss 3.729220 (-1.81z)| norm 0.3042 (+0.77z)| lr 7.92e-04 | 4175.99 ms | 32.3% bf16 MFU | 125450 tok/s step 2176/19560 | loss 3.847446 (+0.74z)| norm 0.3241 (+1.39z)| lr 7.92e-04 | 4171.16 ms | 32.4% bf16 MFU | 125462 tok/s step 2177/19560 | loss 3.798198 (-0.32z)| norm 0.2698 (-0.37z)| lr 7.92e-04 | 4176.91 ms | 32.3% bf16 MFU | 125465 tok/s step 2178/19560 | loss 3.783611 (-0.63z)| norm 0.2773 (-0.13z)| lr 7.92e-04 | 4169.46 ms | 32.4% bf16 MFU | 125479 tok/s step 2179/19560 | loss 3.821602 (+0.18z)| norm 0.2626 (-0.61z)| lr 7.92e-04 | 4179.31 ms | 32.3% bf16 MFU | 125478 tok/s step 2180/19560 | loss 3.811545 (-0.03z)| norm 0.2463 (-1.12z)| lr 7.92e-04 | 4184.86 ms | 32.3% bf16 MFU | 125468 tok/s step 2181/19560 | loss 3.786275 (-0.58z)| norm 0.2505 (-0.97z)| lr 7.92e-04 | 4170.21 ms | 32.4% bf16 MFU | 125481 tok/s step 2182/19560 | loss 3.776318 (-0.78z)| norm 0.2474 (-1.06z)| lr 7.92e-04 | 4213.86 ms | 32.0% bf16 MFU | 125428 tok/s step 2183/19560 | loss 3.819427 (+0.14z)| norm 0.2491 (-1.00z)| lr 7.92e-04 | 4183.30 ms | 32.3% bf16 MFU | 125423 tok/s step 2184/19560 | loss 3.821397 (+0.19z)| norm 0.2380 (-1.36z)| lr 7.92e-04 | 4176.06 ms | 32.3% bf16 MFU | 125429 tok/s step 2185/19560 | loss 3.809455 (-0.06z)| norm 0.2339 (-1.49z)| lr 7.92e-04 | 4171.29 ms | 32.4% bf16 MFU | 125442 tok/s step 2186/19560 | loss 3.827501 (+0.32z)| norm 0.2543 (-0.83z)| lr 7.92e-04 | 4178.93 ms | 32.3% bf16 MFU | 125443 tok/s step 2187/19560 | loss 3.752702 (-1.28z)| norm 0.2602 (-0.63z)| lr 7.92e-04 | 4167.37 ms | 32.4% bf16 MFU | 125461 tok/s step 2188/19560 | loss 
3.720005 (-1.94z)| norm 0.2831 (+0.10z)| lr 7.92e-04 | 4166.94 ms | 32.4% bf16 MFU | 125479 tok/s step 2189/19560 | loss 3.799548 (-0.25z)| norm 0.3010 (+0.68z)| lr 7.92e-04 | 4228.90 ms | 31.9% bf16 MFU | 125404 tok/s step 2190/19560 | loss 3.767818 (-0.91z)| norm 0.2765 (-0.10z)| lr 7.92e-04 | 4228.22 ms | 31.9% bf16 MFU | 125334 tok/s step 2191/19560 | loss 3.846496 (+0.76z)| norm 0.2456 (-1.09z)| lr 7.92e-04 | 4172.56 ms | 32.4% bf16 MFU | 125349 tok/s step 2192/19560 | loss 3.846915 (+0.76z)| norm 0.2221 (-1.81z)| lr 7.92e-04 | 4189.54 ms | 32.2% bf16 MFU | 125339 tok/s step 2193/19560 | loss 3.789978 (-0.45z)| norm 0.2555 (-0.73z)| lr 7.92e-04 | 4169.05 ms | 32.4% bf16 MFU | 125360 tok/s step 2194/19560 | loss 3.794025 (-0.36z)| norm 0.2554 (-0.73z)| lr 7.92e-04 | 4179.03 ms | 32.3% bf16 MFU | 125365 tok/s step 2195/19560 | loss 3.774296 (-0.77z)| norm 0.2501 (-0.89z)| lr 7.92e-04 | 4195.65 ms | 32.2% bf16 MFU | 125345 tok/s step 2196/19560 | loss 3.870878 (+1.28z)| norm 0.2612 (-0.54z)| lr 7.92e-04 | 4203.93 ms | 32.1% bf16 MFU | 125313 tok/s step 2197/19560 | loss 3.936436 (+2.59z)| norm 0.2859 (+0.24z)| lr 7.92e-04 | 4176.19 ms | 32.3% bf16 MFU | 125325 tok/s step 2198/19560 | loss 3.777740 (-0.69z)| norm 0.2884 (+0.31z)| lr 7.92e-04 | 4165.30 ms | 32.4% bf16 MFU | 125352 tok/s step 2199/19560 | loss 3.848786 (+0.76z)| norm 0.2731 (-0.18z)| lr 7.92e-04 | 4177.26 ms | 32.3% bf16 MFU | 125360 tok/s step 2200/19560 | loss 3.838942 (+0.56z)| norm 0.2440 (-1.09z)| lr 7.92e-04 | 4166.67 ms | 32.4% bf16 MFU | 125383 tok/s step 2201/19560 | loss 3.769103 (-0.87z)| norm 0.2739 (-0.15z)| lr 7.92e-04 | 4196.26 ms | 32.2% bf16 MFU | 125361 tok/s step 2202/19560 | loss 3.742598 (-1.40z)| norm 0.2792 (+0.02z)| lr 7.92e-04 | 4192.02 ms | 32.2% bf16 MFU | 125346 tok/s step 2203/19560 | loss 3.792302 (-0.38z)| norm 0.2686 (-0.32z)| lr 7.92e-04 | 4185.54 ms | 32.3% bf16 MFU | 125342 tok/s step 2204/19560 | loss 3.785302 (-0.52z)| norm 0.2706 (-0.25z)| lr 7.92e-04 | 5079.03 
ms | 26.6% bf16 MFU | 124236 tok/s step 2205/19560 | loss 3.800265 (-0.21z)| norm 0.3070 (+0.89z)| lr 7.92e-04 | 4775.70 ms | 28.3% bf16 MFU | 123514 tok/s step 2206/19560 | loss 3.814036 (+0.09z)| norm 0.2691 (-0.32z)| lr 7.92e-04 | 4385.49 ms | 30.8% bf16 MFU | 123316 tok/s step 2207/19560 | loss 3.779819 (-0.62z)| norm 0.2776 (-0.06z)| lr 7.92e-04 | 4494.11 ms | 30.0% bf16 MFU | 122983 tok/s step 2208/19560 | loss 3.771760 (-0.78z)| norm 0.3245 (+1.42z)| lr 7.92e-04 | 4296.32 ms | 31.4% bf16 MFU | 122935 tok/s step 2209/19560 | loss 3.834714 (+0.52z)| norm 0.3414 (+1.93z)| lr 7.92e-04 | 4212.54 ms | 32.1% bf16 MFU | 123011 tok/s step 2210/19560 | loss 3.778064 (-0.65z)| norm 0.3736 (+2.87z)| lr 7.92e-04 | 4164.11 ms | 32.4% bf16 MFU | 123156 tok/s step 2211/19560 | loss 3.819204 (+0.20z)| norm 0.3335 (+1.61z)| lr 7.92e-04 | 4315.32 ms | 31.3% bf16 MFU | 123073 tok/s step 2212/19560 | loss 3.849942 (+0.83z)| norm 0.3127 (+0.97z)| lr 7.92e-04 | 4183.23 ms | 32.3% bf16 MFU | 123186 tok/s step 2213/19560 | loss 3.747767 (-1.26z)| norm 0.2756 (-0.15z)| lr 7.92e-04 | 4303.63 ms | 31.4% bf16 MFU | 123118 tok/s step 2214/19560 | loss 3.755911 (-1.08z)| norm 0.2619 (-0.57z)| lr 7.92e-04 | 4282.38 ms | 31.5% bf16 MFU | 123083 tok/s step 2215/19560 | loss 3.771534 (-0.76z)| norm 0.3009 (+0.61z)| lr 7.92e-04 | 4267.21 ms | 31.6% bf16 MFU | 123073 tok/s step 2216/19560 | loss 3.790734 (-0.36z)| norm 0.2944 (+0.40z)| lr 7.92e-04 | 4320.54 ms | 31.3% bf16 MFU | 122986 tok/s step 2217/19560 | loss 3.784134 (-0.49z)| norm 0.2652 (-0.50z)| lr 7.92e-04 | 4180.83 ms | 32.3% bf16 MFU | 123107 tok/s step 2218/19560 | loss 3.840668 (+0.66z)| norm 0.2434 (-1.16z)| lr 7.92e-04 | 4178.75 ms | 32.3% bf16 MFU | 123225 tok/s step 2219/19560 | loss 3.806748 (-0.04z)| norm 0.2700 (-0.35z)| lr 7.92e-04 | 4166.27 ms | 32.4% bf16 MFU | 123356 tok/s step 2220/19560 | loss 3.825566 (+0.40z)| norm 0.2702 (-0.35z)| lr 7.92e-04 | 4166.85 ms | 32.4% bf16 MFU | 123479 tok/s step 2221/19560 | loss 
3.750385 (-1.23z)| norm 0.2490 (-0.98z)| lr 7.92e-04 | 4169.59 ms | 32.4% bf16 MFU | 123592 tok/s step 2222/19560 | loss 3.794717 (-0.26z)| norm 0.2591 (-0.66z)| lr 7.91e-04 | 4166.02 ms | 32.4% bf16 MFU | 123705 tok/s step 2223/19560 | loss 3.780961 (-0.56z)| norm 0.2784 (-0.04z)| lr 7.91e-04 | 4166.51 ms | 32.4% bf16 MFU | 123812 tok/s step 2224/19560 | loss 3.817241 (+0.25z)| norm 0.2509 (-0.91z)| lr 7.91e-04 | 4178.59 ms | 32.3% bf16 MFU | 123894 tok/s step 2225/19560 | loss 3.750784 (-1.21z)| norm 0.2813 (+0.07z)| lr 7.91e-04 | 4238.81 ms | 31.9% bf16 MFU | 123884 tok/s step 2226/19560 | loss 3.799157 (-0.14z)| norm 0.3194 (+1.27z)| lr 7.91e-04 | 4169.59 ms | 32.4% bf16 MFU | 123977 tok/s step 2227/19560 | loss 3.806790 (+0.03z)| norm 0.3107 (+0.98z)| lr 7.91e-04 | 4169.06 ms | 32.4% bf16 MFU | 124066 tok/s step 2228/19560 | loss 3.757016 (-1.06z)| norm 0.3024 (+0.70z)| lr 7.91e-04 | 4179.36 ms | 32.3% bf16 MFU | 124135 tok/s step 2229/19560 | loss 3.810536 (+0.13z)| norm 0.2666 (-0.43z)| lr 7.91e-04 | 4191.28 ms | 32.2% bf16 MFU | 124183 tok/s step 2230/19560 | loss 3.768809 (-0.80z)| norm 0.2940 (+0.44z)| lr 7.91e-04 | 4167.17 ms | 32.4% bf16 MFU | 124264 tok/s step 2231/19560 | loss 3.743314 (-1.35z)| norm 0.2443 (-1.14z)| lr 7.91e-04 | 4167.05 ms | 32.4% bf16 MFU | 124342 tok/s step 2232/19560 | loss 3.792265 (-0.27z)| norm 0.2417 (-1.21z)| lr 7.91e-04 | 4271.60 ms | 31.6% bf16 MFU | 124262 tok/s step 2233/19560 | loss 3.779848 (-0.54z)| norm 0.2344 (-1.43z)| lr 7.91e-04 | 4210.35 ms | 32.1% bf16 MFU | 124275 tok/s step 2234/19560 | loss 3.807727 (+0.09z)| norm 0.2483 (-1.00z)| lr 7.91e-04 | 4168.89 ms | 32.4% bf16 MFU | 124349 tok/s step 2235/19560 | loss 3.764222 (-0.87z)| norm 0.2569 (-0.72z)| lr 7.91e-04 | 4172.13 ms | 32.4% bf16 MFU | 124415 tok/s step 2236/19560 | loss 3.796693 (-0.14z)| norm 0.2483 (-0.99z)| lr 7.91e-04 | 4285.51 ms | 31.5% bf16 MFU | 124311 tok/s step 2237/19560 | loss 3.788744 (-0.31z)| norm 0.2445 (-1.09z)| lr 7.91e-04 | 4326.74 
ms | 31.2% bf16 MFU | 124154 tok/s step 2238/19560 | loss 3.778455 (-0.54z)| norm 0.2641 (-0.46z)| lr 7.91e-04 | 4177.18 ms | 32.3% bf16 MFU | 124222 tok/s step 2239/19560 | loss 3.794055 (-0.18z)| norm 0.2488 (-0.93z)| lr 7.91e-04 | 4231.42 ms | 31.9% bf16 MFU | 124206 tok/s step 2240/19560 | loss 3.790908 (-0.24z)| norm 0.2473 (-0.97z)| lr 7.91e-04 | 4232.71 ms | 31.9% bf16 MFU | 124189 tok/s step 2241/19560 | loss 3.835260 (+0.83z)| norm 0.2864 (+0.25z)| lr 7.91e-04 | 4214.77 ms | 32.0% bf16 MFU | 124199 tok/s step 2242/19560 | loss 3.809254 (+0.20z)| norm 0.3191 (+1.28z)| lr 7.91e-04 | 4181.86 ms | 32.3% bf16 MFU | 124258 tok/s step 2243/19560 | loss 3.757088 (-1.06z)| norm 0.3232 (+1.45z)| lr 7.91e-04 | 4208.41 ms | 32.1% bf16 MFU | 124274 tok/s step 2244/19560 | loss 3.700524 (-2.40z)| norm 0.2960 (+0.59z)| lr 7.91e-04 | 4214.80 ms | 32.0% bf16 MFU | 124280 tok/s step 2245/19560 | loss 3.783059 (-0.39z)| norm 0.2821 (+0.16z)| lr 7.91e-04 | 4175.64 ms | 32.3% bf16 MFU | 124344 tok/s step 2246/19560 | loss 3.833851 (+0.85z)| norm 0.2762 (-0.03z)| lr 7.91e-04 | 4175.05 ms | 32.3% bf16 MFU | 124406 tok/s step 2247/19560 | loss 3.773271 (-0.61z)| norm 0.2784 (+0.05z)| lr 7.91e-04 | 4175.59 ms | 32.3% bf16 MFU | 124463 tok/s step 2248/19560 | loss 3.818267 (+0.49z)| norm 0.2504 (-0.85z)| lr 7.91e-04 | 4227.09 ms | 31.9% bf16 MFU | 124442 tok/s step 2249/19560 | loss 3.775094 (-0.56z)| norm 0.2587 (-0.58z)| lr 7.91e-04 | 4157.20 ms | 32.5% bf16 MFU | 124525 tok/s step 2250/19560 | loss 3.785426 (-0.30z)| norm 0.2803 (+0.14z)| lr 7.91e-04 | 4222.70 ms | 32.0% bf16 MFU | 124507 tok/s val loss 3.805050
HellaSwag: 2633/10042 = 0.262199 step 2251/19560 | loss 3.816329 (+0.45z)| norm 0.3114 (+1.15z)| lr 7.91e-04 | 4182.14 ms | 32.3% bf16 MFU | 124550 tok/s step 2252/19560 | loss 3.806577 (+0.19z)| norm 0.3208 (+1.43z)| lr 7.91e-04 | 4168.78 ms | 32.4% bf16 MFU | 124611 tok/s step 2253/19560 | loss 3.851024 (+1.29z)| norm 0.3116 (+1.11z)| lr 7.91e-04 | 4166.67 ms | 32.4% bf16 MFU | 124672 tok/s step 2254/19560 | loss 3.748190 (-1.26z)| norm 
0.2930 (+0.50z)| lr 7.91e-04 | 4169.83 ms | 32.4% bf16 MFU | 124725 tok/s step 2255/19560 | loss 3.721773 (-1.91z)| norm 0.3042 (+0.86z)| lr 7.91e-04 | 4178.41 ms | 32.3% bf16 MFU | 124762 tok/s step 2256/19560 | loss 3.742567 (-1.37z)| norm 0.2655 (-0.39z)| lr 7.91e-04 | 4264.19 ms | 31.7% bf16 MFU | 124672 tok/s step 2257/19560 | loss 3.748516 (-1.22z)| norm 0.2683 (-0.31z)| lr 7.91e-04 | 4208.89 ms | 32.1% bf16 MFU | 124666 tok/s step 2258/19560 | loss 3.834444 (+0.90z)| norm 0.2626 (-0.51z)| lr 7.91e-04 | 4307.99 ms | 31.3% bf16 MFU | 124518 tok/s step 2259/19560 | loss 3.888308 (+2.17z)| norm 0.2681 (-0.34z)| lr 7.91e-04 | 4171.39 ms | 32.4% bf16 MFU | 124577 tok/s step 2260/19560 | loss 3.778116 (-0.50z)| norm 0.2561 (-0.74z)| lr 7.91e-04 | 4172.36 ms | 32.4% bf16 MFU | 124631 tok/s step 2261/19560 | loss 3.714561 (-2.00z)| norm 0.2537 (-0.83z)| lr 7.91e-04 | 4168.01 ms | 32.4% bf16 MFU | 124689 tok/s step 2262/19560 | loss 3.814360 (+0.39z)| norm 0.2722 (-0.22z)| lr 7.91e-04 | 4166.27 ms | 32.4% bf16 MFU | 124746 tok/s step 2263/19560 | loss 3.793740 (-0.09z)| norm 0.2786 (-0.01z)| lr 7.91e-04 | 4177.64 ms | 32.3% bf16 MFU | 124784 tok/s step 2264/19560 | loss 3.845057 (+1.13z)| norm 0.3560 (+2.50z)| lr 7.91e-04 | 4171.27 ms | 32.4% bf16 MFU | 124829 tok/s step 2265/19560 | loss 3.802611 (+0.12z)| norm 0.3754 (+3.01z)| lr 7.91e-04 | 4263.38 ms | 31.7% bf16 MFU | 124736 tok/s step 2266/19560 | loss 3.773350 (-0.58z)| norm 0.2887 (+0.24z)| lr 7.91e-04 | 4172.16 ms | 32.4% bf16 MFU | 124783 tok/s step 2267/19560 | loss 3.779524 (-0.43z)| norm 0.2821 (+0.03z)| lr 7.91e-04 | 4432.62 ms | 30.5% bf16 MFU | 124458 tok/s step 2268/19560 | loss 3.741616 (-1.32z)| norm 0.2629 (-0.58z)| lr 7.91e-04 | 4176.47 ms | 32.3% bf16 MFU | 124511 tok/s step 2269/19560 | loss 3.824382 (+0.67z)| norm 0.2892 (+0.26z)| lr 7.91e-04 | 4174.68 ms | 32.3% bf16 MFU | 124565 tok/s step 2270/19560 | loss 3.838835 (+1.01z)| norm 0.2827 (+0.04z)| lr 7.91e-04 | 4170.89 ms | 32.4% bf16 MFU | 
124622 tok/s step 2271/19560 | loss 3.773310 (-0.56z)| norm 0.2697 (-0.38z)| lr 7.91e-04 | 4257.40 ms | 31.7% bf16 MFU | 124548 tok/s step 2272/19560 | loss 3.767866 (-0.69z)| norm 0.2804 (-0.05z)| lr 7.91e-04 | 4167.16 ms | 32.4% bf16 MFU | 124612 tok/s step 2273/19560 | loss 3.772779 (-0.58z)| norm 0.2410 (-1.32z)| lr 7.91e-04 | 4187.37 ms | 32.2% bf16 MFU | 124641 tok/s step 2274/19560 | loss 3.777732 (-0.47z)| norm 0.2586 (-0.74z)| lr 7.91e-04 | 4163.85 ms | 32.4% bf16 MFU | 124705 tok/s step 2275/19560 | loss 3.821599 (+0.59z)| norm 0.2694 (-0.38z)| lr 7.91e-04 | 4170.11 ms | 32.4% bf16 MFU | 124756 tok/s step 2276/19560 | loss 3.794566 (-0.06z)| norm 0.2788 (-0.06z)| lr 7.91e-04 | 4280.19 ms | 31.5% bf16 MFU | 124643 tok/s step 2277/19560 | loss 3.695451 (-2.42z)| norm 0.2594 (-0.69z)| lr 7.91e-04 | 4226.66 ms | 31.9% bf16 MFU | 124613 tok/s step 2278/19560 | loss 3.799255 (+0.05z)| norm 0.2359 (-1.44z)| lr 7.91e-04 | 4198.59 ms | 32.2% bf16 MFU | 124626 tok/s step 2279/19560 | loss 3.820318 (+0.56z)| norm 0.2360 (-1.41z)| lr 7.91e-04 | 4161.19 ms | 32.4% bf16 MFU | 124694 tok/s step 2280/19560 | loss 3.769558 (-0.66z)| norm 0.2518 (-0.89z)| lr 7.91e-04 | 4170.51 ms | 32.4% bf16 MFU | 124745 tok/s step 2281/19560 | loss 3.805129 (+0.20z)| norm 0.2802 (+0.03z)| lr 7.91e-04 | 4176.82 ms | 32.3% bf16 MFU | 124784 tok/s step 2282/19560 | loss 3.795011 (-0.07z)| norm 0.2741 (-0.16z)| lr 7.91e-04 | 4188.87 ms | 32.2% bf16 MFU | 124803 tok/s step 2283/19560 | loss 3.726208 (-1.73z)| norm 0.2767 (-0.07z)| lr 7.91e-04 | 4176.92 ms | 32.3% bf16 MFU | 124839 tok/s step 2284/19560 | loss 3.767372 (-0.73z)| norm 0.2357 (-1.39z)| lr 7.91e-04 | 4177.97 ms | 32.3% bf16 MFU | 124871 tok/s step 2285/19560 | loss 3.806603 (+0.24z)| norm 0.2608 (-0.56z)| lr 7.91e-04 | 4183.40 ms | 32.3% bf16 MFU | 124894 tok/s step 2286/19560 | loss 3.746540 (-1.23z)| norm 0.2731 (-0.15z)| lr 7.91e-04 | 4173.25 ms | 32.4% bf16 MFU | 124931 tok/s step 2287/19560 | loss 3.786381 (-0.22z)| norm 
0.2865 (+0.32z)| lr 7.91e-04 | 4166.13 ms | 32.4% bf16 MFU | 124977 tok/s step 2288/19560 | loss 3.846216 (+1.32z)| norm 0.2924 (+0.53z)| lr 7.91e-04 | 4186.98 ms | 32.2% bf16 MFU | 124989 tok/s step 2289/19560 | loss 3.774347 (-0.51z)| norm 0.2943 (+0.60z)| lr 7.91e-04 | 4165.72 ms | 32.4% bf16 MFU | 125032 tok/s step 2290/19560 | loss 3.775621 (-0.47z)| norm 0.2927 (+0.54z)| lr 7.91e-04 | 4172.89 ms | 32.4% bf16 MFU | 125063 tok/s step 2291/19560 | loss 3.774041 (-0.50z)| norm 0.2926 (+0.53z)| lr 7.91e-04 | 4226.99 ms | 31.9% bf16 MFU | 125011 tok/s step 2292/19560 | loss 3.739799 (-1.36z)| norm 0.3045 (+0.93z)| lr 7.90e-04 | 4181.99 ms | 32.3% bf16 MFU | 125029 tok/s step 2293/19560 | loss 3.748899 (-1.13z)| norm 0.3099 (+1.10z)| lr 7.90e-04 | 4165.15 ms | 32.4% bf16 MFU | 125071 tok/s step 2294/19560 | loss 3.796189 (+0.11z)| norm 0.2775 (-0.01z)| lr 7.90e-04 | 4189.80 ms | 32.2% bf16 MFU | 125074 tok/s step 2295/19560 | loss 3.877324 (+2.18z)| norm 0.2623 (-0.53z)| lr 7.90e-04 | 4169.96 ms | 32.4% bf16 MFU | 125107 tok/s step 2296/19560 | loss 3.761783 (-0.79z)| norm 0.2603 (-0.59z)| lr 7.90e-04 | 4181.52 ms | 32.3% bf16 MFU | 125121 tok/s step 2297/19560 | loss 3.779144 (-0.33z)| norm 0.3014 (+0.81z)| lr 7.90e-04 | 4175.69 ms | 32.3% bf16 MFU | 125143 tok/s step 2298/19560 | loss 3.791591 (-0.01z)| norm 0.2840 (+0.21z)| lr 7.90e-04 | 4178.36 ms | 32.3% bf16 MFU | 125159 tok/s step 2299/19560 | loss 3.753792 (-1.00z)| norm 0.2520 (-0.87z)| lr 7.90e-04 | 4178.02 ms | 32.3% bf16 MFU | 125176 tok/s step 2300/19560 | loss 3.742729 (-1.27z)| norm 0.2406 (-1.25z)| lr 7.90e-04 | 4186.88 ms | 32.2% bf16 MFU | 125178 tok/s step 2301/19560 | loss 3.737759 (-1.38z)| norm 0.2674 (-0.31z)| lr 7.90e-04 | 4166.56 ms | 32.4% bf16 MFU | 125211 tok/s step 2302/19560 | loss 3.812067 (+0.56z)| norm 0.2871 (+0.40z)| lr 7.90e-04 | 4181.89 ms | 32.3% bf16 MFU | 125219 tok/s step 2303/19560 | loss 3.784383 (-0.18z)| norm 0.2825 (+0.25z)| lr 7.90e-04 | 4182.62 ms | 32.3% bf16 MFU | 
125225 tok/s step 2304/19560 | loss 3.760310 (-0.80z)| norm 0.2616 (-0.49z)| lr 7.90e-04 | 4183.67 ms | 32.3% bf16 MFU | 125230 tok/s step 2305/19560 | loss 3.771060 (-0.51z)| norm 0.2605 (-0.53z)| lr 7.90e-04 | 4175.71 ms | 32.3% bf16 MFU | 125246 tok/s step 2306/19560 | loss 3.846919 (+1.48z)| norm 0.2605 (-0.53z)| lr 7.90e-04 | 4168.03 ms | 32.4% bf16 MFU | 125273 tok/s step 2307/19560 | loss 3.794859 (+0.12z)| norm 0.2383 (-1.31z)| lr 7.90e-04 | 4180.49 ms | 32.3% bf16 MFU | 125280 tok/s step 2308/19560 | loss 3.779334 (-0.29z)| norm 0.2488 (-0.94z)| lr 7.90e-04 | 4166.73 ms | 32.4% bf16 MFU | 125308 tok/s step 2309/19560 | loss 3.783481 (-0.18z)| norm 0.2513 (-0.85z)| lr 7.90e-04 | 4177.83 ms | 32.3% bf16 MFU | 125317 tok/s step 2310/19560 | loss 3.792143 (+0.05z)| norm 0.2966 (+0.77z)| lr 7.90e-04 | 4170.48 ms | 32.4% bf16 MFU | 125337 tok/s step 2311/19560 | loss 3.750094 (-1.04z)| norm 0.3181 (+1.52z)| lr 7.90e-04 | 4186.51 ms | 32.3% bf16 MFU | 125332 tok/s step 2312/19560 | loss 3.860353 (+1.83z)| norm 0.3403 (+2.26z)| lr 7.90e-04 | 4208.54 ms | 32.1% bf16 MFU | 125294 tok/s step 2313/19560 | loss 3.824666 (+0.90z)| norm 0.3293 (+1.84z)| lr 7.90e-04 | 4188.15 ms | 32.2% bf16 MFU | 125288 tok/s step 2314/19560 | loss 3.750690 (-1.01z)| norm 0.2830 (+0.19z)| lr 7.90e-04 | 4198.86 ms | 32.2% bf16 MFU | 125267 tok/s step 2315/19560 | loss 3.821919 (+0.83z)| norm 0.2906 (+0.46z)| lr 7.90e-04 | 4169.52 ms | 32.4% bf16 MFU | 125291 tok/s step 2316/19560 | loss 3.811949 (+0.56z)| norm 0.2732 (-0.16z)| lr 7.90e-04 | 4169.40 ms | 32.4% bf16 MFU | 125314 tok/s step 2317/19560 | loss 3.796868 (+0.16z)| norm 0.2588 (-0.66z)| lr 7.90e-04 | 4179.30 ms | 32.3% bf16 MFU | 125320 tok/s step 2318/19560 | loss 3.783811 (-0.19z)| norm 0.2683 (-0.32z)| lr 7.90e-04 | 4187.41 ms | 32.2% bf16 MFU | 125315 tok/s step 2319/19560 | loss 3.788184 (-0.06z)| norm 0.2383 (-1.37z)| lr 7.90e-04 | 4185.29 ms | 32.3% bf16 MFU | 125312 tok/s step 2320/19560 | loss 3.753796 (-0.96z)| norm 
0.2471 (-1.08z)| lr 7.90e-04 | 4164.56 ms | 32.4% bf16 MFU | 125342 tok/s step 2321/19560 | loss 3.836249 (+1.23z)| norm 0.2717 (-0.21z)| lr 7.90e-04 | 4174.34 ms | 32.3% bf16 MFU | 125354 tok/s step 2322/19560 | loss 3.773575 (-0.44z)| norm 0.2741 (-0.13z)| lr 7.90e-04 | 4177.28 ms | 32.3% bf16 MFU | 125362 tok/s step 2323/19560 | loss 3.785028 (-0.13z)| norm 0.2399 (-1.35z)| lr 7.90e-04 | 4179.83 ms | 32.3% bf16 MFU | 125366 tok/s step 2324/19560 | loss 3.762710 (-0.72z)| norm 0.2372 (-1.43z)| lr 7.90e-04 | 4166.22 ms | 32.4% bf16 MFU | 125389 tok/s step 2325/19560 | loss 3.713132 (-2.11z)| norm 0.2454 (-1.12z)| lr 7.90e-04 | 4167.03 ms | 32.4% bf16 MFU | 125411 tok/s step 2326/19560 | loss 3.758591 (-0.81z)| norm 0.2545 (-0.79z)| lr 7.90e-04 | 4185.25 ms | 32.3% bf16 MFU | 125404 tok/s step 2327/19560 | loss 3.753909 (-0.93z)| norm 0.2509 (-0.90z)| lr 7.90e-04 | 4181.26 ms | 32.3% bf16 MFU | 125403 tok/s step 2328/19560 | loss 3.778922 (-0.21z)| norm 0.2524 (-0.86z)| lr 7.90e-04 | 4174.09 ms | 32.3% bf16 MFU | 125413 tok/s step 2329/19560 | loss 3.724121 (-1.75z)| norm 0.2367 (-1.39z)| lr 7.90e-04 | 4168.26 ms | 32.4% bf16 MFU | 125432 tok/s step 2330/19560 | loss 3.825440 (+1.11z)| norm 0.2694 (-0.24z)| lr 7.90e-04 | 4185.40 ms | 32.3% bf16 MFU | 125423 tok/s step 2331/19560 | loss 3.836279 (+1.40z)| norm 0.2475 (-1.00z)| lr 7.90e-04 | 4170.81 ms | 32.4% bf16 MFU | 125437 tok/s step 2332/19560 | loss 3.779601 (-0.20z)| norm 0.2372 (-1.34z)| lr 7.90e-04 | 4176.72 ms | 32.3% bf16 MFU | 125442 tok/s step 2333/19560 | loss 3.800715 (+0.40z)| norm 0.2455 (-1.04z)| lr 7.90e-04 | 4173.35 ms | 32.4% bf16 MFU | 125451 tok/s step 2334/19560 | loss 3.767998 (-0.52z)| norm 0.2442 (-1.07z)| lr 7.90e-04 | 4181.27 ms | 32.3% bf16 MFU | 125448 tok/s step 2335/19560 | loss 3.728451 (-1.61z)| norm 0.2700 (-0.18z)| lr 7.90e-04 | 4170.58 ms | 32.4% bf16 MFU | 125461 tok/s step 2336/19560 | loss 3.776685 (-0.26z)| norm 0.3066 (+1.09z)| lr 7.90e-04 | 4180.17 ms | 32.3% bf16 MFU | 
125459 tok/s step 2337/19560 | loss 3.731662 (-1.50z)| norm 0.3063 (+1.11z)| lr 7.90e-04 | 4174.68 ms | 32.3% bf16 MFU | 125466 tok/s step 2338/19560 | loss 3.753386 (-0.88z)| norm 0.3184 (+1.62z)| lr 7.90e-04 | 4164.85 ms | 32.4% bf16 MFU | 125487 tok/s step 2339/19560 | loss 3.805186 (+0.57z)| norm 0.3464 (+2.62z)| lr 7.90e-04 | 4231.22 ms | 31.9% bf16 MFU | 125408 tok/s step 2340/19560 | loss 3.664885 (-3.22z)| norm 0.3107 (+1.32z)| lr 7.90e-04 | 4163.93 ms | 32.4% bf16 MFU | 125433 tok/s step 2341/19560 | loss 3.745016 (-1.04z)| norm 0.3130 (+1.38z)| lr 7.90e-04 | 4169.18 ms | 32.4% bf16 MFU | 125449 tok/s step 2342/19560 | loss 3.738822 (-1.20z)| norm 0.3043 (+1.06z)| lr 7.90e-04 | 4188.42 ms | 32.2% bf16 MFU | 125435 tok/s step 2343/19560 | loss 3.774405 (-0.24z)| norm 0.2656 (-0.33z)| lr 7.90e-04 | 4167.13 ms | 32.4% bf16 MFU | 125454 tok/s step 2344/19560 | loss 3.888162 (+2.73z)| norm 0.2482 (-0.95z)| lr 7.90e-04 | 4180.12 ms | 32.3% bf16 MFU | 125453 tok/s step 2345/19560 | loss 3.804821 (+0.54z)| norm 0.2697 (-0.17z)| lr 7.90e-04 | 4166.42 ms | 32.4% bf16 MFU | 125472 tok/s step 2346/19560 | loss 3.715667 (-1.77z)| norm 0.2590 (-0.56z)| lr 7.90e-04 | 4180.04 ms | 32.3% bf16 MFU | 125470 tok/s step 2347/19560 | loss 3.750709 (-0.84z)| norm 0.2630 (-0.42z)| lr 7.90e-04 | 4178.46 ms | 32.3% bf16 MFU | 125470 tok/s step 2348/19560 | loss 3.777363 (-0.13z)| norm 0.3046 (+1.08z)| lr 7.90e-04 | 4161.26 ms | 32.4% bf16 MFU | 125496 tok/s step 2349/19560 | loss 3.810610 (+0.73z)| norm 0.2972 (+0.79z)| lr 7.90e-04 | 4178.16 ms | 32.3% bf16 MFU | 125495 tok/s step 2350/19560 | loss 3.749573 (-0.86z)| norm 0.2659 (-0.34z)| lr 7.90e-04 | 4176.29 ms | 32.3% bf16 MFU | 125498 tok/s step 2351/19560 | loss 3.814353 (+0.82z)| norm 0.2693 (-0.21z)| lr 7.90e-04 | 4171.41 ms | 32.4% bf16 MFU | 125507 tok/s step 2352/19560 | loss 3.775294 (-0.19z)| norm 0.2567 (-0.67z)| lr 7.90e-04 | 4181.00 ms | 32.3% bf16 MFU | 125502 tok/s step 2353/19560 | loss 3.775548 (-0.19z)| norm 
0.2492 (-0.93z)| lr 7.90e-04 | 4166.74 ms | 32.4% bf16 MFU | 125518 tok/s step 2354/19560 | loss 3.788206 (+0.15z)| norm 0.2875 (+0.47z)| lr 7.90e-04 | 4178.51 ms | 32.3% bf16 MFU | 125516 tok/s step 2355/19560 | loss 3.776297 (-0.16z)| norm 0.3407 (+2.36z)| lr 7.90e-04 | 4184.37 ms | 32.3% bf16 MFU | 125505 tok/s step 2356/19560 | loss 3.752470 (-0.78z)| norm 0.3154 (+1.44z)| lr 7.90e-04 | 4386.58 ms | 30.8% bf16 MFU | 125205 tok/s step 2357/19560 | loss 3.767370 (-0.38z)| norm 0.2484 (-0.94z)| lr 7.90e-04 | 4181.19 ms | 32.3% bf16 MFU | 125215 tok/s step 2358/19560 | loss 3.749166 (-0.86z)| norm 0.2628 (-0.42z)| lr 7.89e-04 | 4171.67 ms | 32.4% bf16 MFU | 125238 tok/s step 2359/19560 | loss 3.751323 (-0.80z)| norm 0.2412 (-1.19z)| lr 7.89e-04 | 4161.28 ms | 32.4% bf16 MFU | 125276 tok/s step 2360/19560 | loss 3.775014 (-0.18z)| norm 0.2527 (-0.78z)| lr 7.89e-04 | 4191.03 ms | 32.2% bf16 MFU | 125267 tok/s step 2361/19560 | loss 3.728400 (-1.38z)| norm 0.2672 (-0.28z)| lr 7.89e-04 | 4187.18 ms | 32.2% bf16 MFU | 125264 tok/s step 2362/19560 | loss 3.739717 (-1.07z)| norm 0.2487 (-0.94z)| lr 7.89e-04 | 4171.01 ms | 32.4% bf16 MFU | 125286 tok/s step 2363/19560 | loss 3.821781 (+1.05z)| norm 0.2469 (-1.00z)| lr 7.89e-04 | 4215.40 ms | 32.0% bf16 MFU | 125240 tok/s step 2364/19560 | loss 3.747569 (-0.86z)| norm 0.2601 (-0.53z)| lr 7.89e-04 | 4177.84 ms | 32.3% bf16 MFU | 125253 tok/s step 2365/19560 | loss 3.720301 (-1.54z)| norm 0.2510 (-0.86z)| lr 7.89e-04 | 4184.81 ms | 32.3% bf16 MFU | 125254 tok/s step 2366/19560 | loss 3.763258 (-0.43z)| norm 0.2534 (-0.77z)| lr 7.89e-04 | 4174.64 ms | 32.3% bf16 MFU | 125271 tok/s step 2367/19560 | loss 3.782883 (+0.07z)| norm 0.2436 (-1.12z)| lr 7.89e-04 | 4166.47 ms | 32.4% bf16 MFU | 125299 tok/s step 2368/19560 | loss 3.741744 (-0.97z)| norm 0.2514 (-0.85z)| lr 7.89e-04 | 4163.68 ms | 32.4% bf16 MFU | 125330 tok/s step 2369/19560 | loss 3.715749 (-1.61z)| norm 0.2481 (-0.95z)| lr 7.89e-04 | 4173.81 ms | 32.3% bf16 MFU | 
125344 tok/s step 2370/19560 | loss 3.775074 (-0.09z)| norm 0.2446 (-1.06z)| lr 7.89e-04 | 4180.60 ms | 32.3% bf16 MFU | 125348 tok/s step 2371/19560 | loss 3.750627 (-0.71z)| norm 0.2583 (-0.55z)| lr 7.89e-04 | 4211.58 ms | 32.1% bf16 MFU | 125305 tok/s step 2372/19560 | loss 3.734891 (-1.13z)| norm 0.2472 (-0.94z)| lr 7.89e-04 | 4183.81 ms | 32.3% bf16 MFU | 125305 tok/s step 2373/19560 | loss 3.775611 (-0.08z)| norm 0.2570 (-0.58z)| lr 7.89e-04 | 4172.58 ms | 32.4% bf16 MFU | 125322 tok/s step 2374/19560 | loss 3.752264 (-0.67z)| norm 0.2552 (-0.64z)| lr 7.89e-04 | 4185.68 ms | 32.3% bf16 MFU | 125319 tok/s step 2375/19560 | loss 3.858392 (+2.03z)| norm 0.2768 (+0.15z)| lr 7.89e-04 | 4162.36 ms | 32.4% bf16 MFU | 125351 tok/s step 2376/19560 | loss 3.725483 (-1.34z)| norm 0.2875 (+0.52z)| lr 7.89e-04 | 4263.80 ms | 31.7% bf16 MFU | 125232 tok/s step 2377/19560 | loss 3.771924 (-0.15z)| norm 0.2835 (+0.37z)| lr 7.89e-04 | 4173.57 ms | 32.4% bf16 MFU | 125251 tok/s step 2378/19560 | loss 3.724456 (-1.34z)| norm 0.2821 (+0.32z)| lr 7.89e-04 | 4166.87 ms | 32.4% bf16 MFU | 125280 tok/s step 2379/19560 | loss 3.769657 (-0.19z)| norm 0.2653 (-0.28z)| lr 7.89e-04 | 4171.62 ms | 32.4% bf16 MFU | 125300 tok/s step 2380/19560 | loss 3.786620 (+0.24z)| norm 0.2664 (-0.23z)| lr 7.89e-04 | 4234.78 ms | 31.9% bf16 MFU | 125225 tok/s step 2381/19560 | loss 3.781138 (+0.12z)| norm 0.2567 (-0.57z)| lr 7.89e-04 | 4175.34 ms | 32.3% bf16 MFU | 125242 tok/s step 2382/19560 | loss 3.758889 (-0.46z)| norm 0.2560 (-0.59z)| lr 7.89e-04 | 4167.08 ms | 32.4% bf16 MFU | 125271 tok/s step 2383/19560 | loss 3.722852 (-1.39z)| norm 0.2667 (-0.18z)| lr 7.89e-04 | 4157.86 ms | 32.5% bf16 MFU | 125312 tok/s step 2384/19560 | loss 3.701560 (-1.91z)| norm 0.2837 (+0.45z)| lr 7.89e-04 | 4302.44 ms | 31.4% bf16 MFU | 125139 tok/s step 2385/19560 | loss 3.720298 (-1.41z)| norm 0.2775 (+0.22z)| lr 7.89e-04 | 4161.79 ms | 32.4% bf16 MFU | 125181 tok/s step 2386/19560 | loss 3.777868 (+0.06z)| norm 
0.2942 (+0.84z)| lr 7.89e-04 | 4757.04 ms | 28.4% bf16 MFU | 124433 tok/s step 2387/19560 | loss 3.782694 (+0.21z)| norm 0.2772 (+0.19z)| lr 7.89e-04 | 4163.11 ms | 32.4% bf16 MFU | 124508 tok/s step 2388/19560 | loss 3.743837 (-0.81z)| norm 0.2645 (-0.28z)| lr 7.89e-04 | 4157.93 ms | 32.5% bf16 MFU | 124587 tok/s step 2389/19560 | loss 3.755300 (-0.52z)| norm 0.3078 (+1.32z)| lr 7.89e-04 | 4249.79 ms | 31.8% bf16 MFU | 124526 tok/s step 2390/19560 | loss 3.810909 (+0.96z)| norm 0.3165 (+1.61z)| lr 7.89e-04 | 4167.87 ms | 32.4% bf16 MFU | 124590 tok/s step 2391/19560 | loss 3.739585 (-0.92z)| norm 0.2961 (+0.85z)| lr 7.89e-04 | 4166.11 ms | 32.4% bf16 MFU | 124652 tok/s step 2392/19560 | loss 3.755859 (-0.48z)| norm 0.3056 (+1.25z)| lr 7.89e-04 | 4172.61 ms | 32.4% bf16 MFU | 124702 tok/s step 2393/19560 | loss 3.739188 (-0.91z)| norm 0.3111 (+1.57z)| lr 7.89e-04 | 4228.20 ms | 31.9% bf16 MFU | 124667 tok/s step 2394/19560 | loss 3.765914 (-0.19z)| norm 0.2698 (-0.09z)| lr 7.89e-04 | 4508.77 ms | 29.9% bf16 MFU | 124248 tok/s step 2395/19560 | loss 3.801658 (+0.76z)| norm 0.2601 (-0.47z)| lr 7.89e-04 | 4959.26 ms | 27.2% bf16 MFU | 123321 tok/s step 2396/19560 | loss 3.822455 (+1.30z)| norm 0.2818 (+0.40z)| lr 7.89e-04 | 5324.45 ms | 25.4% bf16 MFU | 122079 tok/s step 2397/19560 | loss 3.823794 (+1.34z)| norm 0.2674 (-0.18z)| lr 7.89e-04 | 4400.94 ms | 30.7% bf16 MFU | 121931 tok/s step 2398/19560 | loss 3.796767 (+0.63z)| norm 0.2527 (-0.76z)| lr 7.89e-04 | 4306.43 ms | 31.4% bf16 MFU | 121922 tok/s step 2399/19560 | loss 3.760890 (-0.34z)| norm 0.2642 (-0.29z)| lr 7.89e-04 | 4374.78 ms | 30.9% bf16 MFU | 121818 tok/s step 2400/19560 | loss 3.755064 (-0.49z)| norm 0.2623 (-0.36z)| lr 7.89e-04 | 4185.24 ms | 32.3% bf16 MFU | 121991 tok/s step 2401/19560 | loss 3.794223 (+0.56z)| norm 0.2534 (-0.73z)| lr 7.89e-04 | 4281.27 ms | 31.5% bf16 MFU | 122014 tok/s step 2402/19560 | loss 3.791452 (+0.48z)| norm 0.2826 (+0.45z)| lr 7.89e-04 | 4402.86 ms | 30.7% bf16 MFU | 
121867 tok/s step 2403/19560 | loss 3.832102 (+1.57z)| norm 0.2582 (-0.54z)| lr 7.89e-04 | 4182.33 ms | 32.3% bf16 MFU | 122042 tok/s step 2404/19560 | loss 3.838118 (+1.70z)| norm 0.3045 (+1.31z)| lr 7.89e-04 | 4192.10 ms | 32.2% bf16 MFU | 122193 tok/s step 2405/19560 | loss 3.823469 (+1.30z)| norm 0.3040 (+1.27z)| lr 7.89e-04 | 4153.49 ms | 32.5% bf16 MFU | 122395 tok/s step 2406/19560 | loss 3.706538 (-1.80z)| norm 0.2702 (-0.09z)| lr 7.89e-04 | 4190.99 ms | 32.2% bf16 MFU | 122530 tok/s step 2407/19560 | loss 3.762010 (-0.32z)| norm 0.2464 (-1.05z)| lr 7.89e-04 | 4178.69 ms | 32.3% bf16 MFU | 122677 tok/s step 2408/19560 | loss 3.735366 (-1.02z)| norm 0.2510 (-0.86z)| lr 7.89e-04 | 4169.79 ms | 32.4% bf16 MFU | 122830 tok/s step 2409/19560 | loss 3.800050 (+0.71z)| norm 0.2737 (+0.05z)| lr 7.89e-04 | 4209.32 ms | 32.1% bf16 MFU | 122916 tok/s step 2410/19560 | loss 3.745916 (-0.73z)| norm 0.2509 (-0.86z)| lr 7.89e-04 | 4172.25 ms | 32.4% bf16 MFU | 123053 tok/s step 2411/19560 | loss 3.814150 (+1.07z)| norm 0.2650 (-0.29z)| lr 7.89e-04 | 4171.20 ms | 32.4% bf16 MFU | 123185 tok/s step 2412/19560 | loss 3.772108 (-0.05z)| norm 0.2493 (-0.93z)| lr 7.89e-04 | 4161.90 ms | 32.4% bf16 MFU | 123325 tok/s step 2413/19560 | loss 3.794857 (+0.56z)| norm 0.2732 (+0.04z)| lr 7.89e-04 | 4158.36 ms | 32.5% bf16 MFU | 123462 tok/s step 2414/19560 | loss 3.766058 (-0.21z)| norm 0.3172 (+1.78z)| lr 7.89e-04 | 4212.35 ms | 32.1% bf16 MFU | 123513 tok/s step 2415/19560 | loss 3.752959 (-0.56z)| norm 0.2889 (+0.65z)| lr 7.89e-04 | 4186.34 ms | 32.3% bf16 MFU | 123599 tok/s step 2416/19560 | loss 3.728515 (-1.20z)| norm 0.3010 (+1.13z)| lr 7.89e-04 | 4167.39 ms | 32.4% bf16 MFU | 123709 tok/s step 2417/19560 | loss 3.781359 (+0.23z)| norm 0.3215 (+1.91z)| lr 7.89e-04 | 4168.09 ms | 32.4% bf16 MFU | 123813 tok/s step 2418/19560 | loss 3.769319 (-0.09z)| norm 0.3208 (+1.86z)| lr 7.89e-04 | 4159.76 ms | 32.5% bf16 MFU | 123924 tok/s step 2419/19560 | loss 3.762290 (-0.28z)| norm 
0.3042 (+1.20z)| lr 7.89e-04 | 4161.25 ms | 32.4% bf16 MFU | 124028 tok/s step 2420/19560 | loss 3.788543 (+0.42z)| norm 0.2631 (-0.38z)| lr 7.89e-04 | 4166.17 ms | 32.4% bf16 MFU | 124119 tok/s step 2421/19560 | loss 3.795231 (+0.59z)| norm 0.2385 (-1.32z)| lr 7.89e-04 | 4162.77 ms | 32.4% bf16 MFU | 124210 tok/s step 2422/19560 | loss 3.740337 (-0.88z)| norm 0.2322 (-1.54z)| lr 7.88e-04 | 4161.65 ms | 32.4% bf16 MFU | 124299 tok/s step 2423/19560 | loss 3.780401 (+0.23z)| norm 0.2685 (-0.14z)| lr 7.88e-04 | 4170.80 ms | 32.4% bf16 MFU | 124369 tok/s step 2424/19560 | loss 3.784999 (+0.35z)| norm 0.2658 (-0.25z)| lr 7.88e-04 | 4170.04 ms | 32.4% bf16 MFU | 124437 tok/s step 2425/19560 | loss 3.856904 (+2.29z)| norm 0.2494 (-0.86z)| lr 7.88e-04 | 4162.26 ms | 32.4% bf16 MFU | 124513 tok/s step 2426/19560 | loss 3.793055 (+0.55z)| norm 0.2749 (+0.13z)| lr 7.88e-04 | 4192.84 ms | 32.2% bf16 MFU | 124540 tok/s step 2427/19560 | loss 3.762796 (-0.28z)| norm 0.2918 (+0.77z)| lr 7.88e-04 | 4191.19 ms | 32.2% bf16 MFU | 124567 tok/s step 2428/19560 | loss 3.794343 (+0.57z)| norm 0.2955 (+0.90z)| lr 7.88e-04 | 4169.84 ms | 32.4% bf16 MFU | 124626 tok/s step 2429/19560 | loss 3.822841 (+1.33z)| norm 0.3548 (+3.07z)| lr 7.88e-04 | 4168.48 ms | 32.4% bf16 MFU | 124683 tok/s step 2430/19560 | loss 3.745441 (-0.77z)| norm 0.2921 (+0.71z)| lr 7.88e-04 | 4166.75 ms | 32.4% bf16 MFU | 124740 tok/s step 2431/19560 | loss 3.739002 (-0.94z)| norm 0.2657 (-0.27z)| lr 7.88e-04 | 4169.38 ms | 32.4% bf16 MFU | 124790 tok/s step 2432/19560 | loss 3.730132 (-1.17z)| norm 0.2774 (+0.16z)| lr 7.88e-04 | 4237.40 ms | 31.9% bf16 MFU | 124737 tok/s step 2433/19560 | loss 3.767642 (-0.15z)| norm 0.2885 (+0.57z)| lr 7.88e-04 | 4173.02 ms | 32.4% bf16 MFU | 124782 tok/s step 2434/19560 | loss 3.760074 (-0.34z)| norm 0.2674 (-0.22z)| lr 7.88e-04 | 4161.27 ms | 32.4% bf16 MFU | 124843 tok/s step 2435/19560 | loss 3.788884 (+0.46z)| norm 0.3048 (+1.16z)| lr 7.88e-04 | 4222.08 ms | 32.0% bf16 MFU | 
124810 tok/s step 2436/19560 | loss 3.806131 (+0.92z)| norm 0.3572 (+3.00z)| lr 7.88e-04 | 4168.00 ms | 32.4% bf16 MFU | 124859 tok/s step 2437/19560 | loss 3.776602 (+0.11z)| norm 0.3883 (+3.86z)| lr 7.88e-04 | 4169.62 ms | 32.4% bf16 MFU | 124903 tok/s step 2438/19560 | loss 3.763604 (-0.24z)| norm 0.3102 (+1.17z)| lr 7.88e-04 | 4169.79 ms | 32.4% bf16 MFU | 124944 tok/s step 2439/19560 | loss 3.764701 (-0.21z)| norm 0.2828 (+0.25z)| lr 7.88e-04 | 4166.46 ms | 32.4% bf16 MFU | 124989 tok/s step 2440/19560 | loss 3.726139 (-1.27z)| norm 0.2728 (-0.08z)| lr 7.88e-04 | 4165.61 ms | 32.4% bf16 MFU | 125032 tok/s step 2441/19560 | loss 3.798745 (+0.78z)| norm 0.2870 (+0.44z)| lr 7.88e-04 | 4237.16 ms | 31.9% bf16 MFU | 124968 tok/s step 2442/19560 | loss 3.776276 (+0.14z)| norm 0.2788 (+0.15z)| lr 7.88e-04 | 4173.38 ms | 32.4% bf16 MFU | 125001 tok/s step 2443/19560 | loss 3.749784 (-0.60z)| norm 0.2612 (-0.47z)| lr 7.88e-04 | 4172.33 ms | 32.4% bf16 MFU | 125033 tok/s step 2444/19560 | loss 3.790313 (+0.56z)| norm 0.2483 (-0.92z)| lr 7.88e-04 | 4169.44 ms | 32.4% bf16 MFU | 125069 tok/s step 2445/19560 | loss 3.761671 (-0.25z)| norm 0.2576 (-0.59z)| lr 7.88e-04 | 4162.70 ms | 32.4% bf16 MFU | 125113 tok/s step 2446/19560 | loss 3.828305 (+1.63z)| norm 0.2717 (-0.09z)| lr 7.88e-04 | 4168.08 ms | 32.4% bf16 MFU | 125147 tok/s step 2447/19560 | loss 3.738111 (-0.91z)| norm 0.2993 (+0.87z)| lr 7.88e-04 | 4172.53 ms | 32.4% bf16 MFU | 125172 tok/s step 2448/19560 | loss 3.820244 (+1.39z)| norm 0.2881 (+0.47z)| lr 7.88e-04 | 4165.17 ms | 32.4% bf16 MFU | 125207 tok/s step 2449/19560 | loss 3.826613 (+1.58z)| norm 0.2824 (+0.26z)| lr 7.88e-04 | 4164.12 ms | 32.4% bf16 MFU | 125242 tok/s step 2450/19560 | loss 3.785064 (+0.40z)| norm 0.2737 (-0.05z)| lr 7.88e-04 | 4164.95 ms | 32.4% bf16 MFU | 125274 tok/s step 2451/19560 | loss 3.742587 (-0.78z)| norm 0.2735 (-0.07z)| lr 7.88e-04 | 4512.98 ms | 29.9% bf16 MFU | 124819 tok/s step 2452/19560 | loss 3.778210 (+0.21z)| norm 
0.2832 (+0.27z)| lr 7.88e-04 | 4172.78 ms | 32.4% bf16 MFU | 124860 tok/s step 2453/19560 | loss 3.777826 (+0.19z)| norm 0.2803 (+0.15z)| lr 7.88e-04 | 4196.98 ms | 32.2% bf16 MFU | 124863 tok/s step 2454/19560 | loss 3.744071 (-0.76z)| norm 0.2840 (+0.28z)| lr 7.88e-04 | 4158.46 ms | 32.5% bf16 MFU | 124924 tok/s step 2455/19560 | loss 3.754190 (-0.48z)| norm 0.2454 (-1.12z)| lr 7.88e-04 | 4163.44 ms | 32.4% bf16 MFU | 124974 tok/s step 2456/19560 | loss 3.787577 (+0.47z)| norm 0.2387 (-1.36z)| lr 7.88e-04 | 4172.84 ms | 32.4% bf16 MFU | 125008 tok/s step 2457/19560 | loss 3.708584 (-1.76z)| norm 0.2431 (-1.20z)| lr 7.88e-04 | 4171.21 ms | 32.4% bf16 MFU | 125042 tok/s step 2458/19560 | loss 3.717605 (-1.48z)| norm 0.2386 (-1.35z)| lr 7.88e-04 | 4159.52 ms | 32.5% bf16 MFU | 125092 tok/s step 2459/19560 | loss 3.691835 (-2.16z)| norm 0.2321 (-1.57z)| lr 7.88e-04 | 4211.09 ms | 32.1% bf16 MFU | 125062 tok/s step 2460/19560 | loss 3.792274 (+0.65z)| norm 0.2347 (-1.47z)| lr 7.88e-04 | 4485.24 ms | 30.1% bf16 MFU | 124654 tok/s step 2461/19560 | loss 3.699941 (-1.89z)| norm 0.2515 (-0.87z)| lr 7.88e-04 | 4191.96 ms | 32.2% bf16 MFU | 124675 tok/s step 2462/19560 | loss 3.747198 (-0.58z)| norm 0.2587 (-0.62z)| lr 7.88e-04 | 4162.42 ms | 32.4% bf16 MFU | 124739 tok/s step 2463/19560 | loss 3.746388 (-0.61z)| norm 0.2575 (-0.66z)| lr 7.88e-04 | 4156.92 ms | 32.5% bf16 MFU | 124808 tok/s step 2464/19560 | loss 3.799080 (+0.85z)| norm 0.2733 (-0.08z)| lr 7.88e-04 | 4239.94 ms | 31.8% bf16 MFU | 124750 tok/s step 2465/19560 | loss 3.752265 (-0.45z)| norm 0.2997 (+0.87z)| lr 7.88e-04 | 4199.37 ms | 32.2% bf16 MFU | 124755 tok/s step 2466/19560 | loss 3.751014 (-0.49z)| norm 0.2959 (+0.75z)| lr 7.88e-04 | 4165.99 ms | 32.4% bf16 MFU | 124810 tok/s step 2467/19560 | loss 3.860484 (+2.49z)| norm 0.2758 (+0.03z)| lr 7.88e-04 | 4170.78 ms | 32.4% bf16 MFU | 124855 tok/s step 2468/19560 | loss 3.795013 (+0.70z)| norm 0.2881 (+0.51z)| lr 7.88e-04 | 4206.29 ms | 32.1% bf16 MFU | 
124844 tok/s step 2469/19560 | loss 3.788482 (+0.51z)| norm 0.2670 (-0.28z)| lr 7.88e-04 | 4173.03 ms | 32.4% bf16 MFU | 124884 tok/s step 2470/19560 | loss 3.735321 (-0.98z)| norm 0.2935 (+0.74z)| lr 7.88e-04 | 4157.64 ms | 32.5% bf16 MFU | 124945 tok/s step 2471/19560 | loss 3.862844 (+2.51z)| norm 0.3090 (+1.30z)| lr 7.88e-04 | 4171.06 ms | 32.4% bf16 MFU | 124982 tok/s step 2472/19560 | loss 3.790151 (+0.57z)| norm 0.3195 (+1.67z)| lr 7.88e-04 | 4165.44 ms | 32.4% bf16 MFU | 125027 tok/s step 2473/19560 | loss 3.796908 (+0.76z)| norm 0.3006 (+0.94z)| lr 7.88e-04 | 4169.20 ms | 32.4% bf16 MFU | 125063 tok/s step 2474/19560 | loss 3.759684 (-0.31z)| norm 0.3071 (+1.17z)| lr 7.88e-04 | 4168.92 ms | 32.4% bf16 MFU | 125098 tok/s step 2475/19560 | loss 3.788858 (+0.52z)| norm 0.2881 (+0.45z)| lr 7.88e-04 | 4167.91 ms | 32.4% bf16 MFU | 125133 tok/s step 2476/19560 | loss 3.774327 (+0.10z)| norm 0.2637 (-0.45z)| lr 7.88e-04 | 4176.39 ms | 32.3% bf16 MFU | 125153 tok/s step 2477/19560 | loss 3.753850 (-0.48z)| norm 0.2470 (-1.06z)| lr 7.88e-04 | 4239.97 ms | 31.8% bf16 MFU | 125078 tok/s step 2478/19560 | loss 3.718582 (-1.48z)| norm 0.2566 (-0.69z)| lr 7.88e-04 | 4174.38 ms | 32.3% bf16 MFU | 125104 tok/s step 2479/19560 | loss 3.768148 (-0.05z)| norm 0.2535 (-0.80z)| lr 7.88e-04 | 4182.67 ms | 32.3% bf16 MFU | 125116 tok/s step 2480/19560 | loss 3.793929 (+0.69z)| norm 0.2429 (-1.19z)| lr 7.88e-04 | 4165.29 ms | 32.4% bf16 MFU | 125154 tok/s step 2481/19560 | loss 3.742161 (-0.79z)| norm 0.2592 (-0.59z)| lr 7.88e-04 | 4188.01 ms | 32.2% bf16 MFU | 125155 tok/s step 2482/19560 | loss 3.783254 (+0.39z)| norm 0.2910 (+0.59z)| lr 7.87e-04 | 4168.22 ms | 32.4% bf16 MFU | 125187 tok/s step 2483/19560 | loss 3.789611 (+0.57z)| norm 0.2669 (-0.29z)| lr 7.87e-04 | 4173.06 ms | 32.4% bf16 MFU | 125209 tok/s step 2484/19560 | loss 3.806623 (+1.05z)| norm 0.2373 (-1.40z)| lr 7.87e-04 | 4167.35 ms | 32.4% bf16 MFU | 125239 tok/s step 2485/19560 | loss 3.752480 (-0.50z)| norm 
0.2370 (-1.40z)| lr 7.87e-04 | 4165.16 ms | 32.4% bf16 MFU | 125271 tok/s step 2486/19560 | loss 3.731024 (-1.11z)| norm 0.2642 (-0.37z)| lr 7.87e-04 | 4174.52 ms | 32.3% bf16 MFU | 125287 tok/s step 2487/19560 | loss 3.714358 (-1.56z)| norm 0.2555 (-0.70z)| lr 7.87e-04 | 4171.01 ms | 32.4% bf16 MFU | 125308 tok/s step 2488/19560 | loss 3.798990 (+0.82z)| norm 0.2351 (-1.47z)| lr 7.87e-04 | 4159.33 ms | 32.5% bf16 MFU | 125345 tok/s step 2489/19560 | loss 3.741971 (-0.79z)| norm 0.2496 (-0.91z)| lr 7.87e-04 | 4176.06 ms | 32.3% bf16 MFU | 125355 tok/s step 2490/19560 | loss 3.796964 (+0.75z)| norm 0.2838 (+0.38z)| lr 7.87e-04 | 4162.40 ms | 32.4% bf16 MFU | 125385 tok/s step 2491/19560 | loss 3.743012 (-0.76z)| norm 0.2658 (-0.32z)| lr 7.87e-04 | 4177.27 ms | 32.3% bf16 MFU | 125391 tok/s step 2492/19560 | loss 3.745411 (-0.69z)| norm 0.2489 (-0.95z)| lr 7.87e-04 | 4166.16 ms | 32.4% bf16 MFU | 125414 tok/s step 2493/19560 | loss 3.804263 (+0.97z)| norm 0.2879 (+0.52z)| lr 7.87e-04 | 4161.55 ms | 32.4% bf16 MFU | 125442 tok/s step 2494/19560 | loss 3.778104 (+0.22z)| norm 0.3241 (+1.85z)| lr 7.87e-04 | 4168.15 ms | 32.4% bf16 MFU | 125459 tok/s step 2495/19560 | loss 3.792351 (+0.62z)| norm 0.3004 (+0.95z)| lr 7.87e-04 | 4159.24 ms | 32.5% bf16 MFU | 125489 tok/s step 2496/19560 | loss 3.837755 (+1.88z)| norm 0.2952 (+0.74z)| lr 7.87e-04 | 4167.41 ms | 32.4% bf16 MFU | 125505 tok/s step 2497/19560 | loss 3.734966 (-1.04z)| norm 0.2599 (-0.60z)| lr 7.87e-04 | 4170.00 ms | 32.4% bf16 MFU | 125516 tok/s step 2498/19560 | loss 3.708946 (-1.74z)| norm 0.2429 (-1.24z)| lr 7.87e-04 | 4161.30 ms | 32.4% bf16 MFU | 125540 tok/s step 2499/19560 | loss 3.825456 (+1.50z)| norm 0.2434 (-1.21z)| lr 7.87e-04 | 4163.74 ms | 32.4% bf16 MFU | 125559 tok/s step 2500/19560 | loss 3.828784 (+1.57z)| norm 0.2505 (-0.94z)| lr 7.87e-04 | 4163.99 ms | 32.4% bf16 MFU | 125576 tok/s val loss 3.756273 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 
evaluating HellaSwag:
670/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2683/10042 = 0.267178 step 2501/19560 | loss 3.784120 (+0.33z)| norm 0.2857 (+0.37z)| lr 7.87e-04 | 4377.07 ms | 30.8% bf16 MFU | 125287 tok/s step
2502/19560 | loss 3.769490 (-0.08z)| norm 0.2811 (+0.19z)| lr 7.87e-04 | 4165.59 ms | 32.4% bf16 MFU | 125315 tok/s step 2503/19560 | loss 3.748487 (-0.65z)| norm 0.3091 (+1.24z)| lr 7.87e-04 | 4267.14 ms | 31.6% bf16 MFU | 125193 tok/s step 2504/19560 | loss 3.743174 (-0.81z)| norm 0.2651 (-0.41z)| lr 7.87e-04 | 4443.19 ms | 30.4% bf16 MFU | 124833 tok/s step 2505/19560 | loss 3.812870 (+1.15z)| norm 0.2340 (-1.55z)| lr 7.87e-04 | 4168.04 ms | 32.4% bf16 MFU | 124881 tok/s step 2506/19560 | loss 3.752210 (-0.57z)| norm 0.2625 (-0.49z)| lr 7.87e-04 | 4171.83 ms | 32.4% bf16 MFU | 124921 tok/s step 2507/19560 | loss 3.826261 (+1.51z)| norm 0.2858 (+0.38z)| lr 7.87e-04 | 4165.27 ms | 32.4% bf16 MFU | 124968 tok/s step 2508/19560 | loss 3.834904 (+1.72z)| norm 0.2961 (+0.75z)| lr 7.87e-04 | 4442.91 ms | 30.4% bf16 MFU | 124620 tok/s step 2509/19560 | loss 3.720503 (-1.44z)| norm 0.2869 (+0.40z)| lr 7.87e-04 | 4248.79 ms | 31.8% bf16 MFU | 124559 tok/s step 2510/19560 | loss 3.741571 (-0.85z)| norm 0.3082 (+1.17z)| lr 7.87e-04 | 4299.32 ms | 31.4% bf16 MFU | 124428 tok/s step 2511/19560 | loss 3.734109 (-1.06z)| norm 0.3301 (+1.94z)| lr 7.87e-04 | 4175.77 ms | 32.3% bf16 MFU | 124485 tok/s step 2512/19560 | loss 3.748244 (-0.69z)| norm 0.3063 (+1.06z)| lr 7.87e-04 | 4172.06 ms | 32.4% bf16 MFU | 124544 tok/s step 2513/19560 | loss 3.721153 (-1.45z)| norm 0.3016 (+0.88z)| lr 7.87e-04 | 4172.28 ms | 32.4% bf16 MFU | 124599 tok/s step 2514/19560 | loss 3.722425 (-1.40z)| norm 0.2748 (-0.09z)| lr 7.87e-04 | 4160.94 ms | 32.4% bf16 MFU | 124670 tok/s step 2515/19560 | loss 3.766893 (-0.15z)| norm 0.2632 (-0.51z)| lr 7.87e-04 | 4384.97 ms | 30.8% bf16 MFU | 124414 tok/s step 2516/19560 | loss 3.749546 (-0.64z)| norm 0.2626 (-0.53z)| lr 7.87e-04 | 4165.06 ms | 32.4% bf16 MFU | 124487 tok/s step 2517/19560 | loss 3.796932 (+0.67z)| norm 0.2622 (-0.53z)| lr 7.87e-04 | 4407.69 ms | 30.6% bf16 MFU | 124211 tok/s step 2518/19560 | loss 3.750966 (-0.60z)| norm 0.2705 (-0.22z)| lr 
7.87e-04 | 4246.93 ms | 31.8% bf16 MFU | 124173 tok/s step 2519/19560 | loss 3.778149 (+0.16z)| norm 0.2761 (-0.01z)| lr 7.87e-04 | 4168.68 ms | 32.4% bf16 MFU | 124252 tok/s step 2520/19560 | loss 3.720417 (-1.44z)| norm 0.2736 (-0.09z)| lr 7.87e-04 | 4162.74 ms | 32.4% bf16 MFU | 124337 tok/s step 2521/19560 | loss 3.735626 (-1.02z)| norm 0.2423 (-1.23z)| lr 7.87e-04 | 4264.43 ms | 31.7% bf16 MFU | 124267 tok/s step 2522/19560 | loss 3.699756 (-1.97z)| norm 0.2626 (-0.47z)| lr 7.87e-04 | 4165.74 ms | 32.4% bf16 MFU | 124347 tok/s step 2523/19560 | loss 3.727444 (-1.19z)| norm 0.2687 (-0.25z)| lr 7.87e-04 | 4181.91 ms | 32.3% bf16 MFU | 124398 tok/s step 2524/19560 | loss 3.728245 (-1.16z)| norm 0.2690 (-0.24z)| lr 7.87e-04 | 4170.94 ms | 32.4% bf16 MFU | 124463 tok/s step 2525/19560 | loss 3.772783 (+0.08z)| norm 0.2639 (-0.42z)| lr 7.87e-04 | 4166.64 ms | 32.4% bf16 MFU | 124532 tok/s step 2526/19560 | loss 3.688634 (-2.19z)| norm 0.2685 (-0.26z)| lr 7.87e-04 | 4180.05 ms | 32.3% bf16 MFU | 124576 tok/s step 2527/19560 | loss 3.798729 (+0.79z)| norm 0.2754 (-0.01z)| lr 7.87e-04 | 4167.38 ms | 32.4% bf16 MFU | 124638 tok/s step 2528/19560 | loss 3.733665 (-0.96z)| norm 0.3037 (+1.03z)| lr 7.87e-04 | 4174.97 ms | 32.3% bf16 MFU | 124685 tok/s step 2529/19560 | loss 3.694382 (-1.97z)| norm 0.2880 (+0.44z)| lr 7.87e-04 | 4178.73 ms | 32.3% bf16 MFU | 124724 tok/s step 2530/19560 | loss 3.761971 (-0.17z)| norm 0.3353 (+2.14z)| lr 7.87e-04 | 4167.75 ms | 32.4% bf16 MFU | 124778 tok/s step 2531/19560 | loss 3.771826 (+0.11z)| norm 0.3124 (+1.28z)| lr 7.87e-04 | 4170.88 ms | 32.4% bf16 MFU | 124824 tok/s step 2532/19560 | loss 3.771574 (+0.12z)| norm 0.2666 (-0.37z)| lr 7.87e-04 | 4168.24 ms | 32.4% bf16 MFU | 124872 tok/s step 2533/19560 | loss 3.821261 (+1.48z)| norm 0.2746 (-0.07z)| lr 7.87e-04 | 4170.33 ms | 32.4% bf16 MFU | 124914 tok/s step 2534/19560 | loss 3.756622 (-0.31z)| norm 0.2588 (-0.64z)| lr 7.87e-04 | 4174.75 ms | 32.3% bf16 MFU | 124948 tok/s step 
2535/19560 | loss 3.779708 (+0.33z)| norm 0.2558 (-0.76z)| lr 7.87e-04 | 4177.72 ms | 32.3% bf16 MFU | 124975 tok/s step 2536/19560 | loss 3.823426 (+1.51z)| norm 0.2306 (-1.66z)| lr 7.87e-04 | 4188.24 ms | 32.2% bf16 MFU | 124985 tok/s step 2537/19560 | loss 3.788646 (+0.56z)| norm 0.2701 (-0.23z)| lr 7.87e-04 | 4170.62 ms | 32.4% bf16 MFU | 125022 tok/s step 2538/19560 | loss 3.747779 (-0.57z)| norm 0.3174 (+1.47z)| lr 7.87e-04 | 4172.80 ms | 32.4% bf16 MFU | 125053 tok/s step 2539/19560 | loss 3.714952 (-1.45z)| norm 0.3095 (+1.16z)| lr 7.87e-04 | 4180.33 ms | 32.3% bf16 MFU | 125071 tok/s step 2540/19560 | loss 3.789818 (+0.60z)| norm 0.2703 (-0.25z)| lr 7.87e-04 | 4174.12 ms | 32.3% bf16 MFU | 125098 tok/s step 2541/19560 | loss 3.726782 (-1.11z)| norm 0.2784 (+0.04z)| lr 7.86e-04 | 4186.98 ms | 32.2% bf16 MFU | 125104 tok/s step 2542/19560 | loss 3.716266 (-1.38z)| norm 0.3118 (+1.25z)| lr 7.86e-04 | 4174.95 ms | 32.3% bf16 MFU | 125127 tok/s step 2543/19560 | loss 3.762753 (-0.11z)| norm 0.3003 (+0.83z)| lr 7.86e-04 | 4174.20 ms | 32.3% bf16 MFU | 125151 tok/s step 2544/19560 | loss 3.745607 (-0.59z)| norm 0.2761 (-0.04z)| lr 7.86e-04 | 4171.82 ms | 32.4% bf16 MFU | 125177 tok/s step 2545/19560 | loss 3.774309 (+0.20z)| norm 0.2688 (-0.29z)| lr 7.86e-04 | 4170.76 ms | 32.4% bf16 MFU | 125204 tok/s step 2546/19560 | loss 3.730306 (-0.99z)| norm 0.2715 (-0.18z)| lr 7.86e-04 | 4169.86 ms | 32.4% bf16 MFU | 125230 tok/s step 2547/19560 | loss 3.813595 (+1.25z)| norm 0.2680 (-0.30z)| lr 7.86e-04 | 4181.09 ms | 32.3% bf16 MFU | 125238 tok/s step 2548/19560 | loss 3.787063 (+0.54z)| norm 0.2717 (-0.17z)| lr 7.86e-04 | 4175.49 ms | 32.3% bf16 MFU | 125255 tok/s step 2549/19560 | loss 3.777341 (+0.28z)| norm 0.2759 (-0.02z)| lr 7.86e-04 | 4174.46 ms | 32.3% bf16 MFU | 125272 tok/s step 2550/19560 | loss 3.732378 (-0.93z)| norm 0.2560 (-0.78z)| lr 7.86e-04 | 4174.05 ms | 32.3% bf16 MFU | 125288 tok/s step 2551/19560 | loss 3.772870 (+0.16z)| norm 0.2758 (-0.03z)| lr 
7.86e-04 | 4161.57 ms | 32.4% bf16 MFU | 125323 tok/s step 2552/19560 | loss 3.777869 (+0.30z)| norm 0.3155 (+1.43z)| lr 7.86e-04 | 4176.32 ms | 32.3% bf16 MFU | 125334 tok/s step 2553/19560 | loss 3.810563 (+1.22z)| norm 0.2921 (+0.55z)| lr 7.86e-04 | 4169.46 ms | 32.4% bf16 MFU | 125354 tok/s step 2554/19560 | loss 3.707968 (-1.58z)| norm 0.2802 (+0.10z)| lr 7.86e-04 | 4171.41 ms | 32.4% bf16 MFU | 125371 tok/s step 2555/19560 | loss 3.828197 (+1.67z)| norm 0.2541 (-0.87z)| lr 7.86e-04 | 4174.70 ms | 32.3% bf16 MFU | 125382 tok/s step 2556/19560 | loss 3.786210 (+0.54z)| norm 0.2563 (-0.77z)| lr 7.86e-04 | 4167.72 ms | 32.4% bf16 MFU | 125403 tok/s step 2557/19560 | loss 3.780147 (+0.39z)| norm 0.2444 (-1.22z)| lr 7.86e-04 | 4174.16 ms | 32.3% bf16 MFU | 125413 tok/s step 2558/19560 | loss 3.832777 (+1.79z)| norm 0.2450 (-1.18z)| lr 7.86e-04 | 4169.72 ms | 32.4% bf16 MFU | 125429 tok/s step 2559/19560 | loss 3.719096 (-1.27z)| norm 0.2472 (-1.08z)| lr 7.86e-04 | 4181.61 ms | 32.3% bf16 MFU | 125426 tok/s step 2560/19560 | loss 3.729942 (-0.98z)| norm 0.2444 (-1.17z)| lr 7.86e-04 | 4195.53 ms | 32.2% bf16 MFU | 125403 tok/s step 2561/19560 | loss 3.726529 (-1.06z)| norm 0.2241 (-1.90z)| lr 7.86e-04 | 4175.32 ms | 32.3% bf16 MFU | 125411 tok/s step 2562/19560 | loss 3.784804 (+0.49z)| norm 0.2369 (-1.40z)| lr 7.86e-04 | 4172.32 ms | 32.4% bf16 MFU | 125424 tok/s step 2563/19560 | loss 3.757486 (-0.23z)| norm 0.2452 (-1.08z)| lr 7.86e-04 | 4168.94 ms | 32.4% bf16 MFU | 125441 tok/s step 2564/19560 | loss 3.745824 (-0.53z)| norm 0.2505 (-0.88z)| lr 7.86e-04 | 4172.86 ms | 32.4% bf16 MFU | 125451 tok/s step 2565/19560 | loss 3.684891 (-2.11z)| norm 0.2674 (-0.21z)| lr 7.86e-04 | 4178.23 ms | 32.3% bf16 MFU | 125452 tok/s step 2566/19560 | loss 3.744921 (-0.52z)| norm 0.2599 (-0.51z)| lr 7.86e-04 | 4170.52 ms | 32.4% bf16 MFU | 125465 tok/s step 2567/19560 | loss 3.730120 (-0.90z)| norm 0.2360 (-1.49z)| lr 7.86e-04 | 4173.20 ms | 32.4% bf16 MFU | 125474 tok/s step 
2568/19560 | loss 3.738790 (-0.68z)| norm 0.2290 (-1.75z)| lr 7.86e-04 | 4167.42 ms | 32.4% bf16 MFU | 125490 tok/s step 2569/19560 | loss 3.739925 (-0.64z)| norm 0.2471 (-0.98z)| lr 7.86e-04 | 4177.27 ms | 32.3% bf16 MFU | 125491 tok/s step 2570/19560 | loss 3.688507 (-1.95z)| norm 0.2603 (-0.43z)| lr 7.86e-04 | 4178.98 ms | 32.3% bf16 MFU | 125490 tok/s step 2571/19560 | loss 3.758713 (-0.12z)| norm 0.2911 (+0.83z)| lr 7.86e-04 | 4267.17 ms | 31.6% bf16 MFU | 125358 tok/s step 2572/19560 | loss 3.810016 (+1.20z)| norm 0.3396 (+2.74z)| lr 7.86e-04 | 4263.61 ms | 31.7% bf16 MFU | 125239 tok/s step 2573/19560 | loss 3.764833 (+0.03z)| norm 0.3736 (+3.83z)| lr 7.86e-04 | 4170.16 ms | 32.4% bf16 MFU | 125263 tok/s step 2574/19560 | loss 3.721316 (-1.08z)| norm 0.3225 (+1.85z)| lr 7.86e-04 | 4167.36 ms | 32.4% bf16 MFU | 125290 tok/s step 2575/19560 | loss 3.711061 (-1.34z)| norm 0.2919 (+0.71z)| lr 7.86e-04 | 4169.02 ms | 32.4% bf16 MFU | 125314 tok/s step 2576/19560 | loss 3.746378 (-0.41z)| norm 0.2933 (+0.76z)| lr 7.86e-04 | 4168.66 ms | 32.4% bf16 MFU | 125337 tok/s step 2577/19560 | loss 3.741016 (-0.54z)| norm 0.2861 (+0.49z)| lr 7.86e-04 | 4173.32 ms | 32.4% bf16 MFU | 125351 tok/s step 2578/19560 | loss 3.754738 (-0.17z)| norm 0.2730 (+0.00z)| lr 7.86e-04 | 4169.70 ms | 32.4% bf16 MFU | 125370 tok/s step 2579/19560 | loss 3.706606 (-1.42z)| norm 0.2858 (+0.48z)| lr 7.86e-04 | 4230.06 ms | 31.9% bf16 MFU | 125299 tok/s step 2580/19560 | loss 3.704167 (-1.46z)| norm 0.2363 (-1.35z)| lr 7.86e-04 | 4160.19 ms | 32.5% bf16 MFU | 125335 tok/s step 2581/19560 | loss 3.801117 (+1.06z)| norm 0.2293 (-1.58z)| lr 7.86e-04 | 4193.55 ms | 32.2% bf16 MFU | 125320 tok/s step 2582/19560 | loss 3.747661 (-0.33z)| norm 0.2616 (-0.39z)| lr 7.86e-04 | 4163.40 ms | 32.4% bf16 MFU | 125350 tok/s step 2583/19560 | loss 3.747711 (-0.33z)| norm 0.2573 (-0.55z)| lr 7.86e-04 | 4166.89 ms | 32.4% bf16 MFU | 125374 tok/s step 2584/19560 | loss 3.781016 (+0.54z)| norm 0.2357 (-1.34z)| lr 
7.86e-04 | 4183.05 ms | 32.3% bf16 MFU | 125372 tok/s step 2585/19560 | loss 3.784908 (+0.63z)| norm 0.2705 (-0.07z)| lr 7.86e-04 | 4845.51 ms | 27.9% bf16 MFU | 124513 tok/s step 2586/19560 | loss 3.787506 (+0.68z)| norm 0.2724 (-0.01z)| lr 7.86e-04 | 4672.86 ms | 28.9% bf16 MFU | 123898 tok/s step 2587/19560 | loss 3.762261 (+0.00z)| norm 0.2837 (+0.40z)| lr 7.86e-04 | 4552.44 ms | 29.7% bf16 MFU | 123461 tok/s step 2588/19560 | loss 3.756420 (-0.14z)| norm 0.2927 (+0.72z)| lr 7.86e-04 | 4337.70 ms | 31.1% bf16 MFU | 123331 tok/s step 2589/19560 | loss 3.738293 (-0.64z)| norm 0.2396 (-1.27z)| lr 7.86e-04 | 4250.26 ms | 31.8% bf16 MFU | 123333 tok/s step 2590/19560 | loss 3.670959 (-2.38z)| norm 0.2567 (-0.63z)| lr 7.86e-04 | 4297.43 ms | 31.4% bf16 MFU | 123266 tok/s step 2591/19560 | loss 3.783488 (+0.57z)| norm 0.2607 (-0.48z)| lr 7.86e-04 | 4291.04 ms | 31.5% bf16 MFU | 123212 tok/s step 2592/19560 | loss 3.731838 (-0.77z)| norm 0.2526 (-0.78z)| lr 7.86e-04 | 4243.04 ms | 31.8% bf16 MFU | 123229 tok/s step 2593/19560 | loss 3.740034 (-0.56z)| norm 0.2691 (-0.15z)| lr 7.86e-04 | 4174.26 ms | 32.3% bf16 MFU | 123348 tok/s step 2594/19560 | loss 3.827311 (+1.70z)| norm 0.2596 (-0.50z)| lr 7.86e-04 | 4282.50 ms | 31.5% bf16 MFU | 123302 tok/s step 2595/19560 | loss 3.755390 (-0.15z)| norm 0.2658 (-0.26z)| lr 7.86e-04 | 4173.36 ms | 32.4% bf16 MFU | 123418 tok/s step 2596/19560 | loss 3.764271 (+0.10z)| norm 0.3103 (+1.41z)| lr 7.86e-04 | 4382.79 ms | 30.8% bf16 MFU | 123228 tok/s step 2597/19560 | loss 3.755377 (-0.14z)| norm 0.2922 (+0.72z)| lr 7.85e-04 | 4284.53 ms | 31.5% bf16 MFU | 123185 tok/s step 2598/19560 | loss 3.771289 (+0.28z)| norm 0.2500 (-0.85z)| lr 7.85e-04 | 4215.12 ms | 32.0% bf16 MFU | 123245 tok/s step 2599/19560 | loss 3.726788 (-0.91z)| norm 0.2664 (-0.23z)| lr 7.85e-04 | 4226.08 ms | 31.9% bf16 MFU | 123286 tok/s step 2600/19560 | loss 3.762834 (+0.09z)| norm 0.2574 (-0.55z)| lr 7.85e-04 | 4344.36 ms | 31.1% bf16 MFU | 123156 tok/s step 
2601/19560 | loss 3.825627 (+1.81z)| norm 0.2551 (-0.63z)| lr 7.85e-04 | 4327.48 ms | 31.2% bf16 MFU | 123056 tok/s step 2602/19560 | loss 3.784594 (+0.68z)| norm 0.2658 (-0.21z)| lr 7.85e-04 | 8421.37 ms | 16.0% bf16 MFU | 120016 tok/s step 2603/19560 | loss 3.743539 (-0.44z)| norm 0.2755 (+0.17z)| lr 7.85e-04 | 4166.63 ms | 32.4% bf16 MFU | 120306 tok/s step 2604/19560 | loss 3.746911 (-0.34z)| norm 0.2632 (-0.30z)| lr 7.85e-04 | 4177.27 ms | 32.3% bf16 MFU | 120567 tok/s step 2605/19560 | loss 3.757073 (-0.06z)| norm 0.2755 (+0.16z)| lr 7.85e-04 | 4203.67 ms | 32.1% bf16 MFU | 120774 tok/s step 2606/19560 | loss 3.739337 (-0.56z)| norm 0.2490 (-0.86z)| lr 7.85e-04 | 4281.61 ms | 31.5% bf16 MFU | 120858 tok/s step 2607/19560 | loss 3.796173 (+1.00z)| norm 0.2683 (-0.12z)| lr 7.85e-04 | 4178.08 ms | 32.3% bf16 MFU | 121090 tok/s step 2608/19560 | loss 3.782004 (+0.62z)| norm 0.3109 (+1.50z)| lr 7.85e-04 | 4314.14 ms | 31.3% bf16 MFU | 121111 tok/s step 2609/19560 | loss 3.759637 (-0.00z)| norm 0.2973 (+0.97z)| lr 7.85e-04 | 4161.89 ms | 32.4% bf16 MFU | 121355 tok/s step 2610/19560 | loss 3.721771 (-1.03z)| norm 0.3081 (+1.37z)| lr 7.85e-04 | 4193.77 ms | 32.2% bf16 MFU | 121538 tok/s step 2611/19560 | loss 3.785827 (+0.73z)| norm 0.3098 (+1.41z)| lr 7.85e-04 | 4278.48 ms | 31.6% bf16 MFU | 121588 tok/s step 2612/19560 | loss 3.790580 (+0.87z)| norm 0.2997 (+1.01z)| lr 7.85e-04 | 4218.97 ms | 32.0% bf16 MFU | 121722 tok/s step 2613/19560 | loss 3.787231 (+0.77z)| norm 0.2853 (+0.45z)| lr 7.85e-04 | 4169.81 ms | 32.4% bf16 MFU | 121922 tok/s step 2614/19560 | loss 3.730594 (-0.79z)| norm 0.2522 (-0.81z)| lr 7.85e-04 | 4203.64 ms | 32.1% bf16 MFU | 122062 tok/s step 2615/19560 | loss 3.735602 (-0.66z)| norm 0.2597 (-0.53z)| lr 7.85e-04 | 4170.65 ms | 32.4% bf16 MFU | 122245 tok/s step 2616/19560 | loss 3.725928 (-0.92z)| norm 0.2623 (-0.44z)| lr 7.85e-04 | 4189.55 ms | 32.2% bf16 MFU | 122390 tok/s step 2617/19560 | loss 3.747915 (-0.31z)| norm 0.2504 (-0.90z)| lr 
7.85e-04 | 4176.46 ms | 32.3% bf16 MFU | 122547 tok/s step 2618/19560 | loss 3.744013 (-0.41z)| norm 0.2673 (-0.24z)| lr 7.85e-04 | 4178.43 ms | 32.3% bf16 MFU | 122693 tok/s step 2619/19560 | loss 3.733471 (-0.70z)| norm 0.2648 (-0.34z)| lr 7.85e-04 | 4183.44 ms | 32.3% bf16 MFU | 122825 tok/s step 2620/19560 | loss 3.753607 (-0.14z)| norm 0.2970 (+0.89z)| lr 7.85e-04 | 4174.66 ms | 32.3% bf16 MFU | 122963 tok/s step 2621/19560 | loss 3.782370 (+0.67z)| norm 0.3424 (+2.56z)| lr 7.85e-04 | 4178.67 ms | 32.3% bf16 MFU | 123088 tok/s step 2622/19560 | loss 3.775854 (+0.49z)| norm 0.3116 (+1.41z)| lr 7.85e-04 | 4210.15 ms | 32.1% bf16 MFU | 123160 tok/s step 2623/19560 | loss 3.755318 (-0.08z)| norm 0.3160 (+1.57z)| lr 7.85e-04 | 4181.92 ms | 32.3% bf16 MFU | 123271 tok/s step 2624/19560 | loss 3.781226 (+0.67z)| norm 0.2744 (+0.00z)| lr 7.85e-04 | 4183.79 ms | 32.3% bf16 MFU | 123373 tok/s step 2625/19560 | loss 3.700267 (-1.62z)| norm 0.2392 (-1.31z)| lr 7.85e-04 | 4179.76 ms | 32.3% bf16 MFU | 123476 tok/s step 2626/19560 | loss 3.717622 (-1.13z)| norm 0.2375 (-1.37z)| lr 7.85e-04 | 4182.09 ms | 32.3% bf16 MFU | 123570 tok/s step 2627/19560 | loss 3.768030 (+0.32z)| norm 0.2547 (-0.73z)| lr 7.85e-04 | 4183.66 ms | 32.3% bf16 MFU | 123658 tok/s step 2628/19560 | loss 3.807038 (+1.46z)| norm 0.2737 (-0.02z)| lr 7.85e-04 | 4205.23 ms | 32.1% bf16 MFU | 123709 tok/s step 2629/19560 | loss 3.726320 (-0.87z)| norm 0.2379 (-1.35z)| lr 7.85e-04 | 4188.50 ms | 32.2% bf16 MFU | 123782 tok/s step 2630/19560 | loss 3.736974 (-0.56z)| norm 0.2360 (-1.40z)| lr 7.85e-04 | 4175.80 ms | 32.3% bf16 MFU | 123871 tok/s step 2631/19560 | loss 3.780593 (+0.70z)| norm 0.2328 (-1.49z)| lr 7.85e-04 | 4201.89 ms | 32.1% bf16 MFU | 123916 tok/s step 2632/19560 | loss 3.696869 (-1.69z)| norm 0.2392 (-1.24z)| lr 7.85e-04 | 4207.11 ms | 32.1% bf16 MFU | 123951 tok/s step 2633/19560 | loss 3.733177 (-0.64z)| norm 0.2626 (-0.39z)| lr 7.85e-04 | 4199.73 ms | 32.1% bf16 MFU | 123995 tok/s step 
2634/19560 | loss 3.761872 (+0.18z)| norm 0.2683 (-0.18z)| lr 7.85e-04 | 4185.36 ms | 32.3% bf16 MFU | 124059 tok/s step 2635/19560 | loss 3.819610 (+1.86z)| norm 0.2800 (+0.26z)| lr 7.85e-04 | 4189.54 ms | 32.2% bf16 MFU | 124113 tok/s step 2636/19560 | loss 3.783736 (+0.85z)| norm 0.2850 (+0.45z)| lr 7.85e-04 | 4177.27 ms | 32.3% bf16 MFU | 124183 tok/s step 2637/19560 | loss 3.777201 (+0.64z)| norm 0.2959 (+0.85z)| lr 7.85e-04 | 4203.70 ms | 32.1% bf16 MFU | 124210 tok/s step 2638/19560 | loss 3.759669 (+0.12z)| norm 0.2773 (+0.17z)| lr 7.85e-04 | 4184.72 ms | 32.3% bf16 MFU | 124264 tok/s step 2639/19560 | loss 3.742877 (-0.38z)| norm 0.2514 (-0.79z)| lr 7.85e-04 | 4188.65 ms | 32.2% bf16 MFU | 124309 tok/s step 2640/19560 | loss 3.711062 (-1.30z)| norm 0.2549 (-0.65z)| lr 7.85e-04 | 4173.27 ms | 32.4% bf16 MFU | 124375 tok/s step 2641/19560 | loss 3.762346 (+0.20z)| norm 0.2265 (-1.70z)| lr 7.85e-04 | 4205.97 ms | 32.1% bf16 MFU | 124389 tok/s step 2642/19560 | loss 3.728662 (-0.80z)| norm 0.2702 (-0.04z)| lr 7.85e-04 | 4179.89 ms | 32.3% bf16 MFU | 124441 tok/s step 2643/19560 | loss 3.708567 (-1.37z)| norm 0.2717 (+0.02z)| lr 7.85e-04 | 4207.52 ms | 32.1% bf16 MFU | 124449 tok/s step 2644/19560 | loss 3.743215 (-0.35z)| norm 0.2626 (-0.32z)| lr 7.85e-04 | 4184.02 ms | 32.3% bf16 MFU | 124492 tok/s step 2645/19560 | loss 3.772206 (+0.51z)| norm 0.2591 (-0.46z)| lr 7.85e-04 | 4200.34 ms | 32.1% bf16 MFU | 124509 tok/s step 2646/19560 | loss 3.869285 (+3.20z)| norm 0.2361 (-1.32z)| lr 7.85e-04 | 4193.91 ms | 32.2% bf16 MFU | 124534 tok/s step 2647/19560 | loss 3.747330 (-0.24z)| norm 0.2498 (-0.79z)| lr 7.85e-04 | 4176.41 ms | 32.3% bf16 MFU | 124584 tok/s step 2648/19560 | loss 3.733189 (-0.64z)| norm 0.2552 (-0.58z)| lr 7.85e-04 | 4173.91 ms | 32.3% bf16 MFU | 124635 tok/s step 2649/19560 | loss 3.733963 (-0.62z)| norm 0.2475 (-0.87z)| lr 7.85e-04 | 9039.05 ms | 14.9% bf16 MFU | 121304 tok/s step 2650/19560 | loss 3.733092 (-0.66z)| norm 0.2888 (+0.68z)| lr 
7.85e-04 | 4159.29 ms | 32.5% bf16 MFU | 121541 tok/s step 2651/19560 | loss 3.730859 (-0.72z)| norm 0.2713 (+0.02z)| lr 7.85e-04 | 4160.50 ms | 32.5% bf16 MFU | 121765 tok/s step 2652/19560 | loss 3.804270 (+1.36z)| norm 0.2476 (-0.86z)| lr 7.84e-04 | 4170.36 ms | 32.4% bf16 MFU | 121962 tok/s step 2653/19560 | loss 3.754675 (-0.05z)| norm 0.2478 (-0.85z)| lr 7.84e-04 | 4178.67 ms | 32.3% bf16 MFU | 122138 tok/s step 2654/19560 | loss 3.747675 (-0.27z)| norm 0.2568 (-0.51z)| lr 7.84e-04 | 4342.89 ms | 31.1% bf16 MFU | 122067 tok/s step 2655/19560 | loss 3.750112 (-0.19z)| norm 0.2552 (-0.56z)| lr 7.84e-04 | 4216.88 ms | 32.0% bf16 MFU | 122180 tok/s step 2656/19560 | loss 3.736493 (-0.59z)| norm 0.2463 (-0.88z)| lr 7.84e-04 | 4170.42 ms | 32.4% bf16 MFU | 122357 tok/s step 2657/19560 | loss 3.725517 (-0.92z)| norm 0.2517 (-0.67z)| lr 7.84e-04 | 4182.67 ms | 32.3% bf16 MFU | 122506 tok/s step 2658/19560 | loss 3.786053 (+0.85z)| norm 0.2500 (-0.72z)| lr 7.84e-04 | 4184.97 ms | 32.3% bf16 MFU | 122645 tok/s step 2659/19560 | loss 3.729157 (-0.81z)| norm 0.2643 (-0.16z)| lr 7.84e-04 | 4167.84 ms | 32.4% bf16 MFU | 122802 tok/s step 2660/19560 | loss 3.749034 (-0.22z)| norm 0.2826 (+0.54z)| lr 7.84e-04 | 4163.36 ms | 32.4% bf16 MFU | 122959 tok/s step 2661/19560 | loss 3.749306 (-0.20z)| norm 0.2917 (+0.89z)| lr 7.84e-04 | 4158.73 ms | 32.5% bf16 MFU | 123114 tok/s step 2662/19560 | loss 3.777288 (+0.63z)| norm 0.2702 (+0.05z)| lr 7.84e-04 | 4168.79 ms | 32.4% bf16 MFU | 123247 tok/s step 2663/19560 | loss 3.707856 (-1.41z)| norm 0.2909 (+0.84z)| lr 7.84e-04 | 4163.24 ms | 32.4% bf16 MFU | 123381 tok/s step 2664/19560 | loss 3.760778 (+0.17z)| norm 0.2481 (-0.82z)| lr 7.84e-04 | 4179.10 ms | 32.3% bf16 MFU | 123485 tok/s step 2665/19560 | loss 3.763878 (+0.27z)| norm 0.2651 (-0.16z)| lr 7.84e-04 | 4167.87 ms | 32.4% bf16 MFU | 123600 tok/s step 2666/19560 | loss 3.786926 (+0.95z)| norm 0.2787 (+0.38z)| lr 7.84e-04 | 4167.95 ms | 32.4% bf16 MFU | 123710 tok/s step 
2667/19560 | loss 3.754913 (-0.02z)| norm 0.2904 (+0.85z)| lr 7.84e-04 | 4209.90 ms | 32.1% bf16 MFU | 123751 tok/s step 2668/19560 | loss 3.787804 (+0.97z)| norm 0.2877 (+0.74z)| lr 7.84e-04 | 4166.44 ms | 32.4% bf16 MFU | 123855 tok/s step 2669/19560 | loss 3.756031 (+0.01z)| norm 0.2947 (+1.01z)| lr 7.84e-04 | 4170.17 ms | 32.4% bf16 MFU | 123949 tok/s step 2670/19560 | loss 3.726446 (-0.89z)| norm 0.2792 (+0.41z)| lr 7.84e-04 | 4176.90 ms | 32.3% bf16 MFU | 124027 tok/s step 2671/19560 | loss 3.778724 (+0.69z)| norm 0.2746 (+0.24z)| lr 7.84e-04 | 4174.15 ms | 32.3% bf16 MFU | 124106 tok/s step 2672/19560 | loss 3.749620 (-0.19z)| norm 0.3296 (+2.37z)| lr 7.84e-04 | 4166.88 ms | 32.4% bf16 MFU | 124192 tok/s step 2673/19560 | loss 3.721098 (-1.04z)| norm 0.2909 (+0.85z)| lr 7.84e-04 | 4175.71 ms | 32.3% bf16 MFU | 124260 tok/s step 2674/19560 | loss 3.722019 (-1.01z)| norm 0.2544 (-0.57z)| lr 7.84e-04 | 4172.87 ms | 32.4% bf16 MFU | 124329 tok/s step 2675/19560 | loss 3.754141 (-0.03z)| norm 0.2622 (-0.27z)| lr 7.84e-04 | 4175.53 ms | 32.3% bf16 MFU | 124391 tok/s step 2676/19560 | loss 3.677483 (-2.30z)| norm 0.2372 (-1.22z)| lr 7.84e-04 | 4168.99 ms | 32.4% bf16 MFU | 124459 tok/s step 2677/19560 | loss 3.755190 (+0.03z)| norm 0.2413 (-1.05z)| lr 7.84e-04 | 4168.05 ms | 32.4% bf16 MFU | 124526 tok/s step 2678/19560 | loss 3.700579 (-1.59z)| norm 0.2497 (-0.72z)| lr 7.84e-04 | 4164.68 ms | 32.4% bf16 MFU | 124594 tok/s step 2679/19560 | loss 3.782621 (+0.85z)| norm 0.2297 (-1.46z)| lr 7.84e-04 | 4177.20 ms | 32.3% bf16 MFU | 124640 tok/s step 2680/19560 | loss 3.723704 (-0.88z)| norm 0.2399 (-1.06z)| lr 7.84e-04 | 4175.58 ms | 32.3% bf16 MFU | 124686 tok/s step 2681/19560 | loss 3.791818 (+1.15z)| norm 0.2587 (-0.33z)| lr 7.84e-04 | 4184.63 ms | 32.3% bf16 MFU | 124716 tok/s step 2682/19560 | loss 3.724020 (-0.88z)| norm 0.2865 (+0.74z)| lr 7.84e-04 | 4163.05 ms | 32.4% bf16 MFU | 124777 tok/s step 2683/19560 | loss 3.706902 (-1.39z)| norm 0.2807 (+0.51z)| lr 
7.84e-04 | 4167.12 ms | 32.4% bf16 MFU | 124829 tok/s step 2684/19560 | loss 3.740911 (-0.34z)| norm 0.2665 (-0.04z)| lr 7.84e-04 | 4179.12 ms | 32.3% bf16 MFU | 124860 tok/s step 2685/19560 | loss 3.794361 (+1.28z)| norm 0.2984 (+1.18z)| lr 7.84e-04 | 4163.86 ms | 32.4% bf16 MFU | 124913 tok/s step 2686/19560 | loss 3.690969 (-1.85z)| norm 0.2817 (+0.52z)| lr 7.84e-04 | 4180.94 ms | 32.3% bf16 MFU | 124937 tok/s step 2687/19560 | loss 3.705187 (-1.40z)| norm 0.2683 (-0.00z)| lr 7.84e-04 | 4172.21 ms | 32.4% bf16 MFU | 124974 tok/s step 2688/19560 | loss 3.735949 (-0.46z)| norm 0.2659 (-0.10z)| lr 7.84e-04 | 4175.81 ms | 32.3% bf16 MFU | 125003 tok/s step 2689/19560 | loss 3.745473 (-0.18z)| norm 0.2922 (+0.91z)| lr 7.84e-04 | 4175.25 ms | 32.3% bf16 MFU | 125031 tok/s step 2690/19560 | loss 3.818738 (+2.04z)| norm 0.2925 (+0.90z)| lr 7.84e-04 | 4171.40 ms | 32.4% bf16 MFU | 125064 tok/s step 2691/19560 | loss 3.768957 (+0.53z)| norm 0.2883 (+0.73z)| lr 7.84e-04 | 4169.70 ms | 32.4% bf16 MFU | 125097 tok/s step 2692/19560 | loss 3.776655 (+0.75z)| norm 0.2840 (+0.55z)| lr 7.84e-04 | 4182.14 ms | 32.3% bf16 MFU | 125111 tok/s step 2693/19560 | loss 3.745888 (-0.20z)| norm 0.2679 (-0.09z)| lr 7.84e-04 | 4176.29 ms | 32.3% bf16 MFU | 125132 tok/s step 2694/19560 | loss 3.759442 (+0.22z)| norm 0.2803 (+0.39z)| lr 7.84e-04 | 4175.42 ms | 32.3% bf16 MFU | 125154 tok/s step 2695/19560 | loss 3.685839 (-2.01z)| norm 0.2378 (-1.28z)| lr 7.84e-04 | 4166.35 ms | 32.4% bf16 MFU | 125188 tok/s step 2696/19560 | loss 3.749160 (-0.09z)| norm 0.2524 (-0.72z)| lr 7.84e-04 | 4172.11 ms | 32.4% bf16 MFU | 125212 tok/s step 2697/19560 | loss 3.701115 (-1.52z)| norm 0.2415 (-1.15z)| lr 7.84e-04 | 4174.27 ms | 32.3% bf16 MFU | 125231 tok/s step 2698/19560 | loss 3.699570 (-1.58z)| norm 0.2288 (-1.63z)| lr 7.84e-04 | 4203.96 ms | 32.1% bf16 MFU | 125205 tok/s step 2699/19560 | loss 3.765505 (+0.41z)| norm 0.2530 (-0.66z)| lr 7.84e-04 | 4175.35 ms | 32.3% bf16 MFU | 125223 tok/s step 
2700/19560 | loss 3.755623 (+0.12z)| norm 0.2458 (-0.95z)| lr 7.84e-04 | 4159.55 ms | 32.5% bf16 MFU | 125265 tok/s step 2701/19560 | loss 3.718825 (-0.99z)| norm 0.2563 (-0.52z)| lr 7.84e-04 | 4174.55 ms | 32.3% bf16 MFU | 125281 tok/s step 2702/19560 | loss 3.838459 (+2.57z)| norm 0.2848 (+0.75z)| lr 7.84e-04 | 4174.73 ms | 32.3% bf16 MFU | 125296 tok/s step 2703/19560 | loss 3.747270 (-0.15z)| norm 0.2664 (-0.06z)| lr 7.84e-04 | 4172.10 ms | 32.4% bf16 MFU | 125315 tok/s step 2704/19560 | loss 3.719886 (-0.96z)| norm 0.2715 (+0.18z)| lr 7.83e-04 | 4310.22 ms | 31.3% bf16 MFU | 125131 tok/s step 2705/19560 | loss 3.817400 (+1.90z)| norm 0.2597 (-0.34z)| lr 7.83e-04 | 4249.65 ms | 31.8% bf16 MFU | 125043 tok/s step 2706/19560 | loss 3.711519 (-1.20z)| norm 0.2530 (-0.64z)| lr 7.83e-04 | 4171.06 ms | 32.4% bf16 MFU | 125076 tok/s step 2707/19560 | loss 3.704492 (-1.40z)| norm 0.2434 (-1.05z)| lr 7.83e-04 | 4164.04 ms | 32.4% bf16 MFU | 125117 tok/s step 2708/19560 | loss 3.756956 (+0.12z)| norm 0.2544 (-0.57z)| lr 7.83e-04 | 4189.18 ms | 32.2% bf16 MFU | 125119 tok/s step 2709/19560 | loss 3.754565 (+0.06z)| norm 0.2761 (+0.40z)| lr 7.83e-04 | 4163.52 ms | 32.4% bf16 MFU | 125159 tok/s step 2710/19560 | loss 3.750558 (-0.06z)| norm 0.2810 (+0.61z)| lr 7.83e-04 | 4176.97 ms | 32.3% bf16 MFU | 125177 tok/s step 2711/19560 | loss 3.720708 (-0.93z)| norm 0.3367 (+3.01z)| lr 7.83e-04 | 4159.71 ms | 32.5% bf16 MFU | 125220 tok/s step 2712/19560 | loss 3.783140 (+0.91z)| norm 0.3547 (+3.60z)| lr 7.83e-04 | 4177.04 ms | 32.3% bf16 MFU | 125235 tok/s step 2713/19560 | loss 3.812399 (+1.75z)| norm 0.3204 (+2.10z)| lr 7.83e-04 | 4212.44 ms | 32.1% bf16 MFU | 125196 tok/s step 2714/19560 | loss 3.766476 (+0.41z)| norm 0.3001 (+1.25z)| lr 7.83e-04 | 4168.46 ms | 32.4% bf16 MFU | 125225 tok/s step 2715/19560 | loss 3.751909 (-0.01z)| norm 0.2956 (+1.06z)| lr 7.83e-04 | 4175.32 ms | 32.3% bf16 MFU | 125243 tok/s step 2716/19560 | loss 3.737256 (-0.44z)| norm 0.2909 (+0.86z)| lr 
7.83e-04 | 4162.09 ms | 32.4% bf16 MFU | 125279 tok/s step 2717/19560 | loss 3.707412 (-1.30z)| norm 0.2441 (-1.05z)| lr 7.83e-04 | 4173.05 ms | 32.4% bf16 MFU | 125297 tok/s step 2718/19560 | loss 3.687081 (-1.91z)| norm 0.2416 (-1.15z)| lr 7.83e-04 | 4166.59 ms | 32.4% bf16 MFU | 125323 tok/s step 2719/19560 | loss 3.712600 (-1.14z)| norm 0.2419 (-1.12z)| lr 7.83e-04 | 4175.96 ms | 32.3% bf16 MFU | 125335 tok/s step 2720/19560 | loss 3.709985 (-1.21z)| norm 0.2301 (-1.58z)| lr 7.83e-04 | 4187.54 ms | 32.2% bf16 MFU | 125328 tok/s step 2721/19560 | loss 3.765818 (+0.42z)| norm 0.2277 (-1.65z)| lr 7.83e-04 | 4174.87 ms | 32.3% bf16 MFU | 125341 tok/s step 2722/19560 | loss 3.736595 (-0.42z)| norm 0.2282 (-1.60z)| lr 7.83e-04 | 4177.82 ms | 32.3% bf16 MFU | 125348 tok/s step 2723/19560 | loss 3.716788 (-1.00z)| norm 0.2275 (-1.60z)| lr 7.83e-04 | 4167.91 ms | 32.4% bf16 MFU | 125370 tok/s step 2724/19560 | loss 3.730114 (-0.60z)| norm 0.2351 (-1.29z)| lr 7.83e-04 | 4168.38 ms | 32.4% bf16 MFU | 125391 tok/s step 2725/19560 | loss 3.771416 (+0.62z)| norm 0.2502 (-0.68z)| lr 7.83e-04 | 4170.90 ms | 32.4% bf16 MFU | 125406 tok/s step 2726/19560 | loss 3.776344 (+0.77z)| norm 0.2518 (-0.62z)| lr 7.83e-04 | 4177.29 ms | 32.3% bf16 MFU | 125411 tok/s step 2727/19560 | loss 3.753056 (+0.07z)| norm 0.2516 (-0.62z)| lr 7.83e-04 | 4157.41 ms | 32.5% bf16 MFU | 125446 tok/s step 2728/19560 | loss 3.704564 (-1.34z)| norm 0.2454 (-0.86z)| lr 7.83e-04 | 4160.21 ms | 32.5% bf16 MFU | 125475 tok/s step 2729/19560 | loss 3.717606 (-0.95z)| norm 0.2715 (+0.16z)| lr 7.83e-04 | 4162.49 ms | 32.4% bf16 MFU | 125499 tok/s step 2730/19560 | loss 3.762468 (+0.40z)| norm 0.3043 (+1.42z)| lr 7.83e-04 | 4168.99 ms | 32.4% bf16 MFU | 125512 tok/s step 2731/19560 | loss 3.711941 (-1.10z)| norm 0.3429 (+2.81z)| lr 7.83e-04 | 4162.17 ms | 32.4% bf16 MFU | 125535 tok/s step 2732/19560 | loss 3.726801 (-0.65z)| norm 0.2981 (+1.11z)| lr 7.83e-04 | 4175.06 ms | 32.3% bf16 MFU | 125537 tok/s step 
2733/19560 | loss 3.755386 (+0.20z)| norm 0.2948 (+0.97z)| lr 7.83e-04 | 4193.38 ms | 32.2% bf16 MFU | 125511 tok/s step 2734/19560 | loss 3.840522 (+2.64z)| norm 0.2792 (+0.38z)| lr 7.83e-04 | 4156.04 ms | 32.5% bf16 MFU | 125543 tok/s step 2735/19560 | loss 3.758718 (+0.28z)| norm 0.2632 (-0.22z)| lr 7.83e-04 | 4176.02 ms | 32.3% bf16 MFU | 125544 tok/s step 2736/19560 | loss 3.727347 (-0.63z)| norm 0.2780 (+0.35z)| lr 7.83e-04 | 4169.32 ms | 32.4% bf16 MFU | 125554 tok/s step 2737/19560 | loss 3.731401 (-0.50z)| norm 0.2913 (+0.86z)| lr 7.83e-04 | 4164.73 ms | 32.4% bf16 MFU | 125571 tok/s step 2738/19560 | loss 3.785993 (+1.08z)| norm 0.2651 (-0.12z)| lr 7.83e-04 | 4166.68 ms | 32.4% bf16 MFU | 125584 tok/s step 2739/19560 | loss 3.690939 (-1.67z)| norm 0.2402 (-1.06z)| lr 7.83e-04 | 4164.96 ms | 32.4% bf16 MFU | 125598 tok/s step 2740/19560 | loss 3.714901 (-0.96z)| norm 0.2283 (-1.49z)| lr 7.83e-04 | 4179.26 ms | 32.3% bf16 MFU | 125591 tok/s step 2741/19560 | loss 3.722061 (-0.73z)| norm 0.2417 (-0.96z)| lr 7.83e-04 | 4177.47 ms | 32.3% bf16 MFU | 125587 tok/s step 2742/19560 | loss 3.697704 (-1.43z)| norm 0.2609 (-0.23z)| lr 7.83e-04 | 4180.23 ms | 32.3% bf16 MFU | 125578 tok/s step 2743/19560 | loss 3.759592 (+0.36z)| norm 0.2401 (-1.02z)| lr 7.83e-04 | 4166.72 ms | 32.4% bf16 MFU | 125591 tok/s step 2744/19560 | loss 3.700543 (-1.34z)| norm 0.2534 (-0.51z)| lr 7.83e-04 | 4171.75 ms | 32.4% bf16 MFU | 125595 tok/s step 2745/19560 | loss 3.761368 (+0.41z)| norm 0.2721 (+0.20z)| lr 7.83e-04 | 4174.95 ms | 32.3% bf16 MFU | 125594 tok/s step 2746/19560 | loss 3.726240 (-0.59z)| norm 0.2687 (+0.07z)| lr 7.83e-04 | 4181.33 ms | 32.3% bf16 MFU | 125584 tok/s step 2747/19560 | loss 3.753637 (+0.19z)| norm 0.2387 (-1.06z)| lr 7.83e-04 | 4175.20 ms | 32.3% bf16 MFU | 125583 tok/s step 2748/19560 | loss 3.704579 (-1.21z)| norm 0.2371 (-1.11z)| lr 7.83e-04 | 4188.37 ms | 32.2% bf16 MFU | 125563 tok/s step 2749/19560 | loss 3.684895 (-1.73z)| norm 0.2487 (-0.66z)| lr 
7.83e-04 | 4165.03 ms | 32.4% bf16 MFU | 125579 tok/s step 2750/19560 | loss 3.733437 (-0.35z)| norm 0.2669 (+0.07z)| lr 7.83e-04 | 4162.14 ms | 32.4% bf16 MFU | 125598 tok/s val loss 3.716935
HellaSwag: 2668/10042 = 0.265684 step 2751/19560 | loss 3.743752 (-0.05z)| norm 0.2893 (+0.98z)| lr 7.83e-04 | 4271.55 ms | 31.6% bf16 MFU | 125455 tok/s step 2752/19560 | loss 3.738723 (-0.18z)| norm 0.2860 (+0.84z)| lr 7.83e-04 | 4360.71 ms | 31.0% bf16 MFU | 125194 tok/s step 2753/19560 | loss 3.723744 (-0.62z)| norm 0.2798 (+0.58z)| lr 7.83e-04 | 4171.28 ms | 32.4% bf16 MFU | 125219 tok/s step 2754/19560 | loss 3.734051 (-0.33z)| norm 0.2811 (+0.62z)| lr 7.83e-04 | 4217.84 ms | 32.0% bf16 MFU | 125173 tok/s step 2755/19560 | loss 3.722792 (-0.64z)| norm 0.2675 (+0.07z)| lr 7.82e-04 | 4363.40 ms | 30.9% bf16 MFU | 124922 tok/s step 2756/19560 | loss 3.729249 (-0.44z)| norm 0.2775 (+0.47z)| lr 7.82e-04 | 4313.78 ms | 31.3% bf16 MFU | 124753 tok/s step 2757/19560 | loss 3.755491 (+0.31z)| norm 0.2814 (+0.62z)| lr 7.82e-04 | 4267.75 ms | 31.6% bf16 MFU | 124658 tok/s step 2758/19560 | loss 3.725032 (-0.57z)| norm 0.2898 (+0.94z)| lr 7.82e-04 | 4196.29 ms | 32.2% bf16 MFU | 124672 tok/s step 2759/19560 | loss 3.650544 (-2.65z)| norm 0.2928 (+1.05z)| lr 7.82e-04 | 4315.92 ms | 31.3% bf16 MFU | 124512 tok/s step 2760/19560 | loss 3.710246 (-0.96z)| norm 0.2681 (+0.03z)| lr 7.82e-04 | 4277.77 ms | 31.6% bf16 MFU | 124415 tok/s step 2761/19560 | loss 3.757684 (+0.39z)| norm 0.2726 (+0.22z)| lr 7.82e-04 | 4251.05 ms | 31.8% bf16 MFU | 124360 tok/s step 2762/19560 | loss 3.713741 (-0.85z)| norm 0.2499 (-0.71z)| lr 7.82e-04 | 4160.91 ms | 32.4% bf16 MFU | 124443 tok/s step 2763/19560 | loss 3.714542 (-0.82z)| norm 0.2830 (+0.64z)| lr 7.82e-04 | 4242.38 ms | 31.8% bf16 MFU | 124400 tok/s step 2764/19560 | loss 3.755145 (+0.37z)| norm 0.2392 (-1.13z)| lr 7.82e-04 | 4159.05 ms | 32.5% bf16 MFU | 124483 tok/s step 2765/19560 | loss 3.763625 (+0.62z)| norm 0.2287 (-1.53z)| lr 7.82e-04 | 4160.35 ms | 32.5% bf16 MFU | 124559 tok/s step 2766/19560 | loss
3.708175 (-0.98z)| norm 0.2542 (-0.49z)| lr 7.82e-04 | 4276.93 ms | 31.6% bf16 MFU | 124461 tok/s step 2767/19560 | loss 3.703180 (-1.11z)| norm 0.2793 (+0.52z)| lr 7.82e-04 | 4165.78 ms | 32.4% bf16 MFU | 124530 tok/s step 2768/19560 | loss 3.745903 (+0.11z)| norm 0.2889 (+0.90z)| lr 7.82e-04 | 4153.46 ms | 32.5% bf16 MFU | 124615 tok/s step 2769/19560 | loss 3.769408 (+0.79z)| norm 0.2693 (+0.09z)| lr 7.82e-04 | 4168.00 ms | 32.4% bf16 MFU | 124674 tok/s step 2770/19560 | loss 3.676403 (-1.87z)| norm 0.2492 (-0.72z)| lr 7.82e-04 | 4468.20 ms | 30.2% bf16 MFU | 124307 tok/s step 2771/19560 | loss 3.848752 (+2.94z)| norm 0.2695 (+0.11z)| lr 7.82e-04 | 4245.01 ms | 31.8% bf16 MFU | 124267 tok/s step 2772/19560 | loss 3.752078 (+0.26z)| norm 0.2825 (+0.63z)| lr 7.82e-04 | 4168.75 ms | 32.4% bf16 MFU | 124342 tok/s step 2773/19560 | loss 3.727436 (-0.42z)| norm 0.3265 (+2.36z)| lr 7.82e-04 | 4187.50 ms | 32.2% bf16 MFU | 124385 tok/s step 2774/19560 | loss 3.722585 (-0.55z)| norm 0.3643 (+3.64z)| lr 7.82e-04 | 4174.89 ms | 32.3% bf16 MFU | 124445 tok/s step 2775/19560 | loss 3.704100 (-1.07z)| norm 0.3366 (+2.51z)| lr 7.82e-04 | 4168.57 ms | 32.4% bf16 MFU | 124511 tok/s step 2776/19560 | loss 3.679663 (-1.75z)| norm 0.2789 (+0.35z)| lr 7.82e-04 | 4867.28 ms | 27.7% bf16 MFU | 123672 tok/s step 2777/19560 | loss 3.743084 (+0.07z)| norm 0.3227 (+1.94z)| lr 7.82e-04 | 4667.48 ms | 28.9% bf16 MFU | 123104 tok/s step 2778/19560 | loss 3.693739 (-1.33z)| norm 0.3412 (+2.54z)| lr 7.82e-04 | 4292.49 ms | 31.5% bf16 MFU | 123056 tok/s step 2779/19560 | loss 3.684915 (-1.55z)| norm 0.2703 (-0.00z)| lr 7.82e-04 | 4200.18 ms | 32.1% bf16 MFU | 123145 tok/s step 2780/19560 | loss 3.669566 (-1.95z)| norm 0.2773 (+0.24z)| lr 7.82e-04 | 4390.65 ms | 30.8% bf16 MFU | 122958 tok/s step 2781/19560 | loss 3.712965 (-0.72z)| norm 0.2742 (+0.12z)| lr 7.82e-04 | 4302.06 ms | 31.4% bf16 MFU | 122904 tok/s step 2782/19560 | loss 3.636378 (-2.77z)| norm 0.2539 (-0.61z)| lr 7.82e-04 | 4280.81 
ms | 31.5% bf16 MFU | 122882 tok/s step 2783/19560 | loss 3.700212 (-1.01z)| norm 0.2574 (-0.48z)| lr 7.82e-04 | 4421.89 ms | 30.5% bf16 MFU | 122666 tok/s step 2784/19560 | loss 3.678737 (-1.57z)| norm 0.2523 (-0.67z)| lr 7.82e-04 | 4196.50 ms | 32.2% bf16 MFU | 122780 tok/s step 2785/19560 | loss 3.706075 (-0.82z)| norm 0.2416 (-1.05z)| lr 7.82e-04 | 4296.67 ms | 31.4% bf16 MFU | 122742 tok/s step 2786/19560 | loss 3.682962 (-1.42z)| norm 0.2553 (-0.56z)| lr 7.82e-04 | 4150.41 ms | 32.5% bf16 MFU | 122921 tok/s step 2787/19560 | loss 3.677215 (-1.55z)| norm 0.2495 (-0.76z)| lr 7.82e-04 | 4212.92 ms | 32.0% bf16 MFU | 122997 tok/s step 2788/19560 | loss 3.649606 (-2.22z)| norm 0.2690 (-0.06z)| lr 7.82e-04 | 4203.12 ms | 32.1% bf16 MFU | 123084 tok/s step 2789/19560 | loss 3.673220 (-1.57z)| norm 0.2825 (+0.43z)| lr 7.82e-04 | 4331.06 ms | 31.2% bf16 MFU | 122983 tok/s step 2790/19560 | loss 3.674115 (-1.52z)| norm 0.2357 (-1.24z)| lr 7.82e-04 | 4166.07 ms | 32.4% bf16 MFU | 123126 tok/s step 2791/19560 | loss 3.677914 (-1.41z)| norm 0.2353 (-1.23z)| lr 7.82e-04 | 4221.28 ms | 32.0% bf16 MFU | 123180 tok/s step 2792/19560 | loss 3.701866 (-0.79z)| norm 0.2405 (-1.04z)| lr 7.82e-04 | 4275.57 ms | 31.6% bf16 MFU | 123152 tok/s step 2793/19560 | loss 3.713717 (-0.47z)| norm 0.2436 (-0.92z)| lr 7.82e-04 | 4156.02 ms | 32.5% bf16 MFU | 123302 tok/s step 2794/19560 | loss 3.709633 (-0.57z)| norm 0.2704 (+0.03z)| lr 7.82e-04 | 4156.70 ms | 32.5% bf16 MFU | 123443 tok/s step 2795/19560 | loss 3.721540 (-0.25z)| norm 0.2484 (-0.74z)| lr 7.82e-04 | 4281.26 ms | 31.5% bf16 MFU | 123394 tok/s step 2796/19560 | loss 3.742241 (+0.29z)| norm 0.2522 (-0.59z)| lr 7.82e-04 | 4196.83 ms | 32.2% bf16 MFU | 123471 tok/s step 2797/19560 | loss 3.614380 (-2.91z)| norm 0.2520 (-0.59z)| lr 7.82e-04 | 4163.49 ms | 32.4% bf16 MFU | 123593 tok/s step 2798/19560 | loss 3.658006 (-1.77z)| norm 0.2641 (-0.15z)| lr 7.82e-04 | 4165.12 ms | 32.4% bf16 MFU | 123708 tok/s step 2799/19560 | loss 
3.672664 (-1.39z)| norm 0.2970 (+1.01z)| lr 7.82e-04 | 4198.12 ms | 32.2% bf16 MFU | 123766 tok/s step 2800/19560 | loss 3.707916 (-0.50z)| norm 0.3696 (+3.46z)| lr 7.82e-04 | 4171.99 ms | 32.4% bf16 MFU | 123862 tok/s step 2801/19560 | loss 3.784925 (+1.39z)| norm 0.3730 (+3.40z)| lr 7.82e-04 | 4166.30 ms | 32.4% bf16 MFU | 123960 tok/s step 2802/19560 | loss 3.625506 (-2.46z)| norm 0.3004 (+1.00z)| lr 7.82e-04 | 4163.71 ms | 32.4% bf16 MFU | 124058 tok/s step 2803/19560 | loss 3.681478 (-1.10z)| norm 0.2975 (+0.89z)| lr 7.82e-04 | 4165.83 ms | 32.4% bf16 MFU | 124148 tok/s step 2804/19560 | loss 3.759352 (+0.75z)| norm 0.2724 (+0.06z)| lr 7.82e-04 | 4175.60 ms | 32.3% bf16 MFU | 124219 tok/s step 2805/19560 | loss 3.708458 (-0.46z)| norm 0.2477 (-0.75z)| lr 7.81e-04 | 4180.31 ms | 32.3% bf16 MFU | 124279 tok/s step 2806/19560 | loss 3.787478 (+1.42z)| norm 0.2793 (+0.28z)| lr 7.81e-04 | 4165.40 ms | 32.4% bf16 MFU | 124358 tok/s step 2807/19560 | loss 3.687812 (-0.95z)| norm 0.2968 (+0.84z)| lr 7.81e-04 | 4220.83 ms | 32.0% bf16 MFU | 124351 tok/s step 2808/19560 | loss 3.720801 (-0.16z)| norm 0.3009 (+0.96z)| lr 7.81e-04 | 4282.27 ms | 31.5% bf16 MFU | 124255 tok/s step 2809/19560 | loss 3.647907 (-1.88z)| norm 0.2668 (-0.17z)| lr 7.81e-04 | 4208.85 ms | 32.1% bf16 MFU | 124271 tok/s step 2810/19560 | loss 3.692505 (-0.80z)| norm 0.2483 (-0.77z)| lr 7.81e-04 | 4174.82 ms | 32.3% bf16 MFU | 124336 tok/s step 2811/19560 | loss 3.677878 (-1.14z)| norm 0.2404 (-1.02z)| lr 7.81e-04 | 4186.13 ms | 32.3% bf16 MFU | 124382 tok/s step 2812/19560 | loss 3.647065 (-1.83z)| norm 0.2522 (-0.62z)| lr 7.81e-04 | 4218.31 ms | 32.0% bf16 MFU | 124377 tok/s step 2813/19560 | loss 3.665003 (-1.39z)| norm 0.2402 (-1.00z)| lr 7.81e-04 | 4164.54 ms | 32.4% bf16 MFU | 124453 tok/s step 2814/19560 | loss 3.666305 (-1.35z)| norm 0.2598 (-0.35z)| lr 7.81e-04 | 4167.19 ms | 32.4% bf16 MFU | 124521 tok/s step 2815/19560 | loss 3.656651 (-1.55z)| norm 0.2609 (-0.31z)| lr 7.81e-04 | 4171.63 
ms | 32.4% bf16 MFU | 124579 tok/s step 2816/19560 | loss 3.665894 (-1.32z)| norm 0.2515 (-0.62z)| lr 7.81e-04 | 4240.90 ms | 31.8% bf16 MFU | 124531 tok/s step 2817/19560 | loss 3.693832 (-0.66z)| norm 0.2599 (-0.33z)| lr 7.81e-04 | 4181.65 ms | 32.3% bf16 MFU | 124574 tok/s step 2818/19560 | loss 3.664664 (-1.32z)| norm 0.3072 (+1.21z)| lr 7.81e-04 | 4169.81 ms | 32.4% bf16 MFU | 124632 tok/s step 2819/19560 | loss 3.632675 (-2.02z)| norm 0.2529 (-0.56z)| lr 7.81e-04 | 4162.29 ms | 32.4% bf16 MFU | 124698 tok/s step 2820/19560 | loss 3.699100 (-0.48z)| norm 0.2629 (-0.22z)| lr 7.81e-04 | 4162.74 ms | 32.4% bf16 MFU | 124761 tok/s step 2821/19560 | loss 3.728116 (+0.20z)| norm 0.2692 (-0.02z)| lr 7.81e-04 | 4163.76 ms | 32.4% bf16 MFU | 124818 tok/s step 2822/19560 | loss 3.670158 (-1.13z)| norm 0.2434 (-0.85z)| lr 7.81e-04 | 4167.28 ms | 32.4% bf16 MFU | 124868 tok/s step 2823/19560 | loss 3.713649 (-0.13z)| norm 0.2523 (-0.57z)| lr 7.81e-04 | 4163.65 ms | 32.4% bf16 MFU | 124921 tok/s step 2824/19560 | loss 3.671737 (-1.08z)| norm 0.2447 (-0.81z)| lr 7.81e-04 | 4191.11 ms | 32.2% bf16 MFU | 124929 tok/s step 2825/19560 | loss 3.665907 (-1.21z)| norm 0.2750 (+0.17z)| lr 7.81e-04 | 4165.78 ms | 32.4% bf16 MFU | 124976 tok/s step 2826/19560 | loss 3.698820 (-0.45z)| norm 0.3063 (+1.18z)| lr 7.81e-04 | 4161.33 ms | 32.4% bf16 MFU | 125026 tok/s step 2827/19560 | loss 3.775963 (+1.32z)| norm 0.3133 (+1.39z)| lr 7.81e-04 | 4163.77 ms | 32.4% bf16 MFU | 125071 tok/s step 2828/19560 | loss 3.657213 (-1.38z)| norm 0.2620 (-0.29z)| lr 7.81e-04 | 4168.10 ms | 32.4% bf16 MFU | 125107 tok/s step 2829/19560 | loss 3.657890 (-1.35z)| norm 0.2400 (-1.01z)| lr 7.81e-04 | 4191.86 ms | 32.2% bf16 MFU | 125105 tok/s step 2830/19560 | loss 3.680269 (-0.83z)| norm 0.2312 (-1.27z)| lr 7.81e-04 | 4175.59 ms | 32.3% bf16 MFU | 125128 tok/s step 2831/19560 | loss 3.659625 (-1.29z)| norm 0.2365 (-1.09z)| lr 7.81e-04 | 4167.55 ms | 32.4% bf16 MFU | 125161 tok/s step 2832/19560 | loss 
3.750435 (+0.81z)| norm 0.2410 (-0.93z)| lr 7.81e-04 | 4173.36 ms | 32.4% bf16 MFU | 125185 tok/s step 2833/19560 | loss 3.690529 (-0.57z)| norm 0.2664 (-0.12z)| lr 7.81e-04 | 4170.65 ms | 32.4% bf16 MFU | 125211 tok/s step 2834/19560 | loss 3.746999 (+0.76z)| norm 0.2433 (-0.86z)| lr 7.81e-04 | 4169.24 ms | 32.4% bf16 MFU | 125238 tok/s step 2835/19560 | loss 3.735435 (+0.48z)| norm 0.2582 (-0.38z)| lr 7.81e-04 | 4165.09 ms | 32.4% bf16 MFU | 125270 tok/s step 2836/19560 | loss 3.687571 (-0.64z)| norm 0.2445 (-0.82z)| lr 7.81e-04 | 4170.95 ms | 32.4% bf16 MFU | 125291 tok/s step 2837/19560 | loss 3.724175 (+0.24z)| norm 0.2925 (+0.72z)| lr 7.81e-04 | 4172.30 ms | 32.4% bf16 MFU | 125310 tok/s step 2838/19560 | loss 3.764967 (+1.20z)| norm 0.3044 (+1.09z)| lr 7.81e-04 | 4165.58 ms | 32.4% bf16 MFU | 125337 tok/s step 2839/19560 | loss 3.799806 (+1.98z)| norm 0.3113 (+1.33z)| lr 7.81e-04 | 4174.57 ms | 32.3% bf16 MFU | 125350 tok/s step 2840/19560 | loss 3.700341 (-0.33z)| norm 0.2811 (+0.39z)| lr 7.81e-04 | 4173.70 ms | 32.3% bf16 MFU | 125363 tok/s step 2841/19560 | loss 3.659268 (-1.29z)| norm 0.2606 (-0.28z)| lr 7.81e-04 | 4164.67 ms | 32.4% bf16 MFU | 125390 tok/s step 2842/19560 | loss 3.701380 (-0.27z)| norm 0.2751 (+0.21z)| lr 7.81e-04 | 4169.32 ms | 32.4% bf16 MFU | 125408 tok/s step 2843/19560 | loss 3.655848 (-1.34z)| norm 0.2649 (-0.12z)| lr 7.81e-04 | 4155.48 ms | 32.5% bf16 MFU | 125446 tok/s step 2844/19560 | loss 3.712538 (+0.02z)| norm 0.2576 (-0.37z)| lr 7.81e-04 | 4179.96 ms | 32.3% bf16 MFU | 125445 tok/s step 2845/19560 | loss 3.692630 (-0.45z)| norm 0.2648 (-0.13z)| lr 7.81e-04 | 4207.70 ms | 32.1% bf16 MFU | 125403 tok/s step 2846/19560 | loss 3.712873 (+0.03z)| norm 0.2573 (-0.39z)| lr 7.81e-04 | 4160.00 ms | 32.5% bf16 MFU | 125434 tok/s step 2847/19560 | loss 3.630480 (-1.91z)| norm 0.2353 (-1.13z)| lr 7.81e-04 | 4164.28 ms | 32.4% bf16 MFU | 125457 tok/s step 2848/19560 | loss 3.629136 (-1.90z)| norm 0.2543 (-0.50z)| lr 7.81e-04 | 4164.85 
ms | 32.4% bf16 MFU | 125479 tok/s step 2849/19560 | loss 3.693347 (-0.39z)| norm 0.2850 (+0.55z)| lr 7.81e-04 | 4168.17 ms | 32.4% bf16 MFU | 125494 tok/s step 2850/19560 | loss 3.737987 (+0.66z)| norm 0.2720 (+0.08z)| lr 7.81e-04 | 4167.22 ms | 32.4% bf16 MFU | 125510 tok/s step 2851/19560 | loss 3.709302 (-0.01z)| norm 0.2859 (+0.56z)| lr 7.81e-04 | 4162.79 ms | 32.4% bf16 MFU | 125532 tok/s step 2852/19560 | loss 3.663773 (-1.06z)| norm 0.2520 (-0.64z)| lr 7.81e-04 | 4165.08 ms | 32.4% bf16 MFU | 125549 tok/s step 2853/19560 | loss 3.658123 (-1.18z)| norm 0.2701 (-0.01z)| lr 7.81e-04 | 4163.01 ms | 32.4% bf16 MFU | 125569 tok/s step 2854/19560 | loss 3.776172 (+1.59z)| norm 0.2925 (+0.77z)| lr 7.80e-04 | 4173.68 ms | 32.3% bf16 MFU | 125571 tok/s step 2855/19560 | loss 3.730036 (+0.51z)| norm 0.3286 (+2.00z)| lr 7.80e-04 | 4163.89 ms | 32.4% bf16 MFU | 125588 tok/s step 2856/19560 | loss 3.704711 (-0.08z)| norm 0.3297 (+1.99z)| lr 7.80e-04 | 4160.35 ms | 32.5% bf16 MFU | 125610 tok/s step 2857/19560 | loss 3.676948 (-0.73z)| norm 0.2820 (+0.35z)| lr 7.80e-04 | 4166.02 ms | 32.4% bf16 MFU | 125622 tok/s step 2858/19560 | loss 3.732437 (+0.58z)| norm 0.2789 (+0.24z)| lr 7.80e-04 | 4165.17 ms | 32.4% bf16 MFU | 125634 tok/s step 2859/19560 | loss 3.719269 (+0.27z)| norm 0.2885 (+0.61z)| lr 7.80e-04 | 4173.44 ms | 32.4% bf16 MFU | 125634 tok/s step 2860/19560 | loss 3.696473 (-0.26z)| norm 0.2653 (-0.21z)| lr 7.80e-04 | 4161.16 ms | 32.4% bf16 MFU | 125652 tok/s step 2861/19560 | loss 3.651822 (-1.29z)| norm 0.2357 (-1.23z)| lr 7.80e-04 | 4167.57 ms | 32.4% bf16 MFU | 125659 tok/s step 2862/19560 | loss 3.670963 (-0.84z)| norm 0.2592 (-0.40z)| lr 7.80e-04 | 4166.33 ms | 32.4% bf16 MFU | 125668 tok/s step 2863/19560 | loss 3.732920 (+0.68z)| norm 0.2719 (+0.05z)| lr 7.80e-04 | 4161.35 ms | 32.4% bf16 MFU | 125684 tok/s step 2864/19560 | loss 3.735220 (+0.74z)| norm 0.2813 (+0.38z)| lr 7.80e-04 | 4166.22 ms | 32.4% bf16 MFU | 125692 tok/s step 2865/19560 | loss 
3.715087 (+0.24z)| norm 0.2537 (-0.59z)| lr 7.80e-04 | 4257.39 ms | 31.7% bf16 MFU | 125565 tok/s step 2866/19560 | loss 3.630169 (-1.82z)| norm 0.2795 (+0.32z)| lr 7.80e-04 | 4161.43 ms | 32.4% bf16 MFU | 125586 tok/s step 2867/19560 | loss 3.721983 (+0.44z)| norm 0.2784 (+0.27z)| lr 7.80e-04 | 4171.72 ms | 32.4% bf16 MFU | 125591 tok/s step 2868/19560 | loss 3.705781 (+0.04z)| norm 0.2810 (+0.36z)| lr 7.80e-04 | 4231.35 ms | 31.9% bf16 MFU | 125507 tok/s step 2869/19560 | loss 3.699503 (-0.11z)| norm 0.2509 (-0.72z)| lr 7.80e-04 | 4165.21 ms | 32.4% bf16 MFU | 125525 tok/s step 2870/19560 | loss 3.698727 (-0.13z)| norm 0.2642 (-0.25z)| lr 7.80e-04 | 4214.17 ms | 32.0% bf16 MFU | 125469 tok/s step 2871/19560 | loss 3.683706 (-0.49z)| norm 0.2444 (-0.96z)| lr 7.80e-04 | 4169.01 ms | 32.4% bf16 MFU | 125484 tok/s step 2872/19560 | loss 3.691384 (-0.30z)| norm 0.2585 (-0.46z)| lr 7.80e-04 | 4164.42 ms | 32.4% bf16 MFU | 125504 tok/s step 2873/19560 | loss 3.660687 (-1.04z)| norm 0.2416 (-1.05z)| lr 7.80e-04 | 4165.53 ms | 32.4% bf16 MFU | 125522 tok/s step 2874/19560 | loss 3.673018 (-0.72z)| norm 0.2533 (-0.63z)| lr 7.80e-04 | 4176.42 ms | 32.3% bf16 MFU | 125523 tok/s step 2875/19560 | loss 3.658663 (-1.07z)| norm 0.2575 (-0.49z)| lr 7.80e-04 | 4170.80 ms | 32.4% bf16 MFU | 125532 tok/s step 2876/19560 | loss 3.639397 (-1.52z)| norm 0.2400 (-1.11z)| lr 7.80e-04 | 4168.54 ms | 32.4% bf16 MFU | 125544 tok/s step 2877/19560 | loss 3.656711 (-1.08z)| norm 0.2585 (-0.45z)| lr 7.80e-04 | 4168.12 ms | 32.4% bf16 MFU | 125556 tok/s step 2878/19560 | loss 3.731476 (+0.76z)| norm 0.2657 (-0.20z)| lr 7.80e-04 | 4162.79 ms | 32.4% bf16 MFU | 125576 tok/s step 2879/19560 | loss 3.618384 (-1.98z)| norm 0.2952 (+0.86z)| lr 7.80e-04 | 4170.40 ms | 32.4% bf16 MFU | 125583 tok/s step 2880/19560 | loss 3.669376 (-0.72z)| norm 0.3156 (+1.57z)| lr 7.80e-04 | 4162.90 ms | 32.4% bf16 MFU | 125601 tok/s step 2881/19560 | loss 3.671130 (-0.67z)| norm 0.3151 (+1.53z)| lr 7.80e-04 | 4178.36 
ms | 32.3% bf16 MFU | 125594 tok/s step 2882/19560 | loss 3.618491 (-1.91z)| norm 0.2719 (+0.01z)| lr 7.80e-04 | 4170.10 ms | 32.4% bf16 MFU | 125601 tok/s step 2883/19560 | loss 3.751703 (+1.29z)| norm 0.2538 (-0.62z)| lr 7.80e-04 | 4657.30 ms | 29.0% bf16 MFU | 124950 tok/s step 2884/19560 | loss 3.666431 (-0.74z)| norm 0.2509 (-0.72z)| lr 7.80e-04 | 4188.34 ms | 32.2% bf16 MFU | 124961 tok/s step 2885/19560 | loss 3.701127 (+0.10z)| norm 0.2525 (-0.65z)| lr 7.80e-04 | 4163.47 ms | 32.4% bf16 MFU | 125009 tok/s step 2886/19560 | loss 3.718412 (+0.52z)| norm 0.2744 (+0.12z)| lr 7.80e-04 | 4166.58 ms | 32.4% bf16 MFU | 125050 tok/s step 2887/19560 | loss 3.704986 (+0.18z)| norm 0.2371 (-1.17z)| lr 7.80e-04 | 4160.82 ms | 32.4% bf16 MFU | 125098 tok/s step 2888/19560 | loss 3.637315 (-1.44z)| norm 0.2477 (-0.79z)| lr 7.80e-04 | 4166.66 ms | 32.4% bf16 MFU | 125135 tok/s step 2889/19560 | loss 3.687880 (-0.21z)| norm 0.2593 (-0.38z)| lr 7.80e-04 | 4160.61 ms | 32.5% bf16 MFU | 125179 tok/s step 2890/19560 | loss 3.687972 (-0.20z)| norm 0.2695 (-0.03z)| lr 7.80e-04 | 4158.64 ms | 32.5% bf16 MFU | 125223 tok/s step 2891/19560 | loss 3.654387 (-1.00z)| norm 0.2794 (+0.32z)| lr 7.80e-04 | 4170.61 ms | 32.4% bf16 MFU | 125248 tok/s step 2892/19560 | loss 3.665320 (-0.72z)| norm 0.2606 (-0.35z)| lr 7.80e-04 | 4162.94 ms | 32.4% bf16 MFU | 125282 tok/s step 2893/19560 | loss 3.633109 (-1.49z)| norm 0.2382 (-1.15z)| lr 7.80e-04 | 4155.19 ms | 32.5% bf16 MFU | 125327 tok/s step 2894/19560 | loss 3.631747 (-1.49z)| norm 0.2469 (-0.84z)| lr 7.80e-04 | 4170.01 ms | 32.4% bf16 MFU | 125347 tok/s step 2895/19560 | loss 3.658257 (-0.84z)| norm 0.2671 (-0.12z)| lr 7.80e-04 | 4162.41 ms | 32.4% bf16 MFU | 125378 tok/s step 2896/19560 | loss 3.741319 (+1.17z)| norm 0.2586 (-0.41z)| lr 7.80e-04 | 4165.20 ms | 32.4% bf16 MFU | 125402 tok/s step 2897/19560 | loss 3.686513 (-0.14z)| norm 0.2360 (-1.19z)| lr 7.80e-04 | 4160.54 ms | 32.5% bf16 MFU | 125433 tok/s step 2898/19560 | loss 
3.768822 (+1.84z)| norm 0.2712 (+0.04z)| lr 7.80e-04 | 4168.15 ms | 32.4% bf16 MFU | 125451 tok/s step 2899/19560 | loss 3.726887 (+0.89z)| norm 0.2917 (+0.75z)| lr 7.80e-04 | 4158.74 ms | 32.5% bf16 MFU | 125481 tok/s step 2900/19560 | loss 3.676445 (-0.39z)| norm 0.2919 (+0.75z)| lr 7.80e-04 | 4217.34 ms | 32.0% bf16 MFU | 125423 tok/s step 2901/19560 | loss 3.719961 (+0.74z)| norm 0.3275 (+2.00z)| lr 7.79e-04 | 4167.68 ms | 32.4% bf16 MFU | 125442 tok/s step 2902/19560 | loss 3.697900 (+0.17z)| norm 0.3251 (+1.99z)| lr 7.79e-04 | 4260.39 ms | 31.7% bf16 MFU | 125323 tok/s step 2903/19560 | loss 3.650869 (-1.03z)| norm 0.2852 (+0.57z)| lr 7.79e-04 | 4287.81 ms | 31.5% bf16 MFU | 125171 tok/s step 2904/19560 | loss 3.647022 (-1.12z)| norm 0.2634 (-0.23z)| lr 7.79e-04 | 4166.05 ms | 32.4% bf16 MFU | 125204 tok/s step 2905/19560 | loss 3.684853 (-0.14z)| norm 0.2662 (-0.11z)| lr 7.79e-04 | 4162.56 ms | 32.4% bf16 MFU | 125242 tok/s step 2906/19560 | loss 3.706137 (+0.41z)| norm 0.3181 (+1.87z)| lr 7.79e-04 | 4171.67 ms | 32.4% bf16 MFU | 125264 tok/s step 2907/19560 | loss 3.631223 (-1.51z)| norm 0.2968 (+1.05z)| lr 7.79e-04 | 4158.24 ms | 32.5% bf16 MFU | 125305 tok/s step 2908/19560 | loss 3.662701 (-0.69z)| norm 0.2733 (+0.16z)| lr 7.79e-04 | 4166.79 ms | 32.4% bf16 MFU | 125331 tok/s step 2909/19560 | loss 3.711918 (+0.57z)| norm 0.2443 (-0.93z)| lr 7.79e-04 | 4167.61 ms | 32.4% bf16 MFU | 125354 tok/s step 2910/19560 | loss 3.735252 (+1.15z)| norm 0.2553 (-0.51z)| lr 7.79e-04 | 4182.07 ms | 32.3% bf16 MFU | 125355 tok/s step 2911/19560 | loss 3.675387 (-0.38z)| norm 0.2424 (-0.99z)| lr 7.79e-04 | 4162.12 ms | 32.4% bf16 MFU | 125385 tok/s step 2912/19560 | loss 3.739201 (+1.24z)| norm 0.2397 (-1.09z)| lr 7.79e-04 | 4171.16 ms | 32.4% bf16 MFU | 125401 tok/s step 2913/19560 | loss 3.701731 (+0.28z)| norm 0.2466 (-0.83z)| lr 7.79e-04 | 4162.38 ms | 32.4% bf16 MFU | 125429 tok/s step 2914/19560 | loss 3.687522 (-0.08z)| norm 0.2695 (+0.03z)| lr 7.79e-04 | 4169.53 
ms | 32.4% bf16 MFU | 125444 tok/s step 2915/19560 | loss 3.681924 (-0.23z)| norm 0.2981 (+1.08z)| lr 7.79e-04 | 4200.72 ms | 32.1% bf16 MFU | 125413 tok/s step 2916/19560 | loss 3.639093 (-1.32z)| norm 0.3109 (+1.54z)| lr 7.79e-04 | 4220.35 ms | 32.0% bf16 MFU | 125353 tok/s step 2917/19560 | loss 3.632069 (-1.48z)| norm 0.2495 (-0.73z)| lr 7.79e-04 | 4159.39 ms | 32.5% bf16 MFU | 125388 tok/s step 2918/19560 | loss 3.637697 (-1.32z)| norm 0.2374 (-1.19z)| lr 7.79e-04 | 4211.54 ms | 32.1% bf16 MFU | 125343 tok/s step 2919/19560 | loss 3.686543 (-0.09z)| norm 0.2527 (-0.63z)| lr 7.79e-04 | 4158.87 ms | 32.5% bf16 MFU | 125379 tok/s step 2920/19560 | loss 3.713049 (+0.57z)| norm 0.2564 (-0.49z)| lr 7.79e-04 | 4156.43 ms | 32.5% bf16 MFU | 125417 tok/s step 2921/19560 | loss 3.698972 (+0.22z)| norm 0.2334 (-1.35z)| lr 7.79e-04 | 4165.49 ms | 32.4% bf16 MFU | 125440 tok/s step 2922/19560 | loss 3.710589 (+0.52z)| norm 0.2197 (-1.82z)| lr 7.79e-04 | 4155.47 ms | 32.5% bf16 MFU | 125476 tok/s step 2923/19560 | loss 3.666650 (-0.58z)| norm 0.2479 (-0.78z)| lr 7.79e-04 | 4164.45 ms | 32.4% bf16 MFU | 125497 tok/s step 2924/19560 | loss 3.689809 (+0.01z)| norm 0.2717 (+0.09z)| lr 7.79e-04 | 4220.99 ms | 32.0% bf16 MFU | 125433 tok/s step 2925/19560 | loss 3.684409 (-0.14z)| norm 0.2994 (+1.10z)| lr 7.79e-04 | 4171.12 ms | 32.4% bf16 MFU | 125446 tok/s step 2926/19560 | loss 3.649928 (-1.03z)| norm 0.2739 (+0.16z)| lr 7.79e-04 | 4163.95 ms | 32.4% bf16 MFU | 125469 tok/s step 2927/19560 | loss 3.733375 (+1.10z)| norm 0.2453 (-0.88z)| lr 7.79e-04 | 4160.91 ms | 32.4% bf16 MFU | 125496 tok/s step 2928/19560 | loss 3.701447 (+0.29z)| norm 0.2757 (+0.28z)| lr 7.79e-04 | 4164.31 ms | 32.4% bf16 MFU | 125516 tok/s step 2929/19560 | loss 3.694151 (+0.12z)| norm 0.2849 (+0.71z)| lr 7.79e-04 | 4231.71 ms | 31.9% bf16 MFU | 125435 tok/s step 2930/19560 | loss 3.632899 (-1.49z)| norm 0.2698 (+0.09z)| lr 7.79e-04 | 4163.65 ms | 32.4% bf16 MFU | 125459 tok/s step 2931/19560 | loss 
3.652453 (-0.97z)| norm 0.2949 (+1.15z)| lr 7.79e-04 | 4175.48 ms | 32.3% bf16 MFU | 125464 tok/s step 2932/19560 | loss 3.641921 (-1.23z)| norm 0.2448 (-0.94z)| lr 7.79e-04 | 4156.12 ms | 32.5% bf16 MFU | 125499 tok/s step 2933/19560 | loss 3.645125 (-1.13z)| norm 0.2620 (-0.23z)| lr 7.79e-04 | 4158.99 ms | 32.5% bf16 MFU | 125527 tok/s step 2934/19560 | loss 3.617891 (-1.83z)| norm 0.2967 (+1.21z)| lr 7.79e-04 | 4175.40 ms | 32.3% bf16 MFU | 125529 tok/s step 2935/19560 | loss 3.629216 (-1.51z)| norm 0.3107 (+1.79z)| lr 7.79e-04 | 4187.67 ms | 32.2% bf16 MFU | 125512 tok/s step 2936/19560 | loss 3.695253 (+0.25z)| norm 0.2883 (+0.86z)| lr 7.79e-04 | 4498.12 ms | 30.0% bf16 MFU | 125064 tok/s step 2937/19560 | loss 3.659780 (-0.70z)| norm 0.2879 (+0.84z)| lr 7.79e-04 | 4165.77 ms | 32.4% bf16 MFU | 125104 tok/s step 2938/19560 | loss 3.727730 (+1.10z)| norm 0.3148 (+1.91z)| lr 7.79e-04 | 4173.93 ms | 32.3% bf16 MFU | 125129 tok/s step 2939/19560 | loss 3.704411 (+0.47z)| norm 0.3146 (+1.86z)| lr 7.79e-04 | 4156.03 ms | 32.5% bf16 MFU | 125180 tok/s step 2940/19560 | loss 3.704559 (+0.47z)| norm 0.2922 (+0.94z)| lr 7.79e-04 | 4234.45 ms | 31.9% bf16 MFU | 125112 tok/s step 2941/19560 | loss 3.654620 (-0.86z)| norm 0.3101 (+1.63z)| lr 7.79e-04 | 4168.98 ms | 32.4% bf16 MFU | 125144 tok/s step 2942/19560 | loss 3.803002 (+2.95z)| norm 0.2872 (+0.70z)| lr 7.79e-04 | 4392.23 ms | 30.7% bf16 MFU | 124856 tok/s step 2943/19560 | loss 3.686318 (-0.05z)| norm 0.2520 (-0.72z)| lr 7.79e-04 | 4170.77 ms | 32.4% bf16 MFU | 124898 tok/s step 2944/19560 | loss 3.668437 (-0.51z)| norm 0.2962 (+1.05z)| lr 7.79e-04 | 4261.27 ms | 31.7% bf16 MFU | 124805 tok/s step 2945/19560 | loss 3.659017 (-0.74z)| norm 0.2991 (+1.15z)| lr 7.79e-04 | 4156.20 ms | 32.5% bf16 MFU | 124872 tok/s step 2946/19560 | loss 3.652191 (-0.91z)| norm 0.2997 (+1.17z)| lr 7.79e-04 | 4155.29 ms | 32.5% bf16 MFU | 124937 tok/s step 2947/19560 | loss 3.705745 (+0.45z)| norm 0.2985 (+1.11z)| lr 7.78e-04 | 4224.22 
ms | 32.0% bf16 MFU | 124896 tok/s step 2948/19560 | loss 3.683828 (-0.12z)| norm 0.2706 (-0.01z)| lr 7.78e-04 | 4187.51 ms | 32.2% bf16 MFU | 124911 tok/s step 2949/19560 | loss 3.694114 (+0.16z)| norm 0.2638 (-0.28z)| lr 7.78e-04 | 4198.05 ms | 32.2% bf16 MFU | 124910 tok/s step 2950/19560 | loss 3.689312 (+0.03z)| norm 0.2997 (+1.14z)| lr 7.78e-04 | 4157.81 ms | 32.5% bf16 MFU | 124970 tok/s step 2951/19560 | loss 3.730637 (+1.09z)| norm 0.2453 (-1.04z)| lr 7.78e-04 | 4157.39 ms | 32.5% bf16 MFU | 125027 tok/s step 2952/19560 | loss 3.707713 (+0.49z)| norm 0.2551 (-0.65z)| lr 7.78e-04 | 4170.33 ms | 32.4% bf16 MFU | 125061 tok/s step 2953/19560 | loss 3.632905 (-1.42z)| norm 0.2581 (-0.53z)| lr 7.78e-04 | 4163.07 ms | 32.4% bf16 MFU | 125105 tok/s step 2954/19560 | loss 3.694098 (+0.15z)| norm 0.2673 (-0.15z)| lr 7.78e-04 | 4160.06 ms | 32.5% bf16 MFU | 125151 tok/s step 2955/19560 | loss 3.750532 (+1.62z)| norm 0.2568 (-0.56z)| lr 7.78e-04 | 4185.15 ms | 32.3% bf16 MFU | 125157 tok/s step 2956/19560 | loss 3.633160 (-1.41z)| norm 0.2869 (+0.66z)| lr 7.78e-04 | 4164.56 ms | 32.4% bf16 MFU | 125194 tok/s step 2957/19560 | loss 3.626323 (-1.57z)| norm 0.2807 (+0.40z)| lr 7.78e-04 | 4162.57 ms | 32.4% bf16 MFU | 125232 tok/s step 2958/19560 | loss 3.681814 (-0.15z)| norm 0.2530 (-0.75z)| lr 7.78e-04 | 4161.29 ms | 32.4% bf16 MFU | 125270 tok/s step 2959/19560 | loss 3.636598 (-1.30z)| norm 0.2587 (-0.53z)| lr 7.78e-04 | 4160.23 ms | 32.5% bf16 MFU | 125308 tok/s step 2960/19560 | loss 3.689121 (+0.05z)| norm 0.2838 (+0.51z)| lr 7.78e-04 | 4219.60 ms | 32.0% bf16 MFU | 125255 tok/s step 2961/19560 | loss 3.767431 (+2.02z)| norm 0.2827 (+0.46z)| lr 7.78e-04 | 4154.39 ms | 32.5% bf16 MFU | 125302 tok/s step 2962/19560 | loss 3.636814 (-1.27z)| norm 0.2894 (+0.73z)| lr 7.78e-04 | 4154.36 ms | 32.5% bf16 MFU | 125347 tok/s step 2963/19560 | loss 3.694135 (+0.20z)| norm 0.3033 (+1.29z)| lr 7.78e-04 | 4154.88 ms | 32.5% bf16 MFU | 125389 tok/s step 2964/19560 | loss 
3.674838 (-0.29z)| norm 0.2628 (-0.41z)| lr 7.78e-04 | 4163.37 ms | 32.4% bf16 MFU | 125416 tok/s step 2965/19560 | loss 3.654314 (-0.80z)| norm 0.2548 (-0.74z)| lr 7.78e-04 | 4161.74 ms | 32.4% bf16 MFU | 125444 tok/s step 2966/19560 | loss 3.669550 (-0.40z)| norm 0.2755 (+0.14z)| lr 7.78e-04 | 4844.43 ms | 27.9% bf16 MFU | 124583 tok/s step 2967/19560 | loss 3.667304 (-0.45z)| norm 0.2896 (+0.75z)| lr 7.78e-04 | 4856.23 ms | 27.8% bf16 MFU | 123752 tok/s step 2968/19560 | loss 3.728522 (+1.19z)| norm 0.2903 (+0.78z)| lr 7.78e-04 | 4781.92 ms | 28.2% bf16 MFU | 123046 tok/s step 2969/19560 | loss 3.758006 (+1.93z)| norm 0.2487 (-0.99z)| lr 7.78e-04 | 4580.20 ms | 29.5% bf16 MFU | 122618 tok/s step 2970/19560 | loss 3.742519 (+1.50z)| norm 0.2249 (-1.95z)| lr 7.78e-04 | 4565.26 ms | 29.6% bf16 MFU | 122229 tok/s step 2971/19560 | loss 3.715341 (+0.77z)| norm 0.2469 (-1.02z)| lr 7.78e-04 | 4406.16 ms | 30.6% bf16 MFU | 122067 tok/s step 2972/19560 | loss 3.800052 (+2.88z)| norm 0.2652 (-0.26z)| lr 7.78e-04 | 4199.99 ms | 32.1% bf16 MFU | 122205 tok/s step 2973/19560 | loss 3.680206 (-0.16z)| norm 0.2745 (+0.12z)| lr 7.78e-04 | 4242.14 ms | 31.8% bf16 MFU | 122274 tok/s step 2974/19560 | loss 3.680282 (-0.15z)| norm 0.2724 (+0.03z)| lr 7.78e-04 | 4268.81 ms | 31.6% bf16 MFU | 122302 tok/s step 2975/19560 | loss 3.730886 (+1.12z)| norm 0.2594 (-0.52z)| lr 7.78e-04 | 4191.27 ms | 32.2% bf16 MFU | 122441 tok/s step 2976/19560 | loss 3.629256 (-1.47z)| norm 0.3148 (+1.77z)| lr 7.78e-04 | 4151.88 ms | 32.5% bf16 MFU | 122633 tok/s step 2977/19560 | loss 3.668368 (-0.47z)| norm 0.3025 (+1.25z)| lr 7.78e-04 | 4222.31 ms | 32.0% bf16 MFU | 122710 tok/s step 2978/19560 | loss 3.789194 (+2.55z)| norm 0.2450 (-1.12z)| lr 7.78e-04 | 4245.31 ms | 31.8% bf16 MFU | 122749 tok/s step 2979/19560 | loss 3.671132 (-0.39z)| norm 0.2667 (-0.22z)| lr 7.78e-04 | 4252.46 ms | 31.8% bf16 MFU | 122776 tok/s step 2980/19560 | loss 3.687276 (+0.01z)| norm 0.2552 (-0.70z)| lr 7.78e-04 | 4177.01 
ms | 32.3% bf16 MFU | 122913 tok/s step 2981/19560 | loss 3.667437 (-0.49z)| norm 0.2291 (-1.75z)| lr 7.78e-04 | 4175.64 ms | 32.3% bf16 MFU | 123046 tok/s step 2982/19560 | loss 3.718399 (+0.81z)| norm 0.2470 (-1.00z)| lr 7.78e-04 | 4175.17 ms | 32.3% bf16 MFU | 123172 tok/s step 2983/19560 | loss 3.763662 (+1.94z)| norm 0.2423 (-1.18z)| lr 7.78e-04 | 4183.00 ms | 32.3% bf16 MFU | 123280 tok/s step 2984/19560 | loss 3.707066 (+0.51z)| norm 0.2376 (-1.37z)| lr 7.78e-04 | 4178.63 ms | 32.3% bf16 MFU | 123390 tok/s step 2985/19560 | loss 3.667342 (-0.49z)| norm 0.2403 (-1.23z)| lr 7.78e-04 | 4200.66 ms | 32.1% bf16 MFU | 123461 tok/s step 2986/19560 | loss 3.738668 (+1.30z)| norm 0.2582 (-0.48z)| lr 7.78e-04 | 4175.79 ms | 32.3% bf16 MFU | 123565 tok/s step 2987/19560 | loss 3.716290 (+0.74z)| norm 0.2506 (-0.78z)| lr 7.78e-04 | 4174.36 ms | 32.3% bf16 MFU | 123667 tok/s step 2988/19560 | loss 3.692590 (+0.15z)| norm 0.2638 (-0.23z)| lr 7.78e-04 | 4171.46 ms | 32.4% bf16 MFU | 123768 tok/s step 2989/19560 | loss 3.710162 (+0.58z)| norm 0.2535 (-0.67z)| lr 7.78e-04 | 4168.77 ms | 32.4% bf16 MFU | 123868 tok/s step 2990/19560 | loss 3.762265 (+1.85z)| norm 0.2369 (-1.35z)| lr 7.78e-04 | 4179.87 ms | 32.3% bf16 MFU | 123946 tok/s step 2991/19560 | loss 3.700117 (+0.31z)| norm 0.2485 (-0.86z)| lr 7.78e-04 | 4276.51 ms | 31.6% bf16 MFU | 123878 tok/s step 2992/19560 | loss 3.724355 (+0.92z)| norm 0.2824 (+0.56z)| lr 7.77e-04 | 4195.73 ms | 32.2% bf16 MFU | 123932 tok/s step 2993/19560 | loss 3.673013 (-0.36z)| norm 0.2977 (+1.18z)| lr 7.77e-04 | 4181.34 ms | 32.3% bf16 MFU | 124005 tok/s step 2994/19560 | loss 3.726662 (+0.97z)| norm 0.2383 (-1.27z)| lr 7.77e-04 | 4224.59 ms | 32.0% bf16 MFU | 124010 tok/s step 2995/19560 | loss 3.726585 (+0.97z)| norm 0.2468 (-0.91z)| lr 7.77e-04 | 4171.88 ms | 32.4% bf16 MFU | 124093 tok/s step 2996/19560 | loss 3.698681 (+0.27z)| norm 0.2898 (+0.86z)| lr 7.77e-04 | 4179.31 ms | 32.3% bf16 MFU | 124161 tok/s step 2997/19560 | loss 
3.723590 (+0.89z)| norm 0.2615 (-0.31z)| lr 7.77e-04 | 4204.86 ms | 32.1% bf16 MFU | 124187 tok/s step 2998/19560 | loss 3.679506 (-0.22z)| norm 0.2623 (-0.27z)| lr 7.77e-04 | 4183.98 ms | 32.3% bf16 MFU | 124243 tok/s step 2999/19560 | loss 3.683033 (-0.13z)| norm 0.2601 (-0.37z)| lr 7.77e-04 | 4251.26 ms | 31.8% bf16 MFU | 124197 tok/s step 3000/19560 | loss 3.776864 (+2.17z)| norm 0.2649 (-0.17z)| lr 7.77e-04 | 4165.39 ms | 32.4% bf16 MFU | 124281 tok/s val loss 3.684744 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating 
HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating 
HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2720/10042 = 0.270862 step 3001/19560 | loss 3.734023 (+1.10z)| norm 0.2623 (-0.29z)| lr 7.77e-04 | 4380.86 ms | 30.8% bf16 MFU | 124051 tok/s step 3002/19560 | loss 3.690328 (+0.02z)| norm 0.2986 (+1.20z)| lr 7.77e-04 | 4338.14 ms | 31.1% bf16 MFU | 123891 tok/s step 3003/19560 | loss 3.698772 (+0.22z)| norm 0.2793 (+0.39z)| lr 7.77e-04 | 4241.73 ms | 31.8% bf16 MFU | 123876 tok/s step 3004/19560 | loss 3.655552 (-0.85z)| norm 0.2693 (-0.03z)| lr 7.77e-04 | 4245.22 ms | 31.8% bf16 MFU | 123858 tok/s step 3005/19560 | loss 3.680599 (-0.24z)| norm 0.2548 (-0.63z)| lr 7.77e-04 | 4162.21 ms | 32.4% bf16 MFU | 123963 tok/s step 3006/19560 | loss 3.720511 (+0.76z)| norm 0.2539 (-0.67z)| lr 7.77e-04 | 4262.51 ms | 31.7% bf16 MFU | 123915 tok/s step 3007/19560 | loss 3.630556 (-1.49z)| norm 0.2878 (+0.75z)| lr 7.77e-04 | 4242.46 ms | 31.8% bf16 MFU | 123898 tok/s step 3008/19560 | loss 3.662037 (-0.70z)| norm 0.2369 (-1.36z)| lr 7.77e-04 | 4161.47 ms | 32.4% bf16 MFU | 124003 tok/s step 3009/19560 | loss 3.680789 (-0.23z)| norm 0.2319 (-1.55z)| lr 7.77e-04 | 4172.45 ms | 32.4% bf16 MFU | 124085 tok/s step 3010/19560 | loss 3.698726 (+0.20z)| norm 0.2582 (-0.43z)| lr 7.77e-04 | 4179.08 ms | 32.3% bf16 MFU | 124154 tok/s step 3011/19560 | loss 3.690701 (+0.01z)| norm 0.2706 (+0.08z)| lr 7.77e-04 | 4178.28 ms | 32.3% bf16 MFU | 124220 tok/s step 3012/19560 | loss 3.768680 (+1.95z)| norm 0.2602 (-0.36z)| lr 7.77e-04 | 4241.78 ms | 31.8% bf16 MFU | 124189 tok/s step 3013/19560 | loss 3.629525 (-1.52z)| norm 0.2989 (+1.26z)| lr 7.77e-04 | 4205.28 ms | 32.1% bf16 MFU | 
124213 tok/s step 3014/19560 | loss 3.760413 (+1.72z)| norm 0.3315 (+2.55z)| lr 7.77e-04 | 4167.14 ms | 32.4% bf16 MFU | 124293 tok/s step 3015/19560 | loss 3.730700 (+0.98z)| norm 0.3271 (+2.31z)| lr 7.77e-04 | 4212.55 ms | 32.1% bf16 MFU | 124302 tok/s step 3016/19560 | loss 3.686796 (-0.11z)| norm 0.3148 (+1.77z)| lr 7.77e-04 | 4241.99 ms | 31.8% bf16 MFU | 124266 tok/s step 3017/19560 | loss 3.702537 (+0.27z)| norm 0.3026 (+1.26z)| lr 7.77e-04 | 4302.80 ms | 31.4% bf16 MFU | 124145 tok/s step 3018/19560 | loss 3.713804 (+0.55z)| norm 0.2676 (-0.14z)| lr 7.77e-04 | 4286.51 ms | 31.5% bf16 MFU | 124054 tok/s step 3019/19560 | loss 3.781331 (+2.16z)| norm 0.2746 (+0.14z)| lr 7.77e-04 | 4171.86 ms | 32.4% bf16 MFU | 124135 tok/s step 3020/19560 | loss 3.678674 (-0.35z)| norm 0.2673 (-0.15z)| lr 7.77e-04 | 4171.40 ms | 32.4% bf16 MFU | 124212 tok/s step 3021/19560 | loss 3.692039 (-0.03z)| norm 0.2715 (+0.01z)| lr 7.77e-04 | 4241.94 ms | 31.8% bf16 MFU | 124181 tok/s step 3022/19560 | loss 3.676664 (-0.42z)| norm 0.2815 (+0.40z)| lr 7.77e-04 | 4235.96 ms | 31.9% bf16 MFU | 124161 tok/s step 3023/19560 | loss 3.720133 (+0.65z)| norm 0.2710 (-0.02z)| lr 7.77e-04 | 4171.12 ms | 32.4% bf16 MFU | 124238 tok/s step 3024/19560 | loss 3.730441 (+0.91z)| norm 0.2553 (-0.66z)| lr 7.77e-04 | 4198.96 ms | 32.2% bf16 MFU | 124269 tok/s step 3025/19560 | loss 3.689149 (-0.12z)| norm 0.2381 (-1.35z)| lr 7.77e-04 | 4203.79 ms | 32.1% bf16 MFU | 124291 tok/s step 3026/19560 | loss 3.701440 (+0.20z)| norm 0.2559 (-0.63z)| lr 7.77e-04 | 4160.90 ms | 32.4% bf16 MFU | 124377 tok/s step 3027/19560 | loss 3.671760 (-0.54z)| norm 0.2502 (-0.84z)| lr 7.77e-04 | 4189.39 ms | 32.2% bf16 MFU | 124415 tok/s step 3028/19560 | loss 3.735098 (+1.05z)| norm 0.2527 (-0.73z)| lr 7.77e-04 | 4166.66 ms | 32.4% bf16 MFU | 124486 tok/s step 3029/19560 | loss 3.832505 (+3.33z)| norm 0.2516 (-0.77z)| lr 7.77e-04 | 4171.15 ms | 32.4% bf16 MFU | 124546 tok/s step 3030/19560 | loss 3.739748 (+1.08z)| norm 
0.2400 (-1.23z)| lr 7.77e-04 | 4168.27 ms | 32.4% bf16 MFU | 124608 tok/s step 3031/19560 | loss 3.663688 (-0.75z)| norm 0.2276 (-1.72z)| lr 7.77e-04 | 4185.32 ms | 32.3% bf16 MFU | 124641 tok/s step 3032/19560 | loss 3.731406 (+0.87z)| norm 0.2500 (-0.78z)| lr 7.77e-04 | 4159.85 ms | 32.5% bf16 MFU | 124711 tok/s step 3033/19560 | loss 3.695508 (-0.00z)| norm 0.2373 (-1.29z)| lr 7.77e-04 | 4178.27 ms | 32.3% bf16 MFU | 124749 tok/s step 3034/19560 | loss 3.665919 (-0.71z)| norm 0.2548 (-0.56z)| lr 7.77e-04 | 4297.73 ms | 31.4% bf16 MFU | 124611 tok/s step 3035/19560 | loss 3.750382 (+1.31z)| norm 0.3048 (+1.51z)| lr 7.77e-04 | 4166.75 ms | 32.4% bf16 MFU | 124672 tok/s step 3036/19560 | loss 3.667762 (-0.69z)| norm 0.2866 (+0.75z)| lr 7.77e-04 | 4184.02 ms | 32.3% bf16 MFU | 124704 tok/s step 3037/19560 | loss 3.722797 (+0.64z)| norm 0.3023 (+1.37z)| lr 7.76e-04 | 4163.49 ms | 32.4% bf16 MFU | 124765 tok/s step 3038/19560 | loss 3.704947 (+0.21z)| norm 0.2939 (+1.01z)| lr 7.76e-04 | 4164.82 ms | 32.4% bf16 MFU | 124821 tok/s step 3039/19560 | loss 3.714824 (+0.45z)| norm 0.2773 (+0.32z)| lr 7.76e-04 | 4172.82 ms | 32.4% bf16 MFU | 124862 tok/s step 3040/19560 | loss 3.712144 (+0.39z)| norm 0.2786 (+0.36z)| lr 7.76e-04 | 4413.92 ms | 30.6% bf16 MFU | 124558 tok/s step 3041/19560 | loss 3.716395 (+0.49z)| norm 0.2746 (+0.19z)| lr 7.76e-04 | 4168.94 ms | 32.4% bf16 MFU | 124618 tok/s step 3042/19560 | loss 3.685188 (-0.27z)| norm 0.2924 (+0.92z)| lr 7.76e-04 | 4175.90 ms | 32.3% bf16 MFU | 124665 tok/s step 3043/19560 | loss 3.688024 (-0.20z)| norm 0.2652 (-0.20z)| lr 7.76e-04 | 4179.78 ms | 32.3% bf16 MFU | 124703 tok/s step 3044/19560 | loss 3.686732 (-0.24z)| norm 0.2502 (-0.81z)| lr 7.76e-04 | 4163.84 ms | 32.4% bf16 MFU | 124764 tok/s step 3045/19560 | loss 3.677578 (-0.48z)| norm 0.2403 (-1.22z)| lr 7.76e-04 | 4174.02 ms | 32.3% bf16 MFU | 124806 tok/s step 3046/19560 | loss 3.686202 (-0.28z)| norm 0.2277 (-1.74z)| lr 7.76e-04 | 4162.29 ms | 32.4% bf16 MFU | 
124864 tok/s step 3047/19560 | loss 3.720964 (+0.58z)| norm 0.2441 (-1.05z)| lr 7.76e-04 | 4183.04 ms | 32.3% bf16 MFU | 124887 tok/s step 3048/19560 | loss 3.754091 (+1.38z)| norm 0.3052 (+1.47z)| lr 7.76e-04 | 4210.88 ms | 32.1% bf16 MFU | 124868 tok/s step 3049/19560 | loss 3.685338 (-0.31z)| norm 0.3094 (+1.61z)| lr 7.76e-04 | 4174.18 ms | 32.3% bf16 MFU | 124905 tok/s step 3050/19560 | loss 3.665551 (-0.79z)| norm 0.2813 (+0.44z)| lr 7.76e-04 | 4166.04 ms | 32.4% bf16 MFU | 124952 tok/s step 3051/19560 | loss 3.639592 (-1.42z)| norm 0.2625 (-0.35z)| lr 7.76e-04 | 4166.56 ms | 32.4% bf16 MFU | 124996 tok/s step 3052/19560 | loss 3.745372 (+1.16z)| norm 0.2670 (-0.16z)| lr 7.76e-04 | 4169.49 ms | 32.4% bf16 MFU | 125034 tok/s step 3053/19560 | loss 3.661427 (-0.88z)| norm 0.2634 (-0.30z)| lr 7.76e-04 | 4172.19 ms | 32.4% bf16 MFU | 125065 tok/s step 3054/19560 | loss 3.712795 (+0.36z)| norm 0.2525 (-0.76z)| lr 7.76e-04 | 4217.22 ms | 32.0% bf16 MFU | 125028 tok/s step 3055/19560 | loss 3.758873 (+1.47z)| norm 0.2437 (-1.13z)| lr 7.76e-04 | 4180.60 ms | 32.3% bf16 MFU | 125047 tok/s step 3056/19560 | loss 3.683156 (-0.36z)| norm 0.2496 (-0.86z)| lr 7.76e-04 | 4197.24 ms | 32.2% bf16 MFU | 125040 tok/s step 3057/19560 | loss 3.760071 (+1.48z)| norm 0.2433 (-1.11z)| lr 7.76e-04 | 4167.40 ms | 32.4% bf16 MFU | 125079 tok/s step 3058/19560 | loss 3.725854 (+0.64z)| norm 0.2579 (-0.49z)| lr 7.76e-04 | 4184.42 ms | 32.3% bf16 MFU | 125089 tok/s step 3059/19560 | loss 3.714531 (+0.36z)| norm 0.2578 (-0.49z)| lr 7.76e-04 | 4170.78 ms | 32.4% bf16 MFU | 125120 tok/s step 3060/19560 | loss 3.696876 (-0.09z)| norm 0.2738 (+0.18z)| lr 7.76e-04 | 4178.97 ms | 32.3% bf16 MFU | 125137 tok/s step 3061/19560 | loss 3.693604 (-0.18z)| norm 0.2560 (-0.57z)| lr 7.76e-04 | 4170.09 ms | 32.4% bf16 MFU | 125167 tok/s step 3062/19560 | loss 3.676878 (-0.61z)| norm 0.2771 (+0.33z)| lr 7.76e-04 | 4181.73 ms | 32.3% bf16 MFU | 125177 tok/s step 3063/19560 | loss 3.726821 (+0.63z)| norm 
0.2727 (+0.15z)| lr 7.76e-04 | 4177.81 ms | 32.3% bf16 MFU | 125193 tok/s step 3064/19560 | loss 3.671646 (-0.76z)| norm 0.3184 (+2.07z)| lr 7.76e-04 | 4161.62 ms | 32.4% bf16 MFU | 125232 tok/s step 3065/19560 | loss 3.676319 (-0.65z)| norm 0.2928 (+0.98z)| lr 7.76e-04 | 4160.51 ms | 32.5% bf16 MFU | 125271 tok/s step 3066/19560 | loss 3.768088 (+1.66z)| norm 0.2957 (+1.13z)| lr 7.76e-04 | 4178.08 ms | 32.3% bf16 MFU | 125282 tok/s step 3067/19560 | loss 3.665431 (-0.92z)| norm 0.2716 (+0.11z)| lr 7.76e-04 | 4175.85 ms | 32.3% bf16 MFU | 125296 tok/s step 3068/19560 | loss 3.742431 (+1.01z)| norm 0.3035 (+1.48z)| lr 7.76e-04 | 4255.22 ms | 31.7% bf16 MFU | 125191 tok/s step 3069/19560 | loss 3.738073 (+0.88z)| norm 0.2629 (-0.25z)| lr 7.76e-04 | 4173.01 ms | 32.4% bf16 MFU | 125214 tok/s step 3070/19560 | loss 3.718638 (+0.42z)| norm 0.2624 (-0.26z)| lr 7.76e-04 | 4178.57 ms | 32.3% bf16 MFU | 125227 tok/s step 3071/19560 | loss 3.676241 (-0.66z)| norm 0.2415 (-1.17z)| lr 7.76e-04 | 4178.86 ms | 32.3% bf16 MFU | 125238 tok/s step 3072/19560 | loss 3.792288 (+2.25z)| norm 0.2379 (-1.31z)| lr 7.76e-04 | 4168.92 ms | 32.4% bf16 MFU | 125264 tok/s step 3073/19560 | loss 3.675671 (-0.70z)| norm 0.2592 (-0.37z)| lr 7.76e-04 | 4175.82 ms | 32.3% bf16 MFU | 125279 tok/s step 3074/19560 | loss 3.729169 (+0.65z)| norm 0.2486 (-0.82z)| lr 7.76e-04 | 4173.27 ms | 32.4% bf16 MFU | 125296 tok/s step 3075/19560 | loss 3.728993 (+0.64z)| norm 0.2664 (-0.03z)| lr 7.76e-04 | 4154.06 ms | 32.5% bf16 MFU | 125342 tok/s step 3076/19560 | loss 3.739918 (+0.90z)| norm 0.2726 (+0.25z)| lr 7.76e-04 | 4186.32 ms | 32.3% bf16 MFU | 125337 tok/s step 3077/19560 | loss 3.671088 (-0.84z)| norm 0.2560 (-0.48z)| lr 7.76e-04 | 4175.82 ms | 32.3% bf16 MFU | 125348 tok/s step 3078/19560 | loss 3.677086 (-0.68z)| norm 0.2947 (+1.23z)| lr 7.76e-04 | 4198.27 ms | 32.2% bf16 MFU | 125325 tok/s step 3079/19560 | loss 3.704252 (+0.01z)| norm 0.2896 (+0.99z)| lr 7.76e-04 | 4192.65 ms | 32.2% bf16 MFU | 
125311 tok/s step 3080/19560 | loss 3.703604 (-0.01z)| norm 0.3139 (+2.02z)| lr 7.75e-04 | 4170.23 ms | 32.4% bf16 MFU | 125331 tok/s step 3081/19560 | loss 3.727815 (+0.59z)| norm 0.3032 (+1.52z)| lr 7.75e-04 | 4186.64 ms | 32.2% bf16 MFU | 125326 tok/s step 3082/19560 | loss 3.788798 (+2.10z)| norm 0.2933 (+1.08z)| lr 7.75e-04 | 4167.80 ms | 32.4% bf16 MFU | 125350 tok/s step 3083/19560 | loss 3.729490 (+0.61z)| norm 0.2465 (-0.93z)| lr 7.75e-04 | 4180.97 ms | 32.3% bf16 MFU | 125352 tok/s step 3084/19560 | loss 3.707323 (+0.04z)| norm 0.3077 (+1.68z)| lr 7.75e-04 | 4169.38 ms | 32.4% bf16 MFU | 125372 tok/s step 3085/19560 | loss 3.706160 (-0.01z)| norm 0.2906 (+0.95z)| lr 7.75e-04 | 4188.88 ms | 32.2% bf16 MFU | 125361 tok/s step 3086/19560 | loss 3.670651 (-0.93z)| norm 0.2902 (+0.91z)| lr 7.75e-04 | 4164.72 ms | 32.4% bf16 MFU | 125388 tok/s step 3087/19560 | loss 3.731910 (+0.65z)| norm 0.2678 (-0.04z)| lr 7.75e-04 | 4183.39 ms | 32.3% bf16 MFU | 125385 tok/s step 3088/19560 | loss 3.684652 (-0.59z)| norm 0.2724 (+0.16z)| lr 7.75e-04 | 4200.47 ms | 32.1% bf16 MFU | 125356 tok/s step 3089/19560 | loss 3.722553 (+0.42z)| norm 0.2852 (+0.70z)| lr 7.75e-04 | 4175.71 ms | 32.3% bf16 MFU | 125366 tok/s step 3090/19560 | loss 3.720453 (+0.35z)| norm 0.2577 (-0.46z)| lr 7.75e-04 | 4171.99 ms | 32.4% bf16 MFU | 125381 tok/s step 3091/19560 | loss 3.637390 (-1.84z)| norm 0.2549 (-0.57z)| lr 7.75e-04 | 4167.62 ms | 32.4% bf16 MFU | 125402 tok/s step 3092/19560 | loss 3.706887 (-0.01z)| norm 0.2682 (+0.00z)| lr 7.75e-04 | 4168.31 ms | 32.4% bf16 MFU | 125421 tok/s step 3093/19560 | loss 3.675783 (-0.84z)| norm 0.2425 (-1.09z)| lr 7.75e-04 | 4181.49 ms | 32.3% bf16 MFU | 125419 tok/s step 3094/19560 | loss 3.686800 (-0.55z)| norm 0.2437 (-1.03z)| lr 7.75e-04 | 4192.77 ms | 32.2% bf16 MFU | 125401 tok/s step 3095/19560 | loss 3.656267 (-1.36z)| norm 0.2308 (-1.55z)| lr 7.75e-04 | 4193.21 ms | 32.2% bf16 MFU | 125382 tok/s step 3096/19560 | loss 3.690936 (-0.43z)| norm 
0.2271 (-1.67z)| lr 7.75e-04 | 4186.36 ms | 32.3% bf16 MFU | 125375 tok/s step 3097/19560 | loss 3.739666 (+0.88z)| norm 0.2328 (-1.42z)| lr 7.75e-04 | 4179.40 ms | 32.3% bf16 MFU | 125378 tok/s step 3098/19560 | loss 3.632086 (-1.96z)| norm 0.2441 (-0.96z)| lr 7.75e-04 | 4174.15 ms | 32.3% bf16 MFU | 125390 tok/s step 3099/19560 | loss 3.720780 (+0.39z)| norm 0.2714 (+0.19z)| lr 7.75e-04 | 4175.10 ms | 32.3% bf16 MFU | 125399 tok/s step 3100/19560 | loss 3.730552 (+0.68z)| norm 0.2559 (-0.47z)| lr 7.75e-04 | 4171.81 ms | 32.4% bf16 MFU | 125413 tok/s step 3101/19560 | loss 3.694899 (-0.29z)| norm 0.2319 (-1.46z)| lr 7.75e-04 | 4171.98 ms | 32.4% bf16 MFU | 125426 tok/s step 3102/19560 | loss 3.685603 (-0.55z)| norm 0.2523 (-0.59z)| lr 7.75e-04 | 4169.06 ms | 32.4% bf16 MFU | 125442 tok/s step 3103/19560 | loss 3.675874 (-0.80z)| norm 0.2373 (-1.21z)| lr 7.75e-04 | 4281.59 ms | 31.5% bf16 MFU | 125293 tok/s step 3104/19560 | loss 3.680655 (-0.69z)| norm 0.2502 (-0.66z)| lr 7.75e-04 | 4157.97 ms | 32.5% bf16 MFU | 125333 tok/s step 3105/19560 | loss 3.703636 (-0.07z)| norm 0.2629 (-0.11z)| lr 7.75e-04 | 4173.73 ms | 32.3% bf16 MFU | 125347 tok/s step 3106/19560 | loss 3.701999 (-0.09z)| norm 0.2815 (+0.67z)| lr 7.75e-04 | 4312.36 ms | 31.3% bf16 MFU | 125158 tok/s step 3107/19560 | loss 3.688052 (-0.49z)| norm 0.2418 (-1.01z)| lr 7.75e-04 | 4241.32 ms | 31.8% bf16 MFU | 125081 tok/s step 3108/19560 | loss 3.692768 (-0.36z)| norm 0.2536 (-0.51z)| lr 7.75e-04 | 4185.63 ms | 32.3% bf16 MFU | 125090 tok/s step 3109/19560 | loss 3.675471 (-0.86z)| norm 0.2476 (-0.78z)| lr 7.75e-04 | 4178.39 ms | 32.3% bf16 MFU | 125109 tok/s step 3110/19560 | loss 3.723277 (+0.50z)| norm 0.2611 (-0.20z)| lr 7.75e-04 | 4181.63 ms | 32.3% bf16 MFU | 125123 tok/s step 3111/19560 | loss 3.697532 (-0.22z)| norm 0.2696 (+0.16z)| lr 7.75e-04 | 4179.66 ms | 32.3% bf16 MFU | 125139 tok/s step 3112/19560 | loss 3.676176 (-0.82z)| norm 0.2687 (+0.10z)| lr 7.75e-04 | 4165.75 ms | 32.4% bf16 MFU | 
125174 tok/s step 3113/19560 | loss 3.677109 (-0.80z)| norm 0.2734 (+0.30z)| lr 7.75e-04 | 4180.04 ms | 32.3% bf16 MFU | 125187 tok/s step 3114/19560 | loss 3.669769 (-0.99z)| norm 0.2614 (-0.22z)| lr 7.75e-04 | 4180.09 ms | 32.3% bf16 MFU | 125199 tok/s step 3115/19560 | loss 3.752155 (+1.36z)| norm 0.2449 (-0.94z)| lr 7.75e-04 | 4219.26 ms | 32.0% bf16 MFU | 125152 tok/s step 3116/19560 | loss 3.715584 (+0.31z)| norm 0.2227 (-1.87z)| lr 7.75e-04 | 4168.48 ms | 32.4% bf16 MFU | 125183 tok/s step 3117/19560 | loss 3.722167 (+0.49z)| norm 0.2184 (-2.01z)| lr 7.75e-04 | 4178.92 ms | 32.3% bf16 MFU | 125197 tok/s step 3118/19560 | loss 3.678591 (-0.74z)| norm 0.2114 (-2.27z)| lr 7.75e-04 | 4161.91 ms | 32.4% bf16 MFU | 125236 tok/s step 3119/19560 | loss 3.682144 (-0.63z)| norm 0.2348 (-1.28z)| lr 7.75e-04 | 4222.16 ms | 32.0% bf16 MFU | 125183 tok/s step 3120/19560 | loss 3.666905 (-1.05z)| norm 0.2561 (-0.39z)| lr 7.75e-04 | 4178.87 ms | 32.3% bf16 MFU | 125197 tok/s step 3121/19560 | loss 3.672113 (-0.90z)| norm 0.2455 (-0.81z)| lr 7.75e-04 | 4172.01 ms | 32.4% bf16 MFU | 125220 tok/s step 3122/19560 | loss 3.719898 (+0.47z)| norm 0.2535 (-0.49z)| lr 7.74e-04 | 4203.13 ms | 32.1% bf16 MFU | 125196 tok/s step 3123/19560 | loss 3.765339 (+1.74z)| norm 0.3095 (+1.82z)| lr 7.74e-04 | 4434.15 ms | 30.4% bf16 MFU | 124848 tok/s step 3124/19560 | loss 3.756974 (+1.48z)| norm 0.3351 (+2.79z)| lr 7.74e-04 | 4205.89 ms | 32.1% bf16 MFU | 124839 tok/s step 3125/19560 | loss 3.684969 (-0.54z)| norm 0.3465 (+3.10z)| lr 7.74e-04 | 4174.30 ms | 32.3% bf16 MFU | 124877 tok/s step 3126/19560 | loss 3.696297 (-0.22z)| norm 0.3549 (+3.25z)| lr 7.74e-04 | 4186.96 ms | 32.2% bf16 MFU | 124894 tok/s step 3127/19560 | loss 3.687926 (-0.46z)| norm 0.3331 (+2.37z)| lr 7.74e-04 | 4291.44 ms | 31.5% bf16 MFU | 124758 tok/s step 3128/19560 | loss 3.697472 (-0.18z)| norm 0.2944 (+0.95z)| lr 7.74e-04 | 4165.23 ms | 32.4% bf16 MFU | 124813 tok/s step 3129/19560 | loss 3.701478 (-0.05z)| norm 
0.3023 (+1.22z)| lr 7.74e-04 | 4173.83 ms | 32.3% bf16 MFU | 124853 tok/s step 3130/19560 | loss 3.677056 (-0.75z)| norm 0.2415 (-0.96z)| lr 7.74e-04 | 4175.42 ms | 32.3% bf16 MFU | 124889 tok/s step 3131/19560 | loss 3.683163 (-0.57z)| norm 0.2464 (-0.77z)| lr 7.74e-04 | 4163.46 ms | 32.4% bf16 MFU | 124941 tok/s step 3132/19560 | loss 3.666093 (-1.07z)| norm 0.2495 (-0.65z)| lr 7.74e-04 | 4168.19 ms | 32.4% bf16 MFU | 124983 tok/s step 3133/19560 | loss 3.688245 (-0.43z)| norm 0.2467 (-0.75z)| lr 7.74e-04 | 4175.48 ms | 32.3% bf16 MFU | 125012 tok/s step 3134/19560 | loss 3.629543 (-2.06z)| norm 0.2472 (-0.73z)| lr 7.74e-04 | 4165.29 ms | 32.4% bf16 MFU | 125055 tok/s step 3135/19560 | loss 3.672423 (-0.87z)| norm 0.2622 (-0.18z)| lr 7.74e-04 | 4187.37 ms | 32.2% bf16 MFU | 125062 tok/s step 3136/19560 | loss 3.688782 (-0.41z)| norm 0.2493 (-0.65z)| lr 7.74e-04 | 4215.38 ms | 32.0% bf16 MFU | 125028 tok/s step 3137/19560 | loss 3.649043 (-1.54z)| norm 0.2418 (-0.93z)| lr 7.74e-04 | 4239.99 ms | 31.8% bf16 MFU | 124959 tok/s step 3138/19560 | loss 3.684978 (-0.51z)| norm 0.2597 (-0.28z)| lr 7.74e-04 | 4160.17 ms | 32.5% bf16 MFU | 125013 tok/s step 3139/19560 | loss 3.684629 (-0.52z)| norm 0.2807 (+0.47z)| lr 7.74e-04 | 11658.41 ms | 11.6% bf16 MFU | 121011 tok/s step 3140/19560 | loss 3.691312 (-0.31z)| norm 0.2725 (+0.18z)| lr 7.74e-04 | 4327.15 ms | 31.2% bf16 MFU | 121018 tok/s step 3141/19560 | loss 3.672904 (-0.87z)| norm 0.3069 (+1.41z)| lr 7.74e-04 | 4168.93 ms | 32.4% bf16 MFU | 121255 tok/s step 3142/19560 | loss 3.662159 (-1.17z)| norm 0.2859 (+0.68z)| lr 7.74e-04 | 4184.94 ms | 32.3% bf16 MFU | 121457 tok/s step 3143/19560 | loss 3.679276 (-0.65z)| norm 0.2928 (+0.96z)| lr 7.74e-04 | 4159.33 ms | 32.5% bf16 MFU | 121686 tok/s step 3144/19560 | loss 3.724195 (+0.67z)| norm 0.2601 (-0.25z)| lr 7.74e-04 | 4245.99 ms | 31.8% bf16 MFU | 121776 tok/s step 3145/19560 | loss 3.706014 (+0.13z)| norm 0.2614 (-0.19z)| lr 7.74e-04 | 4170.75 ms | 32.4% bf16 MFU | 
121972 tok/s step 3146/19560 | loss 3.667641 (-0.99z)| norm 0.2777 (+0.43z)| lr 7.74e-04 | 4240.78 ms | 31.8% bf16 MFU | 122055 tok/s step 3147/19560 | loss 3.654666 (-1.36z)| norm 0.2535 (-0.48z)| lr 7.74e-04 | 4161.50 ms | 32.4% bf16 MFU | 122252 tok/s step 3148/19560 | loss 3.714832 (+0.43z)| norm 0.2772 (+0.41z)| lr 7.74e-04 | 4142.79 ms | 32.6% bf16 MFU | 122467 tok/s step 3149/19560 | loss 3.655913 (-1.32z)| norm 0.2613 (-0.19z)| lr 7.74e-04 | 4213.66 ms | 32.0% bf16 MFU | 122565 tok/s step 3150/19560 | loss 3.715921 (+0.46z)| norm 0.2458 (-0.76z)| lr 7.74e-04 | 4197.15 ms | 32.2% bf16 MFU | 122682 tok/s step 3151/19560 | loss 3.660773 (-1.16z)| norm 0.2440 (-0.82z)| lr 7.74e-04 | 4165.93 ms | 32.4% bf16 MFU | 122841 tok/s step 3152/19560 | loss 3.613883 (-2.47z)| norm 0.2455 (-0.76z)| lr 7.74e-04 | 4157.83 ms | 32.5% bf16 MFU | 123004 tok/s step 3153/19560 | loss 3.598343 (-2.81z)| norm 0.2889 (+0.86z)| lr 7.74e-04 | 4165.34 ms | 32.4% bf16 MFU | 123147 tok/s step 3154/19560 | loss 3.683199 (-0.42z)| norm 0.3023 (+1.34z)| lr 7.74e-04 | 4163.13 ms | 32.4% bf16 MFU | 123286 tok/s step 3155/19560 | loss 3.680264 (-0.51z)| norm 0.2917 (+0.93z)| lr 7.74e-04 | 4157.39 ms | 32.5% bf16 MFU | 123428 tok/s step 3156/19560 | loss 3.688059 (-0.28z)| norm 0.2881 (+0.79z)| lr 7.74e-04 | 4149.98 ms | 32.5% bf16 MFU | 123573 tok/s step 3157/19560 | loss 3.733746 (+1.09z)| norm 0.2712 (+0.15z)| lr 7.74e-04 | 4710.00 ms | 28.7% bf16 MFU | 122960 tok/s step 3158/19560 | loss 3.689933 (-0.21z)| norm 0.2517 (-0.59z)| lr 7.74e-04 | 4831.66 ms | 27.9% bf16 MFU | 122237 tok/s step 3159/19560 | loss 3.739862 (+1.27z)| norm 0.2775 (+0.37z)| lr 7.74e-04 | 4366.63 ms | 30.9% bf16 MFU | 122129 tok/s step 3160/19560 | loss 3.684335 (-0.38z)| norm 0.2467 (-0.79z)| lr 7.74e-04 | 4634.81 ms | 29.1% bf16 MFU | 121678 tok/s step 3161/19560 | loss 3.645384 (-1.53z)| norm 0.2701 (+0.08z)| lr 7.74e-04 | 4499.44 ms | 30.0% bf16 MFU | 121421 tok/s step 3162/19560 | loss 3.675565 (-0.63z)| norm 
0.2614 (-0.25z)| lr 7.74e-04 | 4537.89 ms | 29.8% bf16 MFU | 121126 tok/s step 3163/19560 | loss 3.724558 (+0.84z)| norm 0.2453 (-0.85z)| lr 7.74e-04 | 4405.03 ms | 30.7% bf16 MFU | 121021 tok/s step 3164/19560 | loss 3.672071 (-0.74z)| norm 0.3073 (+1.51z)| lr 7.73e-04 | 4203.64 ms | 32.1% bf16 MFU | 121206 tok/s step 3165/19560 | loss 3.690920 (-0.16z)| norm 0.2871 (+0.75z)| lr 7.73e-04 | 4216.74 ms | 32.0% bf16 MFU | 121363 tok/s step 3166/19560 | loss 3.729466 (+0.99z)| norm 0.2494 (-0.68z)| lr 7.73e-04 | 4307.87 ms | 31.3% bf16 MFU | 121380 tok/s step 3167/19560 | loss 3.678476 (-0.53z)| norm 0.2329 (-1.29z)| lr 7.73e-04 | 4726.82 ms | 28.6% bf16 MFU | 120857 tok/s step 3168/19560 | loss 3.720040 (+0.71z)| norm 0.2377 (-1.09z)| lr 7.73e-04 | 4196.86 ms | 32.2% bf16 MFU | 121060 tok/s step 3169/19560 | loss 3.637847 (-1.72z)| norm 0.2433 (-0.87z)| lr 7.73e-04 | 4164.50 ms | 32.4% bf16 MFU | 121302 tok/s step 3170/19560 | loss 3.625290 (-2.04z)| norm 0.2518 (-0.54z)| lr 7.73e-04 | 4261.71 ms | 31.7% bf16 MFU | 121388 tok/s step 3171/19560 | loss 3.613996 (-2.31z)| norm 0.2399 (-0.98z)| lr 7.73e-04 | 4233.75 ms | 31.9% bf16 MFU | 121510 tok/s step 3172/19560 | loss 3.637100 (-1.62z)| norm 0.2223 (-1.62z)| lr 7.73e-04 | 4225.64 ms | 32.0% bf16 MFU | 121638 tok/s step 3173/19560 | loss 3.666764 (-0.77z)| norm 0.2200 (-1.68z)| lr 7.73e-04 | 4215.55 ms | 32.0% bf16 MFU | 121775 tok/s step 3174/19560 | loss 3.721109 (+0.75z)| norm 0.2433 (-0.83z)| lr 7.73e-04 | 4170.90 ms | 32.4% bf16 MFU | 121971 tok/s step 3175/19560 | loss 3.704815 (+0.30z)| norm 0.2745 (+0.33z)| lr 7.73e-04 | 4184.49 ms | 32.3% bf16 MFU | 122137 tok/s step 3176/19560 | loss 3.636430 (-1.61z)| norm 0.2798 (+0.54z)| lr 7.73e-04 | 4176.99 ms | 32.3% bf16 MFU | 122306 tok/s step 3177/19560 | loss 3.646390 (-1.31z)| norm 0.2753 (+0.38z)| lr 7.73e-04 | 5451.95 ms | 24.8% bf16 MFU | 120999 tok/s step 3178/19560 | loss 3.713811 (+0.57z)| norm 0.2860 (+0.79z)| lr 7.73e-04 | 4160.92 ms | 32.4% bf16 MFU | 
121249 tok/s step 3179/19560 | loss 3.640664 (-1.49z)| norm 0.2748 (+0.36z)| lr 7.73e-04 | 4157.49 ms | 32.5% bf16 MFU | 121492 tok/s step 3180/19560 | loss 3.633820 (-1.65z)| norm 0.2643 (-0.04z)| lr 7.73e-04 | 4172.40 ms | 32.4% bf16 MFU | 121701 tok/s step 3181/19560 | loss 3.637413 (-1.53z)| norm 0.2943 (+1.08z)| lr 7.73e-04 | 4166.38 ms | 32.4% bf16 MFU | 121907 tok/s step 3182/19560 | loss 3.762039 (+1.91z)| norm 0.2713 (+0.21z)| lr 7.73e-04 | 4318.42 ms | 31.3% bf16 MFU | 121882 tok/s step 3183/19560 | loss 3.715881 (+0.65z)| norm 0.2645 (-0.06z)| lr 7.73e-04 | 4166.17 ms | 32.4% bf16 MFU | 122080 tok/s step 3184/19560 | loss 3.661961 (-0.84z)| norm 0.2671 (+0.04z)| lr 7.73e-04 | 4193.52 ms | 32.2% bf16 MFU | 122228 tok/s step 3185/19560 | loss 3.740738 (+1.36z)| norm 0.2854 (+0.72z)| lr 7.73e-04 | 4200.67 ms | 32.1% bf16 MFU | 122357 tok/s step 3186/19560 | loss 3.689136 (-0.08z)| norm 0.2772 (+0.40z)| lr 7.73e-04 | 4163.18 ms | 32.4% bf16 MFU | 122536 tok/s step 3187/19560 | loss 3.650012 (-1.16z)| norm 0.2554 (-0.43z)| lr 7.73e-04 | 4226.99 ms | 31.9% bf16 MFU | 122611 tok/s step 3188/19560 | loss 3.779222 (+2.38z)| norm 0.3077 (+1.54z)| lr 7.73e-04 | 4366.19 ms | 30.9% bf16 MFU | 122484 tok/s step 3189/19560 | loss 3.713717 (+0.59z)| norm 0.3070 (+1.49z)| lr 7.73e-04 | 4203.39 ms | 32.1% bf16 MFU | 122596 tok/s step 3190/19560 | loss 3.670484 (-0.59z)| norm 0.2748 (+0.28z)| lr 7.73e-04 | 4210.90 ms | 32.1% bf16 MFU | 122692 tok/s step 3191/19560 | loss 3.630090 (-1.66z)| norm 0.2600 (-0.27z)| lr 7.73e-04 | 4167.67 ms | 32.4% bf16 MFU | 122847 tok/s step 3192/19560 | loss 3.614500 (-2.04z)| norm 0.2823 (+0.59z)| lr 7.73e-04 | 4174.27 ms | 32.3% bf16 MFU | 122985 tok/s step 3193/19560 | loss 3.647919 (-1.14z)| norm 0.2603 (-0.24z)| lr 7.73e-04 | 4172.49 ms | 32.4% bf16 MFU | 123118 tok/s step 3194/19560 | loss 3.653486 (-0.98z)| norm 0.2900 (+0.89z)| lr 7.73e-04 | 4180.08 ms | 32.3% bf16 MFU | 123234 tok/s step 3195/19560 | loss 3.635268 (-1.45z)| norm 
0.2740 (+0.28z)| lr 7.73e-04 | 4152.05 ms | 32.5% bf16 MFU | 123386 tok/s step 3196/19560 | loss 3.687524 (-0.04z)| norm 0.2793 (+0.50z)| lr 7.73e-04 | 4176.00 ms | 32.3% bf16 MFU | 123494 tok/s step 3197/19560 | loss 3.695137 (+0.17z)| norm 0.3171 (+1.90z)| lr 7.73e-04 | 4175.47 ms | 32.3% bf16 MFU | 123597 tok/s step 3198/19560 | loss 3.681482 (-0.19z)| norm 0.2973 (+1.14z)| lr 7.73e-04 | 4158.74 ms | 32.5% bf16 MFU | 123721 tok/s step 3199/19560 | loss 3.667142 (-0.58z)| norm 0.3102 (+1.59z)| lr 7.73e-04 | 4218.88 ms | 32.0% bf16 MFU | 123748 tok/s step 3200/19560 | loss 3.751273 (+1.75z)| norm 0.2781 (+0.38z)| lr 7.73e-04 | 4183.17 ms | 32.3% bf16 MFU | 123828 tok/s step 3201/19560 | loss 3.634492 (-1.46z)| norm 0.2698 (+0.07z)| lr 7.73e-04 | 4176.87 ms | 32.3% bf16 MFU | 123912 tok/s step 3202/19560 | loss 3.717226 (+0.81z)| norm 0.2696 (+0.06z)| lr 7.73e-04 | 4294.68 ms | 31.4% bf16 MFU | 123821 tok/s step 3203/19560 | loss 3.665847 (-0.59z)| norm 0.2575 (-0.40z)| lr 7.73e-04 | 4202.24 ms | 32.1% bf16 MFU | 123868 tok/s step 3204/19560 | loss 3.622310 (-1.76z)| norm 0.2805 (+0.46z)| lr 7.73e-04 | 4189.00 ms | 32.2% bf16 MFU | 123932 tok/s step 3205/19560 | loss 3.607047 (-2.13z)| norm 0.2431 (-0.93z)| lr 7.72e-04 | 4297.22 ms | 31.4% bf16 MFU | 123836 tok/s step 3206/19560 | loss 3.622537 (-1.68z)| norm 0.2413 (-0.98z)| lr 7.72e-04 | 4181.23 ms | 32.3% bf16 MFU | 123914 tok/s step 3207/19560 | loss 3.627089 (-1.53z)| norm 0.2503 (-0.64z)| lr 7.72e-04 | 4179.66 ms | 32.3% bf16 MFU | 123990 tok/s step 3208/19560 | loss 3.680978 (-0.10z)| norm 0.2407 (-0.98z)| lr 7.72e-04 | 4194.73 ms | 32.2% bf16 MFU | 124040 tok/s step 3209/19560 | loss 3.629384 (-1.44z)| norm 0.2424 (-0.91z)| lr 7.72e-04 | 4303.97 ms | 31.4% bf16 MFU | 123929 tok/s step 3210/19560 | loss 3.634044 (-1.32z)| norm 0.2220 (-1.65z)| lr 7.72e-04 | 4186.35 ms | 32.3% bf16 MFU | 123994 tok/s step 3211/19560 | loss 3.681071 (-0.03z)| norm 0.2480 (-0.67z)| lr 7.72e-04 | 4178.23 ms | 32.3% bf16 MFU | 
124068 tok/s step 3212/19560 | loss 3.709906 (+0.76z)| norm 0.2526 (-0.48z)| lr 7.72e-04 | 4201.28 ms | 32.1% bf16 MFU | 124105 tok/s step 3213/19560 | loss 3.615002 (-1.80z)| norm 0.2461 (-0.72z)| lr 7.72e-04 | 4175.03 ms | 32.3% bf16 MFU | 124178 tok/s step 3214/19560 | loss 3.668624 (-0.35z)| norm 0.2766 (+0.45z)| lr 7.72e-04 | 4185.79 ms | 32.3% bf16 MFU | 124232 tok/s step 3215/19560 | loss 3.595815 (-2.26z)| norm 0.3008 (+1.36z)| lr 7.72e-04 | 4170.47 ms | 32.4% bf16 MFU | 124306 tok/s step 3216/19560 | loss 3.661123 (-0.51z)| norm 0.2754 (+0.39z)| lr 7.72e-04 | 4190.11 ms | 32.2% bf16 MFU | 124347 tok/s step 3217/19560 | loss 3.660786 (-0.51z)| norm 0.2629 (-0.08z)| lr 7.72e-04 | 4203.56 ms | 32.1% bf16 MFU | 124366 tok/s step 3218/19560 | loss 3.659292 (-0.54z)| norm 0.2629 (-0.08z)| lr 7.72e-04 | 4188.88 ms | 32.2% bf16 MFU | 124406 tok/s step 3219/19560 | loss 3.612798 (-1.77z)| norm 0.2296 (-1.33z)| lr 7.72e-04 | 4181.28 ms | 32.3% bf16 MFU | 124455 tok/s step 3220/19560 | loss 3.662841 (-0.43z)| norm 0.2479 (-0.63z)| lr 7.72e-04 | 4182.80 ms | 32.3% bf16 MFU | 124499 tok/s step 3221/19560 | loss 3.671303 (-0.20z)| norm 0.2348 (-1.12z)| lr 7.72e-04 | 4231.35 ms | 31.9% bf16 MFU | 124470 tok/s step 3222/19560 | loss 3.676467 (-0.06z)| norm 0.2283 (-1.35z)| lr 7.72e-04 | 4175.98 ms | 32.3% bf16 MFU | 124524 tok/s step 3223/19560 | loss 3.678233 (-0.02z)| norm 0.2427 (-0.82z)| lr 7.72e-04 | 4178.16 ms | 32.3% bf16 MFU | 124572 tok/s step 3224/19560 | loss 3.688826 (+0.27z)| norm 0.2465 (-0.69z)| lr 7.72e-04 | 4172.63 ms | 32.4% bf16 MFU | 124625 tok/s step 3225/19560 | loss 3.680519 (+0.06z)| norm 0.2252 (-1.49z)| lr 7.72e-04 | 4178.67 ms | 32.3% bf16 MFU | 124668 tok/s step 3226/19560 | loss 3.713058 (+0.92z)| norm 0.2765 (+0.44z)| lr 7.72e-04 | 4167.81 ms | 32.4% bf16 MFU | 124724 tok/s step 3227/19560 | loss 3.683990 (+0.14z)| norm 0.3092 (+1.65z)| lr 7.72e-04 | 4254.23 ms | 31.7% bf16 MFU | 124650 tok/s step 3228/19560 | loss 3.674472 (-0.10z)| norm 
0.3031 (+1.40z)| lr 7.72e-04 | 4323.62 ms | 31.2% bf16 MFU | 124480 tok/s step 3229/19560 | loss 3.658345 (-0.54z)| norm 0.2808 (+0.56z)| lr 7.72e-04 | 4203.51 ms | 32.1% bf16 MFU | 124493 tok/s step 3230/19560 | loss 3.728003 (+1.35z)| norm 0.2970 (+1.14z)| lr 7.72e-04 | 4237.43 ms | 31.9% bf16 MFU | 124454 tok/s step 3231/19560 | loss 3.655435 (-0.62z)| norm 0.3142 (+1.75z)| lr 7.72e-04 | 4163.77 ms | 32.4% bf16 MFU | 124527 tok/s step 3232/19560 | loss 3.782867 (+2.74z)| norm 0.3112 (+1.61z)| lr 7.72e-04 | 4185.44 ms | 32.3% bf16 MFU | 124564 tok/s step 3233/19560 | loss 3.680802 (+0.05z)| norm 0.3220 (+1.95z)| lr 7.72e-04 | 4203.41 ms | 32.1% bf16 MFU | 124573 tok/s step 3234/19560 | loss 3.686354 (+0.20z)| norm 0.3207 (+1.87z)| lr 7.72e-04 | 4290.21 ms | 31.5% bf16 MFU | 124454 tok/s step 3235/19560 | loss 3.642304 (-0.95z)| norm 0.2644 (-0.14z)| lr 7.72e-04 | 4205.44 ms | 32.1% bf16 MFU | 124465 tok/s step 3236/19560 | loss 3.683144 (+0.13z)| norm 0.2937 (+0.89z)| lr 7.72e-04 | 4177.89 ms | 32.3% bf16 MFU | 124516 tok/s step 3237/19560 | loss 3.646005 (-0.84z)| norm 0.2605 (-0.29z)| lr 7.72e-04 | 4194.71 ms | 32.2% bf16 MFU | 124540 tok/s step 3238/19560 | loss 3.633871 (-1.14z)| norm 0.2749 (+0.22z)| lr 7.72e-04 | 4232.26 ms | 31.9% bf16 MFU | 124507 tok/s step 3239/19560 | loss 3.678625 (+0.04z)| norm 0.2640 (-0.17z)| lr 7.72e-04 | 4181.94 ms | 32.3% bf16 MFU | 124550 tok/s step 3240/19560 | loss 3.671178 (-0.16z)| norm 0.2546 (-0.50z)| lr 7.72e-04 | 4171.12 ms | 32.4% bf16 MFU | 124607 tok/s step 3241/19560 | loss 3.662285 (-0.39z)| norm 0.2393 (-1.03z)| lr 7.72e-04 | 4181.85 ms | 32.3% bf16 MFU | 124645 tok/s step 3242/19560 | loss 3.687042 (+0.26z)| norm 0.2285 (-1.40z)| lr 7.72e-04 | 4203.98 ms | 32.1% bf16 MFU | 124649 tok/s step 3243/19560 | loss 3.624242 (-1.38z)| norm 0.2360 (-1.13z)| lr 7.72e-04 | 4585.32 ms | 29.4% bf16 MFU | 124133 tok/s step 3244/19560 | loss 3.654280 (-0.57z)| norm 0.2413 (-0.95z)| lr 7.72e-04 | 4177.24 ms | 32.3% bf16 MFU | 
124202 tok/s step 3245/19560 | loss 3.697935 (+0.60z)| norm 0.2505 (-0.64z)| lr 7.71e-04 | 4172.99 ms | 32.4% bf16 MFU | 124274 tok/s step 3246/19560 | loss 3.701507 (+0.69z)| norm 0.2635 (-0.20z)| lr 7.71e-04 | 4237.86 ms | 31.9% bf16 MFU | 124246 tok/s step 3247/19560 | loss 3.682887 (+0.19z)| norm 0.2611 (-0.29z)| lr 7.71e-04 | 4185.51 ms | 32.3% bf16 MFU | 124297 tok/s step 3248/19560 | loss 3.672137 (-0.09z)| norm 0.2613 (-0.29z)| lr 7.71e-04 | 4244.46 ms | 31.8% bf16 MFU | 124258 tok/s step 3249/19560 | loss 3.626126 (-1.30z)| norm 0.2963 (+0.98z)| lr 7.71e-04 | 4186.18 ms | 32.3% bf16 MFU | 124307 tok/s step 3250/19560 | loss 3.695515 (+0.54z)| norm 0.2781 (+0.31z)| lr 7.71e-04 | 4329.85 ms | 31.2% bf16 MFU | 124146 tok/s val loss 3.658435 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating 
HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 
1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2672/10042 = 0.266082 step 3251/19560 | loss 3.703191 (+0.78z)| norm 0.2434 (-0.95z)| lr 7.71e-04 | 4569.00 ms | 29.6% bf16 MFU | 123677 tok/s step 3252/19560 | loss 3.706630 (+0.90z)| norm 0.2558 (-0.48z)| lr 7.71e-04 | 4348.59 ms | 31.0% bf16 MFU | 123521 tok/s step 3253/19560 | loss 3.685754 (+0.32z)| norm 0.2542 (-0.53z)| lr 7.71e-04 | 4299.20 ms | 31.4% bf16 MFU | 123442 tok/s step 3254/19560 | loss 3.643356 (-0.84z)| norm 0.2874 (+0.82z)| lr 7.71e-04 | 4246.88 ms | 31.8% bf16 MFU | 123443 tok/s step 3255/19560 | loss 3.631902 (-1.14z)| norm 0.2614 (-0.22z)| lr 7.71e-04 | 4179.32 ms | 32.3% bf16 MFU | 123543 tok/s step 3256/19560 | loss 3.639205 (-0.93z)| norm 0.2462 (-0.85z)| lr 7.71e-04 | 4205.48 ms | 32.1% bf16 MFU | 123599 tok/s step 3257/19560 | loss 3.696873 (+0.66z)| norm 0.2591 (-0.29z)| lr 7.71e-04 | 4357.59 ms | 31.0% bf16 MFU | 123435 tok/s step 3258/19560 | loss 3.736522 (+1.71z)| norm 0.2927 (+1.11z)| lr 7.71e-04 | 4358.97 ms | 31.0% bf16 MFU | 123277 tok/s step 3259/19560 | loss 3.622335 (-1.36z)| norm 0.3161 (+2.05z)| lr 7.71e-04 | 4323.73 ms | 31.2% bf16 MFU | 123176 tok/s step 3260/19560 | loss 3.622172 (-1.35z)| norm 0.2790 (+0.49z)| lr 7.71e-04 | 4311.18 ms | 31.3% bf16 MFU | 123098 tok/s step 3261/19560 | loss 3.708878 (+0.96z)| norm 0.2495 (-0.74z)| lr 
7.71e-04 | 4231.90 ms | 31.9% bf16 MFU | 123138 tok/s step 3262/19560 | loss 3.642852 (-0.80z)| norm 0.2661 (-0.06z)| lr 7.71e-04 | 4495.68 ms | 30.0% bf16 MFU | 122812 tok/s step 3263/19560 | loss 3.651212 (-0.57z)| norm 0.2877 (+0.84z)| lr 7.71e-04 | 4191.41 ms | 32.2% bf16 MFU | 122926 tok/s step 3264/19560 | loss 3.687068 (+0.39z)| norm 0.2705 (+0.11z)| lr 7.71e-04 | 4203.28 ms | 32.1% bf16 MFU | 123016 tok/s step 3265/19560 | loss 3.604511 (-1.79z)| norm 0.2385 (-1.22z)| lr 7.71e-04 | 4272.54 ms | 31.6% bf16 MFU | 123001 tok/s step 3266/19560 | loss 3.713008 (+1.07z)| norm 0.2507 (-0.71z)| lr 7.71e-04 | 4180.15 ms | 32.3% bf16 MFU | 123122 tok/s step 3267/19560 | loss 3.665056 (-0.19z)| norm 0.2655 (-0.08z)| lr 7.71e-04 | 4332.19 ms | 31.2% bf16 MFU | 123017 tok/s step 3268/19560 | loss 3.671615 (-0.02z)| norm 0.3000 (+1.34z)| lr 7.71e-04 | 4297.69 ms | 31.4% bf16 MFU | 122966 tok/s step 3269/19560 | loss 3.661231 (-0.29z)| norm 0.2728 (+0.22z)| lr 7.71e-04 | 4354.21 ms | 31.0% bf16 MFU | 122838 tok/s step 3270/19560 | loss 3.668212 (-0.10z)| norm 0.2661 (-0.05z)| lr 7.71e-04 | 4319.43 ms | 31.3% bf16 MFU | 122765 tok/s step 3271/19560 | loss 3.702682 (+0.80z)| norm 0.2898 (+0.95z)| lr 7.71e-04 | 4177.63 ms | 32.3% bf16 MFU | 122902 tok/s step 3272/19560 | loss 3.685966 (+0.37z)| norm 0.2863 (+0.79z)| lr 7.71e-04 | 4174.93 ms | 32.3% bf16 MFU | 123036 tok/s step 3273/19560 | loss 3.654949 (-0.44z)| norm 0.2754 (+0.33z)| lr 7.71e-04 | 4257.72 ms | 31.7% bf16 MFU | 123041 tok/s step 3274/19560 | loss 3.697380 (+0.68z)| norm 0.2664 (-0.05z)| lr 7.71e-04 | 4229.22 ms | 31.9% bf16 MFU | 123087 tok/s step 3275/19560 | loss 3.777417 (+2.69z)| norm 0.2648 (-0.12z)| lr 7.71e-04 | 4165.29 ms | 32.4% bf16 MFU | 123226 tok/s step 3276/19560 | loss 3.665446 (-0.18z)| norm 0.2628 (-0.20z)| lr 7.71e-04 | 4177.01 ms | 32.3% bf16 MFU | 123341 tok/s step 3277/19560 | loss 3.763300 (+2.28z)| norm 0.2616 (-0.25z)| lr 7.71e-04 | 4171.11 ms | 32.4% bf16 MFU | 123458 tok/s step 
3278/19560 | loss 3.736832 (+1.60z)| norm 0.2826 (+0.62z)| lr 7.71e-04 | 4288.81 ms | 31.5% bf16 MFU | 123398 tok/s step 3279/19560 | loss 3.690346 (+0.42z)| norm 0.2749 (+0.29z)| lr 7.71e-04 | 4237.03 ms | 31.9% bf16 MFU | 123415 tok/s step 3280/19560 | loss 3.698713 (+0.62z)| norm 0.2618 (-0.27z)| lr 7.71e-04 | 4164.06 ms | 32.4% bf16 MFU | 123540 tok/s step 3281/19560 | loss 3.653513 (-0.55z)| norm 0.2725 (+0.19z)| lr 7.71e-04 | 4233.85 ms | 31.9% bf16 MFU | 123554 tok/s step 3282/19560 | loss 3.680389 (+0.14z)| norm 0.2504 (-0.74z)| lr 7.71e-04 | 4290.88 ms | 31.5% bf16 MFU | 123486 tok/s step 3283/19560 | loss 3.638083 (-0.93z)| norm 0.2426 (-1.06z)| lr 7.71e-04 | 4180.26 ms | 32.3% bf16 MFU | 123583 tok/s step 3284/19560 | loss 3.651564 (-0.58z)| norm 0.2262 (-1.73z)| lr 7.71e-04 | 4304.65 ms | 31.4% bf16 MFU | 123493 tok/s step 3285/19560 | loss 3.689173 (+0.40z)| norm 0.2429 (-1.00z)| lr 7.70e-04 | 4332.91 ms | 31.2% bf16 MFU | 123369 tok/s step 3286/19560 | loss 3.754036 (+2.03z)| norm 0.2755 (+0.37z)| lr 7.70e-04 | 4252.48 ms | 31.8% bf16 MFU | 123365 tok/s step 3287/19560 | loss 3.689782 (+0.41z)| norm 0.2851 (+0.77z)| lr 7.70e-04 | 4167.99 ms | 32.4% bf16 MFU | 123486 tok/s step 3288/19560 | loss 3.650817 (-0.59z)| norm 0.2641 (-0.12z)| lr 7.70e-04 | 4230.68 ms | 31.9% bf16 MFU | 123508 tok/s step 3289/19560 | loss 3.751919 (+1.96z)| norm 0.2900 (+0.97z)| lr 7.70e-04 | 4330.43 ms | 31.2% bf16 MFU | 123386 tok/s step 3290/19560 | loss 3.794088 (+2.91z)| norm 0.2927 (+1.07z)| lr 7.70e-04 | 4180.78 ms | 32.3% bf16 MFU | 123487 tok/s step 3291/19560 | loss 3.631268 (-1.06z)| norm 0.2675 (-0.00z)| lr 7.70e-04 | 4309.51 ms | 31.3% bf16 MFU | 123396 tok/s step 3292/19560 | loss 3.699306 (+0.60z)| norm 0.2667 (-0.02z)| lr 7.70e-04 | 4183.70 ms | 32.3% bf16 MFU | 123492 tok/s step 3293/19560 | loss 3.670126 (-0.11z)| norm 0.2537 (-0.56z)| lr 7.70e-04 | 4171.12 ms | 32.4% bf16 MFU | 123602 tok/s step 3294/19560 | loss 3.610949 (-1.54z)| norm 0.2627 (-0.19z)| lr 
7.70e-04 | 4169.51 ms | 32.4% bf16 MFU | 123709 tok/s step 3295/19560 | loss 3.653635 (-0.49z)| norm 0.2663 (-0.04z)| lr 7.70e-04 | 4192.14 ms | 32.2% bf16 MFU | 123777 tok/s step 3296/19560 | loss 3.642418 (-0.75z)| norm 0.2606 (-0.30z)| lr 7.70e-04 | 4182.58 ms | 32.3% bf16 MFU | 123855 tok/s step 3297/19560 | loss 3.634351 (-0.95z)| norm 0.2628 (-0.21z)| lr 7.70e-04 | 4174.23 ms | 32.3% bf16 MFU | 123943 tok/s step 3298/19560 | loss 3.652833 (-0.50z)| norm 0.2497 (-0.78z)| lr 7.70e-04 | 4186.69 ms | 32.2% bf16 MFU | 124007 tok/s step 3299/19560 | loss 3.659025 (-0.36z)| norm 0.2689 (+0.05z)| lr 7.70e-04 | 4173.30 ms | 32.4% bf16 MFU | 124088 tok/s step 3300/19560 | loss 3.671751 (-0.05z)| norm 0.2755 (+0.33z)| lr 7.70e-04 | 4185.51 ms | 32.3% bf16 MFU | 124147 tok/s step 3301/19560 | loss 3.660367 (-0.33z)| norm 0.2655 (-0.14z)| lr 7.70e-04 | 4184.34 ms | 32.3% bf16 MFU | 124204 tok/s step 3302/19560 | loss 3.687665 (+0.35z)| norm 0.2764 (+0.35z)| lr 7.70e-04 | 4281.01 ms | 31.5% bf16 MFU | 124117 tok/s step 3303/19560 | loss 3.689178 (+0.40z)| norm 0.2423 (-1.20z)| lr 7.70e-04 | 4281.65 ms | 31.5% bf16 MFU | 124034 tok/s step 3304/19560 | loss 3.654496 (-0.48z)| norm 0.2644 (-0.19z)| lr 7.70e-04 | 4175.98 ms | 32.3% bf16 MFU | 124110 tok/s step 3305/19560 | loss 3.663553 (-0.25z)| norm 0.2826 (+0.64z)| lr 7.70e-04 | 4171.15 ms | 32.4% bf16 MFU | 124189 tok/s step 3306/19560 | loss 3.622304 (-1.27z)| norm 0.3026 (+1.53z)| lr 7.70e-04 | 4193.01 ms | 32.2% bf16 MFU | 124231 tok/s step 3307/19560 | loss 3.677511 (+0.11z)| norm 0.3376 (+2.98z)| lr 7.70e-04 | 4310.23 ms | 31.3% bf16 MFU | 124102 tok/s step 3308/19560 | loss 3.637950 (-0.89z)| norm 0.2866 (+0.75z)| lr 7.70e-04 | 4347.17 ms | 31.1% bf16 MFU | 123927 tok/s step 3309/19560 | loss 3.630730 (-1.07z)| norm 0.2424 (-1.15z)| lr 7.70e-04 | 4263.86 ms | 31.7% bf16 MFU | 123879 tok/s step 3310/19560 | loss 3.692986 (+0.52z)| norm 0.2579 (-0.47z)| lr 7.70e-04 | 4166.60 ms | 32.4% bf16 MFU | 123976 tok/s step 
3311/19560 | loss 3.737814 (+1.65z)| norm 0.2826 (+0.59z)| lr 7.70e-04 | 4166.57 ms | 32.4% bf16 MFU | 124069 tok/s step 3312/19560 | loss 3.639363 (-0.84z)| norm 0.2743 (+0.23z)| lr 7.70e-04 | 4200.79 ms | 32.1% bf16 MFU | 124106 tok/s step 3313/19560 | loss 3.667624 (-0.11z)| norm 0.2596 (-0.40z)| lr 7.70e-04 | 4191.03 ms | 32.2% bf16 MFU | 124155 tok/s step 3314/19560 | loss 3.701328 (+0.75z)| norm 0.2706 (+0.08z)| lr 7.70e-04 | 4185.79 ms | 32.3% bf16 MFU | 124210 tok/s step 3315/19560 | loss 3.678051 (+0.14z)| norm 0.2634 (-0.24z)| lr 7.70e-04 | 4183.39 ms | 32.3% bf16 MFU | 124266 tok/s step 3316/19560 | loss 3.677156 (+0.15z)| norm 0.2268 (-1.80z)| lr 7.70e-04 | 4184.20 ms | 32.3% bf16 MFU | 124318 tok/s step 3317/19560 | loss 3.645164 (-0.69z)| norm 0.2377 (-1.31z)| lr 7.70e-04 | 4172.76 ms | 32.4% bf16 MFU | 124384 tok/s step 3318/19560 | loss 3.760306 (+2.30z)| norm 0.2479 (-0.85z)| lr 7.70e-04 | 4310.69 ms | 31.3% bf16 MFU | 124246 tok/s step 3319/19560 | loss 3.603694 (-1.75z)| norm 0.2326 (-1.49z)| lr 7.70e-04 | 4177.12 ms | 32.3% bf16 MFU | 124310 tok/s step 3320/19560 | loss 3.601969 (-1.78z)| norm 0.2512 (-0.68z)| lr 7.70e-04 | 4193.88 ms | 32.2% bf16 MFU | 124345 tok/s step 3321/19560 | loss 3.596856 (-1.88z)| norm 0.2547 (-0.53z)| lr 7.70e-04 | 4168.81 ms | 32.4% bf16 MFU | 124416 tok/s step 3322/19560 | loss 3.621442 (-1.25z)| norm 0.2707 (+0.17z)| lr 7.70e-04 | 4183.74 ms | 32.3% bf16 MFU | 124461 tok/s step 3323/19560 | loss 3.661412 (-0.24z)| norm 0.2699 (+0.13z)| lr 7.69e-04 | 4192.36 ms | 32.2% bf16 MFU | 124491 tok/s step 3324/19560 | loss 3.661490 (-0.24z)| norm 0.2610 (-0.25z)| lr 7.69e-04 | 4198.84 ms | 32.2% bf16 MFU | 124509 tok/s step 3325/19560 | loss 3.675102 (+0.11z)| norm 0.2549 (-0.50z)| lr 7.69e-04 | 4175.06 ms | 32.3% bf16 MFU | 124563 tok/s step 3326/19560 | loss 3.639783 (-0.77z)| norm 0.2660 (+0.00z)| lr 7.69e-04 | 4176.57 ms | 32.3% bf16 MFU | 124611 tok/s step 3327/19560 | loss 3.706474 (+0.90z)| norm 0.2516 (-0.63z)| lr 
7.69e-04 | 4174.84 ms | 32.3% bf16 MFU | 124660 tok/s step 3328/19560 | loss 3.644802 (-0.64z)| norm 0.2425 (-1.02z)| lr 7.69e-04 | 4350.62 ms | 31.0% bf16 MFU | 124452 tok/s step 3329/19560 | loss 3.675646 (+0.14z)| norm 0.2493 (-0.70z)| lr 7.69e-04 | 4212.52 ms | 32.1% bf16 MFU | 124453 tok/s step 3330/19560 | loss 3.670135 (+0.01z)| norm 0.2410 (-1.06z)| lr 7.69e-04 | 4177.46 ms | 32.3% bf16 MFU | 124505 tok/s step 3331/19560 | loss 3.690958 (+0.54z)| norm 0.2205 (-1.94z)| lr 7.69e-04 | 4179.15 ms | 32.3% bf16 MFU | 124553 tok/s step 3332/19560 | loss 3.686890 (+0.43z)| norm 0.2432 (-0.92z)| lr 7.69e-04 | 4197.40 ms | 32.2% bf16 MFU | 124570 tok/s step 3333/19560 | loss 3.666602 (-0.11z)| norm 0.2445 (-0.87z)| lr 7.69e-04 | 4224.22 ms | 32.0% bf16 MFU | 124548 tok/s step 3334/19560 | loss 3.608152 (-1.64z)| norm 0.2736 (+0.40z)| lr 7.69e-04 | 4227.63 ms | 31.9% bf16 MFU | 124521 tok/s step 3335/19560 | loss 3.678971 (+0.20z)| norm 0.2882 (+1.03z)| lr 7.69e-04 | 4177.33 ms | 32.3% bf16 MFU | 124570 tok/s step 3336/19560 | loss 3.674316 (+0.08z)| norm 0.2831 (+0.79z)| lr 7.69e-04 | 4473.80 ms | 30.2% bf16 MFU | 124201 tok/s step 3337/19560 | loss 3.705850 (+0.90z)| norm 0.2705 (+0.23z)| lr 7.69e-04 | 4162.02 ms | 32.4% bf16 MFU | 124290 tok/s step 3338/19560 | loss 3.715448 (+1.13z)| norm 0.2826 (+0.75z)| lr 7.69e-04 | 4265.56 ms | 31.7% bf16 MFU | 124221 tok/s step 3339/19560 | loss 3.644812 (-0.72z)| norm 0.2946 (+1.27z)| lr 7.69e-04 | 4293.89 ms | 31.4% bf16 MFU | 124115 tok/s step 3340/19560 | loss 3.696769 (+0.65z)| norm 0.3041 (+1.66z)| lr 7.69e-04 | 4185.21 ms | 32.3% bf16 MFU | 124173 tok/s step 3341/19560 | loss 3.690384 (+0.47z)| norm 0.3269 (+2.58z)| lr 7.69e-04 | 4213.67 ms | 32.0% bf16 MFU | 124185 tok/s step 3342/19560 | loss 3.631952 (-1.06z)| norm 0.3205 (+2.24z)| lr 7.69e-04 | 4166.23 ms | 32.4% bf16 MFU | 124268 tok/s step 3343/19560 | loss 3.665874 (-0.19z)| norm 0.3000 (+1.38z)| lr 7.69e-04 | 4170.64 ms | 32.4% bf16 MFU | 124340 tok/s step 
3344/19560 | loss 3.657610 (-0.41z)| norm 0.2424 (-1.05z)| lr 7.69e-04 | 4167.84 ms | 32.4% bf16 MFU | 124413 tok/s step 3345/19560 | loss 3.661406 (-0.31z)| norm 0.2619 (-0.23z)| lr 7.69e-04 | 4181.54 ms | 32.3% bf16 MFU | 124461 tok/s step 3346/19560 | loss 3.694472 (+0.57z)| norm 0.2673 (-0.00z)| lr 7.69e-04 | 4224.89 ms | 32.0% bf16 MFU | 124443 tok/s step 3347/19560 | loss 3.611342 (-1.65z)| norm 0.2396 (-1.18z)| lr 7.69e-04 | 4165.13 ms | 32.4% bf16 MFU | 124515 tok/s step 3348/19560 | loss 3.764315 (+2.36z)| norm 0.2538 (-0.58z)| lr 7.69e-04 | 4707.47 ms | 28.7% bf16 MFU | 123858 tok/s step 3349/19560 | loss 3.652803 (-0.55z)| norm 0.2956 (+1.18z)| lr 7.69e-04 | 4555.10 ms | 29.6% bf16 MFU | 123420 tok/s step 3350/19560 | loss 3.667320 (-0.17z)| norm 0.2564 (-0.50z)| lr 7.69e-04 | 4294.13 ms | 31.4% bf16 MFU | 123353 tok/s step 3351/19560 | loss 3.720974 (+1.22z)| norm 0.2668 (-0.06z)| lr 7.69e-04 | 4566.76 ms | 29.6% bf16 MFU | 122926 tok/s step 3352/19560 | loss 3.654276 (-0.51z)| norm 0.2693 (+0.03z)| lr 7.69e-04 | 4383.64 ms | 30.8% bf16 MFU | 122760 tok/s step 3353/19560 | loss 3.641812 (-0.82z)| norm 0.2409 (-1.21z)| lr 7.69e-04 | 4369.72 ms | 30.9% bf16 MFU | 122621 tok/s step 3354/19560 | loss 3.651945 (-0.55z)| norm 0.2581 (-0.45z)| lr 7.69e-04 | 4365.98 ms | 30.9% bf16 MFU | 122494 tok/s step 3355/19560 | loss 3.659402 (-0.35z)| norm 0.2993 (+1.36z)| lr 7.69e-04 | 4161.74 ms | 32.4% bf16 MFU | 122668 tok/s step 3356/19560 | loss 3.727229 (+1.39z)| norm 0.3163 (+2.08z)| lr 7.69e-04 | 4280.39 ms | 31.5% bf16 MFU | 122659 tok/s step 3357/19560 | loss 3.647210 (-0.67z)| norm 0.3473 (+3.27z)| lr 7.69e-04 | 4217.05 ms | 32.0% bf16 MFU | 122742 tok/s step 3358/19560 | loss 3.680983 (+0.21z)| norm 0.3334 (+2.62z)| lr 7.69e-04 | 4198.50 ms | 32.2% bf16 MFU | 122849 tok/s step 3359/19560 | loss 3.651710 (-0.54z)| norm 0.2827 (+0.57z)| lr 7.69e-04 | 4276.57 ms | 31.6% bf16 MFU | 122836 tok/s step 3360/19560 | loss 3.769257 (+2.52z)| norm 0.3016 (+1.35z)| lr 
7.69e-04 | 4225.02 ms | 32.0% bf16 MFU | 122899 tok/s step 3361/19560 | loss 3.698262 (+0.67z)| norm 0.2686 (+0.00z)| lr 7.69e-04 | 4157.20 ms | 32.5% bf16 MFU | 123060 tok/s step 3362/19560 | loss 3.697919 (+0.65z)| norm 0.2543 (-0.59z)| lr 7.68e-04 | 4156.43 ms | 32.5% bf16 MFU | 123214 tok/s step 3363/19560 | loss 3.687386 (+0.37z)| norm 0.2694 (+0.06z)| lr 7.68e-04 | 4211.76 ms | 32.1% bf16 MFU | 123277 tok/s step 3364/19560 | loss 3.644939 (-0.73z)| norm 0.2720 (+0.18z)| lr 7.68e-04 | 4176.64 ms | 32.3% bf16 MFU | 123390 tok/s step 3365/19560 | loss 3.692527 (+0.50z)| norm 0.2754 (+0.32z)| lr 7.68e-04 | 4157.60 ms | 32.5% bf16 MFU | 123526 tok/s step 3366/19560 | loss 3.671451 (-0.05z)| norm 0.2970 (+1.24z)| lr 7.68e-04 | 4225.01 ms | 32.0% bf16 MFU | 123554 tok/s step 3367/19560 | loss 3.660811 (-0.33z)| norm 0.2634 (-0.21z)| lr 7.68e-04 | 4170.28 ms | 32.4% bf16 MFU | 123662 tok/s step 3368/19560 | loss 3.675564 (+0.06z)| norm 0.2520 (-0.70z)| lr 7.68e-04 | 4207.28 ms | 32.1% bf16 MFU | 123710 tok/s step 3369/19560 | loss 3.700656 (+0.70z)| norm 0.2564 (-0.51z)| lr 7.68e-04 | 4168.41 ms | 32.4% bf16 MFU | 123813 tok/s step 3370/19560 | loss 3.760881 (+2.22z)| norm 0.2700 (+0.06z)| lr 7.68e-04 | 4236.67 ms | 31.9% bf16 MFU | 123810 tok/s step 3371/19560 | loss 3.621716 (-1.35z)| norm 0.2400 (-1.25z)| lr 7.68e-04 | 4221.53 ms | 32.0% bf16 MFU | 123829 tok/s step 3372/19560 | loss 3.693898 (+0.49z)| norm 0.2525 (-0.71z)| lr 7.68e-04 | 4171.31 ms | 32.4% bf16 MFU | 123922 tok/s step 3373/19560 | loss 3.662810 (-0.30z)| norm 0.2609 (-0.35z)| lr 7.68e-04 | 4197.18 ms | 32.2% bf16 MFU | 123972 tok/s step 3374/19560 | loss 3.760601 (+2.16z)| norm 0.2782 (+0.41z)| lr 7.68e-04 | 4164.59 ms | 32.4% bf16 MFU | 124068 tok/s step 3375/19560 | loss 3.684552 (+0.25z)| norm 0.2774 (+0.37z)| lr 7.68e-04 | 4165.05 ms | 32.4% bf16 MFU | 124158 tok/s step 3376/19560 | loss 3.660269 (-0.36z)| norm 0.2547 (-0.63z)| lr 7.68e-04 | 4172.54 ms | 32.4% bf16 MFU | 124233 tok/s step 
3377/19560 | loss 3.678329 (+0.08z)| norm 0.2609 (-0.35z)| lr 7.68e-04 | 4173.52 ms | 32.4% bf16 MFU | 124302 tok/s step 3378/19560 | loss 3.634034 (-1.03z)| norm 0.2456 (-1.01z)| lr 7.68e-04 | 4157.78 ms | 32.5% bf16 MFU | 124392 tok/s step 3379/19560 | loss 3.679844 (+0.14z)| norm 0.2345 (-1.49z)| lr 7.68e-04 | 4174.35 ms | 32.3% bf16 MFU | 124452 tok/s step 3380/19560 | loss 3.718987 (+1.12z)| norm 0.2572 (-0.49z)| lr 7.68e-04 | 4162.69 ms | 32.4% bf16 MFU | 124527 tok/s step 3381/19560 | loss 3.653923 (-0.51z)| norm 0.2541 (-0.63z)| lr 7.68e-04 | 4161.45 ms | 32.4% bf16 MFU | 124600 tok/s step 3382/19560 | loss 3.695910 (+0.54z)| norm 0.2825 (+0.62z)| lr 7.68e-04 | 4156.78 ms | 32.5% bf16 MFU | 124677 tok/s step 3383/19560 | loss 3.656831 (-0.46z)| norm 0.2664 (-0.09z)| lr 7.68e-04 | 4171.85 ms | 32.4% bf16 MFU | 124726 tok/s step 3384/19560 | loss 3.652966 (-0.56z)| norm 0.3018 (+1.44z)| lr 7.68e-04 | 4159.44 ms | 32.5% bf16 MFU | 124793 tok/s step 3385/19560 | loss 3.662989 (-0.30z)| norm 0.3379 (+2.90z)| lr 7.68e-04 | 4159.81 ms | 32.5% bf16 MFU | 124855 tok/s step 3386/19560 | loss 3.604201 (-1.76z)| norm 0.3158 (+1.93z)| lr 7.68e-04 | 4175.35 ms | 32.3% bf16 MFU | 124890 tok/s step 3387/19560 | loss 3.719736 (+1.15z)| norm 0.2634 (-0.25z)| lr 7.68e-04 | 4166.47 ms | 32.4% bf16 MFU | 124938 tok/s step 3388/19560 | loss 3.739138 (+1.61z)| norm 0.2589 (-0.43z)| lr 7.68e-04 | 4172.36 ms | 32.4% bf16 MFU | 124974 tok/s step 3389/19560 | loss 3.652434 (-0.57z)| norm 0.2407 (-1.20z)| lr 7.68e-04 | 4159.23 ms | 32.5% bf16 MFU | 125028 tok/s step 3390/19560 | loss 3.648925 (-0.66z)| norm 0.2548 (-0.60z)| lr 7.68e-04 | 4174.23 ms | 32.3% bf16 MFU | 125056 tok/s step 3391/19560 | loss 3.637369 (-0.95z)| norm 0.2556 (-0.55z)| lr 7.68e-04 | 4172.57 ms | 32.4% bf16 MFU | 125086 tok/s step 3392/19560 | loss 3.662778 (-0.30z)| norm 0.2655 (-0.13z)| lr 7.68e-04 | 4165.51 ms | 32.4% bf16 MFU | 125125 tok/s step 3393/19560 | loss 3.639070 (-0.92z)| norm 0.2724 (+0.15z)| lr 
7.68e-04 | 4526.78 ms | 29.8% bf16 MFU | 124660 tok/s step 3394/19560 | loss 3.675988 (+0.03z)| norm 0.2480 (-0.89z)| lr 7.68e-04 | 4160.03 ms | 32.5% bf16 MFU | 124728 tok/s step 3395/19560 | loss 3.698953 (+0.62z)| norm 0.2337 (-1.47z)| lr 7.68e-04 | 4172.75 ms | 32.4% bf16 MFU | 124774 tok/s step 3396/19560 | loss 3.650767 (-0.61z)| norm 0.2464 (-0.92z)| lr 7.68e-04 | 4205.31 ms | 32.1% bf16 MFU | 124769 tok/s step 3397/19560 | loss 3.719836 (+1.14z)| norm 0.2729 (+0.20z)| lr 7.68e-04 | 4253.91 ms | 31.7% bf16 MFU | 124693 tok/s step 3398/19560 | loss 3.651213 (-0.61z)| norm 0.2532 (-0.63z)| lr 7.68e-04 | 4244.13 ms | 31.8% bf16 MFU | 124635 tok/s step 3399/19560 | loss 3.617451 (-1.44z)| norm 0.2769 (+0.38z)| lr 7.67e-04 | 4155.05 ms | 32.5% bf16 MFU | 124712 tok/s step 3400/19560 | loss 3.689857 (+0.39z)| norm 0.2730 (+0.22z)| lr 7.67e-04 | 4155.88 ms | 32.5% bf16 MFU | 124784 tok/s step 3401/19560 | loss 3.632294 (-1.06z)| norm 0.2470 (-0.88z)| lr 7.67e-04 | 4157.97 ms | 32.5% bf16 MFU | 124850 tok/s step 3402/19560 | loss 3.629648 (-1.11z)| norm 0.2474 (-0.85z)| lr 7.67e-04 | 4236.90 ms | 31.9% bf16 MFU | 124794 tok/s step 3403/19560 | loss 3.654923 (-0.46z)| norm 0.2478 (-0.83z)| lr 7.67e-04 | 4169.02 ms | 32.4% bf16 MFU | 124843 tok/s step 3404/19560 | loss 3.649827 (-0.59z)| norm 0.2390 (-1.19z)| lr 7.67e-04 | 4164.94 ms | 32.4% bf16 MFU | 124895 tok/s step 3405/19560 | loss 3.626946 (-1.17z)| norm 0.2694 (+0.09z)| lr 7.67e-04 | 4167.72 ms | 32.4% bf16 MFU | 124940 tok/s step 3406/19560 | loss 3.690311 (+0.50z)| norm 0.2810 (+0.58z)| lr 7.67e-04 | 4176.39 ms | 32.3% bf16 MFU | 124970 tok/s step 3407/19560 | loss 3.709369 (+1.00z)| norm 0.2595 (-0.32z)| lr 7.67e-04 | 4155.93 ms | 32.5% bf16 MFU | 125029 tok/s step 3408/19560 | loss 3.616488 (-1.42z)| norm 0.2475 (-0.82z)| lr 7.67e-04 | 4169.19 ms | 32.4% bf16 MFU | 125065 tok/s step 3409/19560 | loss 3.674370 (+0.09z)| norm 0.2496 (-0.72z)| lr 7.67e-04 | 4156.00 ms | 32.5% bf16 MFU | 125119 tok/s step 
3410/19560 | loss 3.654628 (-0.42z)| norm 0.2327 (-1.41z)| lr 7.67e-04 | 4166.38 ms | 32.4% bf16 MFU | 125155 tok/s step 3411/19560 | loss 3.775162 (+2.64z)| norm 0.2489 (-0.74z)| lr 7.67e-04 | 4169.14 ms | 32.4% bf16 MFU | 125185 tok/s step 3412/19560 | loss 3.628741 (-1.09z)| norm 0.2501 (-0.71z)| lr 7.67e-04 | 4161.63 ms | 32.4% bf16 MFU | 125225 tok/s step 3413/19560 | loss 3.660174 (-0.29z)| norm 0.2473 (-0.83z)| lr 7.67e-04 | 4157.60 ms | 32.5% bf16 MFU | 125269 tok/s step 3414/19560 | loss 3.658803 (-0.31z)| norm 0.2707 (+0.16z)| lr 7.67e-04 | 4201.99 ms | 32.1% bf16 MFU | 125244 tok/s step 3415/19560 | loss 3.712979 (+1.09z)| norm 0.2822 (+0.65z)| lr 7.67e-04 | 4168.83 ms | 32.4% bf16 MFU | 125270 tok/s step 3416/19560 | loss 3.654757 (-0.42z)| norm 0.2738 (+0.28z)| lr 7.67e-04 | 4164.37 ms | 32.4% bf16 MFU | 125301 tok/s step 3417/19560 | loss 3.704249 (+0.89z)| norm 0.2593 (-0.32z)| lr 7.67e-04 | 4158.29 ms | 32.5% bf16 MFU | 125341 tok/s step 3418/19560 | loss 3.690392 (+0.57z)| norm 0.2695 (+0.13z)| lr 7.67e-04 | 4162.19 ms | 32.4% bf16 MFU | 125372 tok/s step 3419/19560 | loss 3.592885 (-2.07z)| norm 0.2619 (-0.20z)| lr 7.67e-04 | 4159.79 ms | 32.5% bf16 MFU | 125405 tok/s step 3420/19560 | loss 3.697989 (+0.77z)| norm 0.2429 (-0.99z)| lr 7.67e-04 | 4159.81 ms | 32.5% bf16 MFU | 125437 tok/s step 3421/19560 | loss 3.694791 (+0.68z)| norm 0.2519 (-0.61z)| lr 7.67e-04 | 4167.41 ms | 32.4% bf16 MFU | 125455 tok/s step 3422/19560 | loss 3.625070 (-1.21z)| norm 0.2627 (-0.15z)| lr 7.67e-04 | 4158.43 ms | 32.5% bf16 MFU | 125486 tok/s step 3423/19560 | loss 3.616554 (-1.42z)| norm 0.2379 (-1.19z)| lr 7.67e-04 | 4171.21 ms | 32.4% bf16 MFU | 125497 tok/s step 3424/19560 | loss 3.653427 (-0.43z)| norm 0.2489 (-0.72z)| lr 7.67e-04 | 4159.79 ms | 32.5% bf16 MFU | 125524 tok/s step 3425/19560 | loss 3.695797 (+0.70z)| norm 0.2691 (+0.13z)| lr 7.67e-04 | 4161.15 ms | 32.4% bf16 MFU | 125547 tok/s step 3426/19560 | loss 3.687441 (+0.46z)| norm 0.2492 (-0.71z)| lr 
7.67e-04 | 4158.84 ms | 32.5% bf16 MFU | 125573 tok/s step 3427/19560 | loss 3.592061 (-2.06z)| norm 0.2527 (-0.55z)| lr 7.67e-04 | 4159.76 ms | 32.5% bf16 MFU | 125596 tok/s step 3428/19560 | loss 3.730542 (+1.58z)| norm 0.2361 (-1.23z)| lr 7.67e-04 | 4163.96 ms | 32.4% bf16 MFU | 125612 tok/s step 3429/19560 | loss 3.637792 (-0.84z)| norm 0.2502 (-0.64z)| lr 7.67e-04 | 4171.11 ms | 32.4% bf16 MFU | 125616 tok/s step 3430/19560 | loss 3.651110 (-0.49z)| norm 0.2490 (-0.68z)| lr 7.67e-04 | 4197.92 ms | 32.2% bf16 MFU | 125580 tok/s step 3431/19560 | loss 3.674674 (+0.13z)| norm 0.2821 (+0.69z)| lr 7.67e-04 | 4164.33 ms | 32.4% bf16 MFU | 125596 tok/s step 3432/19560 | loss 3.631762 (-0.98z)| norm 0.2877 (+0.91z)| lr 7.67e-04 | 4166.78 ms | 32.4% bf16 MFU | 125607 tok/s step 3433/19560 | loss 3.648210 (-0.55z)| norm 0.2869 (+0.87z)| lr 7.67e-04 | 4180.17 ms | 32.3% bf16 MFU | 125598 tok/s step 3434/19560 | loss 3.624856 (-1.16z)| norm 0.2849 (+0.80z)| lr 7.67e-04 | 4156.04 ms | 32.5% bf16 MFU | 125626 tok/s step 3435/19560 | loss 3.686004 (+0.43z)| norm 0.2747 (+0.41z)| lr 7.67e-04 | 4160.06 ms | 32.5% bf16 MFU | 125646 tok/s step 3436/19560 | loss 3.653316 (-0.42z)| norm 0.2427 (-0.96z)| lr 7.67e-04 | 4157.31 ms | 32.5% bf16 MFU | 125669 tok/s step 3437/19560 | loss 3.691658 (+0.57z)| norm 0.2526 (-0.54z)| lr 7.66e-04 | 4211.94 ms | 32.1% bf16 MFU | 125610 tok/s step 3438/19560 | loss 3.650008 (-0.51z)| norm 0.2606 (-0.19z)| lr 7.66e-04 | 4184.24 ms | 32.3% bf16 MFU | 125594 tok/s step 3439/19560 | loss 3.649540 (-0.51z)| norm 0.2958 (+1.33z)| lr 7.66e-04 | 4155.43 ms | 32.5% bf16 MFU | 125623 tok/s step 3440/19560 | loss 3.633493 (-0.94z)| norm 0.2854 (+0.87z)| lr 7.66e-04 | 4162.60 ms | 32.4% bf16 MFU | 125639 tok/s step 3441/19560 | loss 3.626696 (-1.10z)| norm 0.2734 (+0.35z)| lr 7.66e-04 | 4162.71 ms | 32.4% bf16 MFU | 125655 tok/s step 3442/19560 | loss 3.761000 (+2.38z)| norm 0.3035 (+1.62z)| lr 7.66e-04 | 4156.80 ms | 32.5% bf16 MFU | 125679 tok/s step 
3443/19560 | loss 3.601634 (-1.71z)| norm 0.2699 (+0.18z)| lr 7.66e-04 | 4183.26 ms | 32.3% bf16 MFU | 125661 tok/s step 3444/19560 | loss 3.728141 (+1.50z)| norm 0.2655 (-0.01z)| lr 7.66e-04 | 4193.79 ms | 32.2% bf16 MFU | 125629 tok/s step 3445/19560 | loss 3.635670 (-0.84z)| norm 0.2512 (-0.64z)| lr 7.66e-04 | 4156.99 ms | 32.5% bf16 MFU | 125654 tok/s step 3446/19560 | loss 3.737940 (+1.77z)| norm 0.2416 (-1.05z)| lr 7.66e-04 | 4159.98 ms | 32.5% bf16 MFU | 125672 tok/s step 3447/19560 | loss 3.657045 (-0.31z)| norm 0.3044 (+1.63z)| lr 7.66e-04 | 4216.58 ms | 32.0% bf16 MFU | 125606 tok/s step 3448/19560 | loss 3.669929 (+0.01z)| norm 0.2645 (-0.09z)| lr 7.66e-04 | 4156.05 ms | 32.5% bf16 MFU | 125633 tok/s step 3449/19560 | loss 3.711779 (+1.09z)| norm 0.2766 (+0.43z)| lr 7.66e-04 | 4171.17 ms | 32.4% bf16 MFU | 125636 tok/s step 3450/19560 | loss 3.661602 (-0.24z)| norm 0.3153 (+2.05z)| lr 7.66e-04 | 4168.69 ms | 32.4% bf16 MFU | 125643 tok/s step 3451/19560 | loss 3.672322 (+0.04z)| norm 0.3425 (+3.06z)| lr 7.66e-04 | 4159.26 ms | 32.5% bf16 MFU | 125663 tok/s step 3452/19560 | loss 3.660006 (-0.29z)| norm 0.3119 (+1.77z)| lr 7.66e-04 | 4160.53 ms | 32.5% bf16 MFU | 125681 tok/s step 3453/19560 | loss 3.691117 (+0.53z)| norm 0.3193 (+2.02z)| lr 7.66e-04 | 4164.23 ms | 32.4% bf16 MFU | 125692 tok/s step 3454/19560 | loss 3.627901 (-1.14z)| norm 0.3076 (+1.53z)| lr 7.66e-04 | 4158.08 ms | 32.5% bf16 MFU | 125712 tok/s step 3455/19560 | loss 3.672264 (+0.04z)| norm 0.2682 (-0.03z)| lr 7.66e-04 | 4161.33 ms | 32.4% bf16 MFU | 125726 tok/s step 3456/19560 | loss 3.667437 (-0.09z)| norm 0.2515 (-0.70z)| lr 7.66e-04 | 4161.05 ms | 32.4% bf16 MFU | 125739 tok/s step 3457/19560 | loss 3.658670 (-0.32z)| norm 0.2498 (-0.76z)| lr 7.66e-04 | 4158.20 ms | 32.5% bf16 MFU | 125757 tok/s step 3458/19560 | loss 3.664511 (-0.16z)| norm 0.2530 (-0.64z)| lr 7.66e-04 | 4167.65 ms | 32.4% bf16 MFU | 125759 tok/s step 3459/19560 | loss 3.648136 (-0.59z)| norm 0.2522 (-0.69z)| lr 
7.66e-04 | 4167.24 ms | 32.4% bf16 MFU | 125761 tok/s step 3460/19560 | loss 3.744243 (+1.93z)| norm 0.2407 (-1.15z)| lr 7.66e-04 | 4160.28 ms | 32.5% bf16 MFU | 125774 tok/s step 3461/19560 | loss 3.681647 (+0.28z)| norm 0.2515 (-0.73z)| lr 7.66e-04 | 4156.58 ms | 32.5% bf16 MFU | 125792 tok/s step 3462/19560 | loss 3.679151 (+0.21z)| norm 0.2676 (-0.07z)| lr 7.66e-04 | 4173.50 ms | 32.4% bf16 MFU | 125784 tok/s step 3463/19560 | loss 3.689940 (+0.49z)| norm 0.2436 (-1.02z)| lr 7.66e-04 | 4168.64 ms | 32.4% bf16 MFU | 125783 tok/s step 3464/19560 | loss 3.618155 (-1.39z)| norm 0.2317 (-1.48z)| lr 7.66e-04 | 4157.42 ms | 32.5% bf16 MFU | 125800 tok/s step 3465/19560 | loss 3.632404 (-1.00z)| norm 0.2758 (+0.28z)| lr 7.66e-04 | 4153.42 ms | 32.5% bf16 MFU | 125821 tok/s step 3466/19560 | loss 3.642364 (-0.72z)| norm 0.2491 (-0.77z)| lr 7.66e-04 | 4163.66 ms | 32.4% bf16 MFU | 125826 tok/s step 3467/19560 | loss 3.654381 (-0.41z)| norm 0.2732 (+0.20z)| lr 7.66e-04 | 4256.90 ms | 31.7% bf16 MFU | 125693 tok/s step 3468/19560 | loss 3.635652 (-0.89z)| norm 0.2866 (+0.74z)| lr 7.66e-04 | 4168.00 ms | 32.4% bf16 MFU | 125698 tok/s step 3469/19560 | loss 3.673904 (+0.12z)| norm 0.2809 (+0.54z)| lr 7.66e-04 | 4168.14 ms | 32.4% bf16 MFU | 125702 tok/s step 3470/19560 | loss 3.645595 (-0.63z)| norm 0.2464 (-0.87z)| lr 7.66e-04 | 4402.44 ms | 30.7% bf16 MFU | 125371 tok/s step 3471/19560 | loss 3.654690 (-0.39z)| norm 0.2349 (-1.33z)| lr 7.66e-04 | 4162.35 ms | 32.4% bf16 MFU | 125401 tok/s step 3472/19560 | loss 3.659907 (-0.25z)| norm 0.2651 (-0.08z)| lr 7.66e-04 | 4245.49 ms | 31.8% bf16 MFU | 125305 tok/s step 3473/19560 | loss 3.634431 (-0.91z)| norm 0.2519 (-0.62z)| lr 7.65e-04 | 4162.86 ms | 32.4% bf16 MFU | 125337 tok/s step 3474/19560 | loss 3.640271 (-0.75z)| norm 0.2573 (-0.39z)| lr 7.65e-04 | 4165.57 ms | 32.4% bf16 MFU | 125364 tok/s step 3475/19560 | loss 3.655366 (-0.36z)| norm 0.2528 (-0.59z)| lr 7.65e-04 | 4170.44 ms | 32.4% bf16 MFU | 125381 tok/s step 
3476/19560 | loss 3.774617 (+2.78z)| norm 0.2485 (-0.77z)| lr 7.65e-04 | 4159.97 ms | 32.5% bf16 MFU | 125414 tok/s step 3477/19560 | loss 3.662709 (-0.17z)| norm 0.2553 (-0.47z)| lr 7.65e-04 | 4162.81 ms | 32.4% bf16 MFU | 125440 tok/s step 3478/19560 | loss 3.669441 (+0.00z)| norm 0.2462 (-0.85z)| lr 7.65e-04 | 4163.63 ms | 32.4% bf16 MFU | 125464 tok/s step 3479/19560 | loss 3.667836 (-0.03z)| norm 0.2491 (-0.72z)| lr 7.65e-04 | 4223.71 ms | 32.0% bf16 MFU | 125398 tok/s step 3480/19560 | loss 3.687419 (+0.49z)| norm 0.2587 (-0.31z)| lr 7.65e-04 | 4173.84 ms | 32.3% bf16 MFU | 125408 tok/s step 3481/19560 | loss 3.768901 (+2.56z)| norm 0.2758 (+0.39z)| lr 7.65e-04 | 4173.23 ms | 32.4% bf16 MFU | 125419 tok/s step 3482/19560 | loss 3.656247 (-0.36z)| norm 0.2715 (+0.21z)| lr 7.65e-04 | 4160.45 ms | 32.5% bf16 MFU | 125449 tok/s step 3483/19560 | loss 3.722678 (+1.34z)| norm 0.2702 (+0.16z)| lr 7.65e-04 | 4174.26 ms | 32.3% bf16 MFU | 125457 tok/s step 3484/19560 | loss 3.641232 (-0.75z)| norm 0.2421 (-1.02z)| lr 7.65e-04 | 4164.09 ms | 32.4% bf16 MFU | 125479 tok/s step 3485/19560 | loss 3.685351 (+0.39z)| norm 0.2512 (-0.62z)| lr 7.65e-04 | 4158.69 ms | 32.5% bf16 MFU | 125509 tok/s step 3486/19560 | loss 3.624693 (-1.17z)| norm 0.2611 (-0.16z)| lr 7.65e-04 | 4165.24 ms | 32.4% bf16 MFU | 125527 tok/s step 3487/19560 | loss 3.651440 (-0.47z)| norm 0.2649 (+0.03z)| lr 7.65e-04 | 4161.45 ms | 32.4% bf16 MFU | 125550 tok/s step 3488/19560 | loss 3.648589 (-0.54z)| norm 0.2486 (-0.73z)| lr 7.65e-04 | 4163.41 ms | 32.4% bf16 MFU | 125569 tok/s step 3489/19560 | loss 3.676332 (+0.20z)| norm 0.2733 (+0.45z)| lr 7.65e-04 | 4170.41 ms | 32.4% bf16 MFU | 125576 tok/s step 3490/19560 | loss 3.714969 (+1.22z)| norm 0.2487 (-0.72z)| lr 7.65e-04 | 4163.88 ms | 32.4% bf16 MFU | 125593 tok/s step 3491/19560 | loss 3.682744 (+0.37z)| norm 0.2490 (-0.70z)| lr 7.65e-04 | 4166.08 ms | 32.4% bf16 MFU | 125606 tok/s step 3492/19560 | loss 3.705971 (+0.97z)| norm 0.2712 (+0.35z)| lr 
7.65e-04 | 4156.97 ms | 32.5% bf16 MFU | 125632 tok/s step 3493/19560 | loss 3.663003 (-0.16z)| norm 0.2933 (+1.39z)| lr 7.65e-04 | 4161.45 ms | 32.4% bf16 MFU | 125649 tok/s step 3494/19560 | loss 3.661522 (-0.20z)| norm 0.2559 (-0.37z)| lr 7.65e-04 | 4168.77 ms | 32.4% bf16 MFU | 125655 tok/s step 3495/19560 | loss 3.748374 (+2.05z)| norm 0.2721 (+0.41z)| lr 7.65e-04 | 4159.91 ms | 32.5% bf16 MFU | 125674 tok/s step 3496/19560 | loss 3.695254 (+0.66z)| norm 0.2627 (-0.05z)| lr 7.65e-04 | 4171.74 ms | 32.4% bf16 MFU | 125674 tok/s step 3497/19560 | loss 3.649348 (-0.52z)| norm 0.2653 (+0.08z)| lr 7.65e-04 | 4172.80 ms | 32.4% bf16 MFU | 125673 tok/s step 3498/19560 | loss 3.655928 (-0.34z)| norm 0.2614 (-0.11z)| lr 7.65e-04 | 4164.42 ms | 32.4% bf16 MFU | 125684 tok/s step 3499/19560 | loss 3.673793 (+0.13z)| norm 0.2696 (+0.27z)| lr 7.65e-04 | 4162.00 ms | 32.4% bf16 MFU | 125698 tok/s step 3500/19560 | loss 3.582816 (-2.24z)| norm 0.2801 (+0.77z)| lr 7.65e-04 | 4174.49 ms | 32.3% bf16 MFU | 125693 tok/s val loss 3.632842 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 
evaluating
HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2677/10042 = 0.266580 step 3501/19560 | loss 3.601520 (-1.72z)| norm 0.2483 (-0.75z)| lr 7.65e-04 | 4345.06 ms | 31.1% bf16 MFU | 125442 tok/s step 3502/19560 | loss 3.700866 (+0.89z)| norm 0.2508 (-0.62z)| lr 7.65e-04 | 4219.24 ms | 32.0% bf16 MFU | 125382 tok/s step 3503/19560 | loss 3.663654 (-0.09z)| norm 0.2450 (-0.89z)| lr 7.65e-04 | 4207.54 ms | 32.1% bf16 MFU | 125344 tok/s step 3504/19560 | loss 3.690223 (+0.61z)| norm 0.2546 (-0.43z)| lr 7.65e-04 | 4161.50 ms | 32.4% bf16 MFU | 125376 tok/s step 3505/19560 | loss 3.675689 (+0.22z)| norm 0.2555 (-0.38z)| lr 7.65e-04 | 4161.91 ms | 32.4% bf16 MFU | 125406 tok/s step 3506/19560 | loss 3.653759 (-0.36z)| norm 0.2718 (+0.38z)| lr 7.65e-04 | 4158.27 ms | 32.5% bf16 MFU | 125440 tok/s step 3507/19560 | loss 3.640893 (-0.69z)| norm 0.2471 (-0.80z)| lr 7.65e-04 | 4162.09 ms | 32.4% bf16 MFU | 125466 tok/s step 3508/19560 | loss 3.610853 (-1.47z)| norm 0.2307 (-1.57z)| lr 7.65e-04 | 4154.12 ms | 32.5% bf16 MFU | 125503 tok/s step 3509/19560 | loss 
3.659291 (-0.19z)| norm 0.2980 (+1.61z)| lr 7.64e-04 | 4235.80 ms | 31.9% bf16 MFU | 125417 tok/s step 3510/19560 | loss 3.685209 (+0.50z)| norm 0.3514 (+3.87z)| lr 7.64e-04 | 4219.73 ms | 32.0% bf16 MFU | 125358 tok/s step 3511/19560 | loss 3.655613 (-0.28z)| norm 0.2800 (+0.68z)| lr 7.64e-04 | 4162.67 ms | 32.4% bf16 MFU | 125388 tok/s step 3512/19560 | loss 3.645179 (-0.56z)| norm 0.3022 (+1.67z)| lr 7.64e-04 | 4308.11 ms | 31.3% bf16 MFU | 125203 tok/s step 3513/19560 | loss 3.649293 (-0.44z)| norm 0.2642 (+0.01z)| lr 7.64e-04 | 4171.68 ms | 32.4% bf16 MFU | 125227 tok/s step 3514/19560 | loss 3.676862 (+0.28z)| norm 0.2667 (+0.15z)| lr 7.64e-04 | 4155.68 ms | 32.5% bf16 MFU | 125274 tok/s step 3515/19560 | loss 3.634821 (-0.84z)| norm 0.2449 (-0.88z)| lr 7.64e-04 | 4158.37 ms | 32.5% bf16 MFU | 125314 tok/s step 3516/19560 | loss 3.622063 (-1.17z)| norm 0.2673 (+0.18z)| lr 7.64e-04 | 4163.56 ms | 32.4% bf16 MFU | 125345 tok/s step 3517/19560 | loss 3.721359 (+1.50z)| norm 0.2243 (-1.84z)| lr 7.64e-04 | 4282.91 ms | 31.5% bf16 MFU | 125198 tok/s step 3518/19560 | loss 3.685611 (+0.53z)| norm 0.2570 (-0.30z)| lr 7.64e-04 | 4159.70 ms | 32.5% bf16 MFU | 125240 tok/s step 3519/19560 | loss 3.664690 (-0.04z)| norm 0.2541 (-0.44z)| lr 7.64e-04 | 4350.32 ms | 31.0% bf16 MFU | 125004 tok/s step 3520/19560 | loss 3.707661 (+1.11z)| norm 0.2614 (-0.09z)| lr 7.64e-04 | 4158.71 ms | 32.5% bf16 MFU | 125057 tok/s step 3521/19560 | loss 3.671158 (+0.12z)| norm 0.2435 (-0.92z)| lr 7.64e-04 | 4161.70 ms | 32.4% bf16 MFU | 125103 tok/s step 3522/19560 | loss 3.668901 (+0.06z)| norm 0.2769 (+0.63z)| lr 7.64e-04 | 4260.30 ms | 31.7% bf16 MFU | 125001 tok/s step 3523/19560 | loss 3.690661 (+0.65z)| norm 0.3256 (+2.81z)| lr 7.64e-04 | 4213.77 ms | 32.0% bf16 MFU | 124972 tok/s step 3524/19560 | loss 3.687541 (+0.56z)| norm 0.2736 (+0.42z)| lr 7.64e-04 | 4158.09 ms | 32.5% bf16 MFU | 125028 tok/s step 3525/19560 | loss 3.678115 (+0.32z)| norm 0.2525 (-0.53z)| lr 7.64e-04 | 4195.39 
ms | 32.2% bf16 MFU | 125025 tok/s step 3526/19560 | loss 3.747065 (+2.13z)| norm 0.2639 (-0.02z)| lr 7.64e-04 | 4155.02 ms | 32.5% bf16 MFU | 125083 tok/s step 3527/19560 | loss 3.775681 (+2.79z)| norm 0.3193 (+2.45z)| lr 7.64e-04 | 4197.71 ms | 32.2% bf16 MFU | 125074 tok/s step 3528/19560 | loss 3.709034 (+1.05z)| norm 0.3343 (+2.99z)| lr 7.64e-04 | 4164.82 ms | 32.4% bf16 MFU | 125114 tok/s step 3529/19560 | loss 3.721927 (+1.36z)| norm 0.2991 (+1.44z)| lr 7.64e-04 | 4230.15 ms | 31.9% bf16 MFU | 125056 tok/s step 3530/19560 | loss 3.629524 (-1.03z)| norm 0.3260 (+2.51z)| lr 7.64e-04 | 4222.78 ms | 32.0% bf16 MFU | 125011 tok/s step 3531/19560 | loss 3.676629 (+0.18z)| norm 0.3717 (+4.10z)| lr 7.64e-04 | 4160.03 ms | 32.5% bf16 MFU | 125062 tok/s step 3532/19560 | loss 3.674015 (+0.11z)| norm 0.2861 (+0.73z)| lr 7.64e-04 | 4234.56 ms | 31.9% bf16 MFU | 124999 tok/s step 3533/19560 | loss 3.732309 (+1.59z)| norm 0.2828 (+0.60z)| lr 7.64e-04 | 4159.27 ms | 32.5% bf16 MFU | 125052 tok/s step 3534/19560 | loss 3.636755 (-0.86z)| norm 0.2798 (+0.48z)| lr 7.64e-04 | 4209.25 ms | 32.1% bf16 MFU | 125027 tok/s step 3535/19560 | loss 3.641606 (-0.72z)| norm 0.2816 (+0.54z)| lr 7.64e-04 | 4160.04 ms | 32.5% bf16 MFU | 125077 tok/s step 3536/19560 | loss 3.716210 (+1.18z)| norm 0.2404 (-1.06z)| lr 7.64e-04 | 4564.59 ms | 29.6% bf16 MFU | 124566 tok/s step 3537/19560 | loss 3.614052 (-1.43z)| norm 0.3046 (+1.42z)| lr 7.64e-04 | 4158.26 ms | 32.5% bf16 MFU | 124642 tok/s step 3538/19560 | loss 3.618658 (-1.30z)| norm 0.2967 (+1.10z)| lr 7.64e-04 | 4362.51 ms | 30.9% bf16 MFU | 124419 tok/s step 3539/19560 | loss 3.627324 (-1.07z)| norm 0.2517 (-0.66z)| lr 7.64e-04 | 4689.66 ms | 28.8% bf16 MFU | 123788 tok/s step 3540/19560 | loss 3.686476 (+0.46z)| norm 0.2636 (-0.20z)| lr 7.64e-04 | 4484.59 ms | 30.1% bf16 MFU | 123444 tok/s step 3541/19560 | loss 3.739687 (+1.81z)| norm 0.2857 (+0.65z)| lr 7.64e-04 | 4444.80 ms | 30.4% bf16 MFU | 123170 tok/s step 3542/19560 | loss 
3.677025 (+0.19z)| norm 0.2707 (+0.07z)| lr 7.64e-04 | 4436.91 ms | 30.4% bf16 MFU | 122919 tok/s step 3543/19560 | loss 3.652850 (-0.42z)| norm 0.2672 (-0.07z)| lr 7.64e-04 | 4183.46 ms | 32.3% bf16 MFU | 123040 tok/s step 3544/19560 | loss 3.594711 (-1.89z)| norm 0.2539 (-0.58z)| lr 7.64e-04 | 4259.06 ms | 31.7% bf16 MFU | 123043 tok/s step 3545/19560 | loss 3.644398 (-0.61z)| norm 0.2240 (-1.72z)| lr 7.63e-04 | 4272.40 ms | 31.6% bf16 MFU | 123026 tok/s step 3546/19560 | loss 3.657112 (-0.28z)| norm 0.2520 (-0.63z)| lr 7.63e-04 | 4262.66 ms | 31.7% bf16 MFU | 123025 tok/s step 3547/19560 | loss 3.670080 (+0.04z)| norm 0.2367 (-1.21z)| lr 7.63e-04 | 4255.48 ms | 31.7% bf16 MFU | 123034 tok/s step 3548/19560 | loss 3.589648 (-2.01z)| norm 0.2384 (-1.14z)| lr 7.63e-04 | 4171.35 ms | 32.4% bf16 MFU | 123166 tok/s step 3549/19560 | loss 3.663249 (-0.11z)| norm 0.2450 (-0.88z)| lr 7.63e-04 | 4422.00 ms | 30.5% bf16 MFU | 122936 tok/s step 3550/19560 | loss 3.663026 (-0.12z)| norm 0.2338 (-1.29z)| lr 7.63e-04 | 4311.20 ms | 31.3% bf16 MFU | 122870 tok/s step 3551/19560 | loss 3.636844 (-0.81z)| norm 0.2423 (-0.97z)| lr 7.63e-04 | 4200.37 ms | 32.1% bf16 MFU | 122967 tok/s step 3552/19560 | loss 3.643640 (-0.63z)| norm 0.2372 (-1.16z)| lr 7.63e-04 | 4165.00 ms | 32.4% bf16 MFU | 123113 tok/s step 3553/19560 | loss 3.735568 (+1.73z)| norm 0.2600 (-0.29z)| lr 7.63e-04 | 4201.99 ms | 32.1% bf16 MFU | 123196 tok/s step 3554/19560 | loss 3.651307 (-0.43z)| norm 0.2724 (+0.17z)| lr 7.63e-04 | 4235.89 ms | 31.9% bf16 MFU | 123225 tok/s step 3555/19560 | loss 3.625798 (-1.10z)| norm 0.2795 (+0.44z)| lr 7.63e-04 | 4337.37 ms | 31.1% bf16 MFU | 123107 tok/s step 3556/19560 | loss 3.617908 (-1.29z)| norm 0.2704 (+0.08z)| lr 7.63e-04 | 4162.76 ms | 32.4% bf16 MFU | 123249 tok/s step 3557/19560 | loss 3.702063 (+0.89z)| norm 0.2422 (-1.00z)| lr 7.63e-04 | 4162.86 ms | 32.4% bf16 MFU | 123384 tok/s step 3558/19560 | loss 3.663595 (-0.11z)| norm 0.2866 (+0.69z)| lr 7.63e-04 | 4243.73 
ms | 31.8% bf16 MFU | 123392 tok/s step 3559/19560 | loss 3.665278 (-0.07z)| norm 0.2861 (+0.67z)| lr 7.63e-04 | 4166.93 ms | 32.4% bf16 MFU | 123514 tok/s step 3560/19560 | loss 3.632508 (-0.92z)| norm 0.2962 (+1.05z)| lr 7.63e-04 | 4182.08 ms | 32.3% bf16 MFU | 123606 tok/s step 3561/19560 | loss 3.657297 (-0.28z)| norm 0.2917 (+0.88z)| lr 7.63e-04 | 4173.63 ms | 32.4% bf16 MFU | 123707 tok/s step 3562/19560 | loss 3.636565 (-0.82z)| norm 0.3002 (+1.19z)| lr 7.63e-04 | 4171.46 ms | 32.4% bf16 MFU | 123806 tok/s step 3563/19560 | loss 3.644546 (-0.60z)| norm 0.2569 (-0.45z)| lr 7.63e-04 | 4175.32 ms | 32.3% bf16 MFU | 123894 tok/s step 3564/19560 | loss 3.649485 (-0.47z)| norm 0.2345 (-1.29z)| lr 7.63e-04 | 4176.26 ms | 32.3% bf16 MFU | 123976 tok/s step 3565/19560 | loss 3.639840 (-0.72z)| norm 0.2455 (-0.87z)| lr 7.63e-04 | 4219.18 ms | 32.0% bf16 MFU | 123990 tok/s step 3566/19560 | loss 3.683357 (+0.42z)| norm 0.2642 (-0.17z)| lr 7.63e-04 | 4175.32 ms | 32.3% bf16 MFU | 124069 tok/s step 3567/19560 | loss 3.714100 (+1.20z)| norm 0.2677 (-0.03z)| lr 7.63e-04 | 4167.47 ms | 32.4% bf16 MFU | 124156 tok/s step 3568/19560 | loss 3.601932 (-1.70z)| norm 0.2547 (-0.51z)| lr 7.63e-04 | 4186.97 ms | 32.2% bf16 MFU | 124209 tok/s step 3569/19560 | loss 3.623564 (-1.14z)| norm 0.2425 (-0.96z)| lr 7.63e-04 | 4167.29 ms | 32.4% bf16 MFU | 124289 tok/s step 3570/19560 | loss 3.713187 (+1.20z)| norm 0.2799 (+0.46z)| lr 7.63e-04 | 4164.30 ms | 32.4% bf16 MFU | 124370 tok/s step 3571/19560 | loss 3.599960 (-1.77z)| norm 0.2935 (+0.97z)| lr 7.63e-04 | 4173.73 ms | 32.3% bf16 MFU | 124432 tok/s step 3572/19560 | loss 3.608908 (-1.51z)| norm 0.2713 (+0.13z)| lr 7.63e-04 | 4368.68 ms | 30.9% bf16 MFU | 124211 tok/s step 3573/19560 | loss 3.711292 (+1.16z)| norm 0.2589 (-0.35z)| lr 7.63e-04 | 4164.51 ms | 32.4% bf16 MFU | 124295 tok/s step 3574/19560 | loss 3.553005 (-2.89z)| norm 0.3067 (+1.44z)| lr 7.63e-04 | 4241.84 ms | 31.8% bf16 MFU | 124260 tok/s step 3575/19560 | loss 
3.631926 (-0.85z)| norm 0.2792 (+0.41z)| lr 7.63e-04 | 4164.49 ms | 32.4% bf16 MFU | 124342 tok/s step 3576/19560 | loss 3.630028 (-0.89z)| norm 0.2635 (-0.18z)| lr 7.63e-04 | 4165.45 ms | 32.4% bf16 MFU | 124418 tok/s step 3577/19560 | loss 3.637176 (-0.70z)| norm 0.2380 (-1.13z)| lr 7.63e-04 | 4172.93 ms | 32.4% bf16 MFU | 124479 tok/s step 3578/19560 | loss 3.631395 (-0.84z)| norm 0.2400 (-1.04z)| lr 7.63e-04 | 4183.73 ms | 32.3% bf16 MFU | 124521 tok/s step 3579/19560 | loss 3.637513 (-0.67z)| norm 0.2388 (-1.09z)| lr 7.63e-04 | 4175.07 ms | 32.3% bf16 MFU | 124574 tok/s step 3580/19560 | loss 3.603436 (-1.52z)| norm 0.2370 (-1.15z)| lr 7.62e-04 | 4171.43 ms | 32.4% bf16 MFU | 124630 tok/s step 3581/19560 | loss 3.672956 (+0.24z)| norm 0.2451 (-0.82z)| lr 7.62e-04 | 4177.82 ms | 32.3% bf16 MFU | 124673 tok/s step 3582/19560 | loss 3.676033 (+0.31z)| norm 0.2427 (-0.90z)| lr 7.62e-04 | 4182.76 ms | 32.3% bf16 MFU | 124706 tok/s step 3583/19560 | loss 3.681975 (+0.46z)| norm 0.2458 (-0.77z)| lr 7.62e-04 | 4183.90 ms | 32.3% bf16 MFU | 124737 tok/s step 3584/19560 | loss 3.644666 (-0.48z)| norm 0.2589 (-0.24z)| lr 7.62e-04 | 4200.46 ms | 32.1% bf16 MFU | 124741 tok/s step 3585/19560 | loss 3.654682 (-0.23z)| norm 0.2382 (-1.07z)| lr 7.62e-04 | 4161.42 ms | 32.4% bf16 MFU | 124803 tok/s step 3586/19560 | loss 3.628271 (-0.88z)| norm 0.2518 (-0.52z)| lr 7.62e-04 | 4547.05 ms | 29.7% bf16 MFU | 124328 tok/s step 3587/19560 | loss 3.698925 (+0.89z)| norm 0.2518 (-0.52z)| lr 7.62e-04 | 4167.82 ms | 32.4% bf16 MFU | 124401 tok/s step 3588/19560 | loss 3.573098 (-2.24z)| norm 0.2658 (+0.04z)| lr 7.62e-04 | 4331.88 ms | 31.2% bf16 MFU | 124233 tok/s step 3589/19560 | loss 3.668658 (+0.16z)| norm 0.2758 (+0.43z)| lr 7.62e-04 | 4164.22 ms | 32.4% bf16 MFU | 124316 tok/s step 3590/19560 | loss 3.668042 (+0.15z)| norm 0.2909 (+1.03z)| lr 7.62e-04 | 4162.76 ms | 32.4% bf16 MFU | 124398 tok/s step 3591/19560 | loss 3.617911 (-1.10z)| norm 0.2743 (+0.35z)| lr 7.62e-04 | 4236.16 
ms | 31.9% bf16 MFU | 124366 tok/s step 3592/19560 | loss 3.668903 (+0.17z)| norm 0.2722 (+0.26z)| lr 7.62e-04 | 4166.44 ms | 32.4% bf16 MFU | 124440 tok/s step 3593/19560 | loss 3.669334 (+0.18z)| norm 0.2919 (+1.05z)| lr 7.62e-04 | 4165.12 ms | 32.4% bf16 MFU | 124511 tok/s step 3594/19560 | loss 3.644549 (-0.45z)| norm 0.2974 (+1.25z)| lr 7.62e-04 | 4208.73 ms | 32.1% bf16 MFU | 124514 tok/s step 3595/19560 | loss 3.677181 (+0.37z)| norm 0.2400 (-1.05z)| lr 7.62e-04 | 4254.86 ms | 31.7% bf16 MFU | 124450 tok/s step 3596/19560 | loss 3.676973 (+0.36z)| norm 0.2582 (-0.31z)| lr 7.62e-04 | 4278.11 ms | 31.6% bf16 MFU | 124355 tok/s step 3597/19560 | loss 3.625687 (-0.93z)| norm 0.2663 (+0.02z)| lr 7.62e-04 | 4163.63 ms | 32.4% bf16 MFU | 124433 tok/s step 3598/19560 | loss 3.622811 (-0.99z)| norm 0.2369 (-1.16z)| lr 7.62e-04 | 4171.44 ms | 32.4% bf16 MFU | 124496 tok/s step 3599/19560 | loss 3.669856 (+0.19z)| norm 0.2583 (-0.30z)| lr 7.62e-04 | 4354.96 ms | 31.0% bf16 MFU | 124290 tok/s step 3600/19560 | loss 3.554741 (-2.61z)| norm 0.5258 (+7.66z)| lr 7.62e-04 | 4266.55 ms | 31.6% bf16 MFU | 124220 tok/s step 3601/19560 | loss 3.693805 (+0.78z)| norm 0.8713 (+9.52z)| lr 7.62e-04 | 4163.73 ms | 32.4% bf16 MFU | 124305 tok/s step 3602/19560 | loss 3.736392 (+1.78z)| norm 0.5737 (+4.38z)| lr 7.62e-04 | 4370.04 ms | 30.9% bf16 MFU | 124088 tok/s step 3603/19560 | loss 3.633273 (-0.71z)| norm 0.4420 (+2.37z)| lr 7.62e-04 | 4213.07 ms | 32.0% bf16 MFU | 124106 tok/s step 3604/19560 | loss 3.685602 (+0.59z)| norm 0.3463 (+0.98z)| lr 7.62e-04 | 4184.64 ms | 32.3% bf16 MFU | 124165 tok/s step 3605/19560 | loss 3.660715 (-0.03z)| norm 0.3892 (+1.57z)| lr 7.62e-04 | 4165.84 ms | 32.4% bf16 MFU | 124250 tok/s step 3606/19560 | loss 3.600068 (-1.50z)| norm 0.3112 (+0.46z)| lr 7.62e-04 | 4211.13 ms | 32.1% bf16 MFU | 124262 tok/s step 3607/19560 | loss 3.663085 (+0.04z)| norm 0.2835 (+0.06z)| lr 7.62e-04 | 4201.04 ms | 32.1% bf16 MFU | 124289 tok/s step 3608/19560 | loss 
3.588489 (-1.75z)| norm 0.2622 (-0.24z)| lr 7.62e-04 | 4161.35 ms | 32.4% bf16 MFU | 124374 tok/s step 3609/19560 | loss 3.665838 (+0.15z)| norm 0.2973 (+0.25z)| lr 7.62e-04 | 4159.64 ms | 32.5% bf16 MFU | 124457 tok/s step 3610/19560 | loss 3.625409 (-0.85z)| norm 0.2715 (-0.11z)| lr 7.62e-04 | 4429.57 ms | 30.5% bf16 MFU | 124153 tok/s step 3611/19560 | loss 3.722656 (+1.57z)| norm 0.2334 (-0.65z)| lr 7.62e-04 | 4177.76 ms | 32.3% bf16 MFU | 124220 tok/s step 3612/19560 | loss 3.619897 (-0.98z)| norm 0.2447 (-0.49z)| lr 7.62e-04 | 4161.66 ms | 32.4% bf16 MFU | 124308 tok/s step 3613/19560 | loss 3.660944 (+0.04z)| norm 0.2700 (-0.13z)| lr 7.62e-04 | 4203.24 ms | 32.1% bf16 MFU | 124329 tok/s step 3614/19560 | loss 3.650429 (-0.22z)| norm 0.2821 (+0.04z)| lr 7.62e-04 | 4160.21 ms | 32.5% bf16 MFU | 124414 tok/s step 3615/19560 | loss 3.605302 (-1.33z)| norm 0.2442 (-0.50z)| lr 7.61e-04 | 4160.68 ms | 32.5% bf16 MFU | 124494 tok/s step 3616/19560 | loss 3.652471 (-0.16z)| norm 0.2491 (-0.43z)| lr 7.61e-04 | 4266.22 ms | 31.6% bf16 MFU | 124414 tok/s step 3617/19560 | loss 3.637982 (-0.51z)| norm 0.2599 (-0.27z)| lr 7.61e-04 | 4165.59 ms | 32.4% bf16 MFU | 124486 tok/s step 3618/19560 | loss 3.614556 (-1.08z)| norm 0.2440 (-0.50z)| lr 7.61e-04 | 4186.67 ms | 32.2% bf16 MFU | 124523 tok/s step 3619/19560 | loss 3.620165 (-0.92z)| norm 0.2649 (-0.20z)| lr 7.61e-04 | 4176.71 ms | 32.3% bf16 MFU | 124573 tok/s step 3620/19560 | loss 3.668118 (+0.27z)| norm 0.2328 (-0.65z)| lr 7.61e-04 | 4272.35 ms | 31.6% bf16 MFU | 124480 tok/s step 3621/19560 | loss 3.670025 (+0.32z)| norm 0.2404 (-0.54z)| lr 7.61e-04 | 4180.03 ms | 32.3% bf16 MFU | 124528 tok/s step 3622/19560 | loss 3.598101 (-1.45z)| norm 0.2146 (-0.90z)| lr 7.61e-04 | 4165.90 ms | 32.4% bf16 MFU | 124594 tok/s step 3623/19560 | loss 3.677384 (+0.54z)| norm 0.2329 (-0.63z)| lr 7.61e-04 | 4170.35 ms | 32.4% bf16 MFU | 124650 tok/s step 3624/19560 | loss 3.660204 (+0.11z)| norm 0.2437 (-0.48z)| lr 7.61e-04 | 4156.71 
ms | 32.5% bf16 MFU | 124724 tok/s step 3625/19560 | loss 3.671043 (+0.38z)| norm 0.2420 (-0.50z)| lr 7.61e-04 | 4166.86 ms | 32.4% bf16 MFU | 124779 tok/s step 3626/19560 | loss 3.654167 (-0.05z)| norm 0.2317 (-0.64z)| lr 7.61e-04 | 4175.85 ms | 32.3% bf16 MFU | 124818 tok/s step 3627/19560 | loss 3.611435 (-1.11z)| norm 0.2331 (-0.62z)| lr 7.61e-04 | 4169.58 ms | 32.4% bf16 MFU | 124864 tok/s step 3628/19560 | loss 3.656479 (+0.01z)| norm 0.2280 (-0.68z)| lr 7.61e-04 | 4326.87 ms | 31.2% bf16 MFU | 124679 tok/s step 3629/19560 | loss 3.648791 (-0.20z)| norm 0.2387 (-0.53z)| lr 7.61e-04 | 4158.12 ms | 32.5% bf16 MFU | 124750 tok/s step 3630/19560 | loss 3.584290 (-1.82z)| norm 0.2319 (-0.62z)| lr 7.61e-04 | 4160.04 ms | 32.5% bf16 MFU | 124814 tok/s step 3631/19560 | loss 3.604665 (-1.28z)| norm 0.2542 (-0.31z)| lr 7.61e-04 | 4261.25 ms | 31.7% bf16 MFU | 124725 tok/s step 3632/19560 | loss 3.642417 (-0.31z)| norm 0.2456 (-0.43z)| lr 7.61e-04 | 4167.62 ms | 32.4% bf16 MFU | 124779 tok/s step 3633/19560 | loss 3.668902 (+0.36z)| norm 0.2292 (-0.66z)| lr 7.61e-04 | 4168.14 ms | 32.4% bf16 MFU | 124829 tok/s step 3634/19560 | loss 3.672914 (+0.46z)| norm 0.2495 (-0.37z)| lr 7.61e-04 | 4192.70 ms | 32.2% bf16 MFU | 124840 tok/s step 3635/19560 | loss 3.600568 (-1.36z)| norm 0.2217 (-0.75z)| lr 7.61e-04 | 4180.28 ms | 32.3% bf16 MFU | 124869 tok/s step 3636/19560 | loss 3.635683 (-0.48z)| norm 0.2413 (-0.48z)| lr 7.61e-04 | 4161.65 ms | 32.4% bf16 MFU | 124924 tok/s step 3637/19560 | loss 3.680905 (+0.66z)| norm 0.2406 (-0.49z)| lr 7.61e-04 | 4166.92 ms | 32.4% bf16 MFU | 124969 tok/s step 3638/19560 | loss 3.635186 (-0.49z)| norm 0.2267 (-0.67z)| lr 7.61e-04 | 4161.52 ms | 32.4% bf16 MFU | 125020 tok/s step 3639/19560 | loss 3.612826 (-1.04z)| norm 0.2520 (-0.31z)| lr 7.61e-04 | 4161.05 ms | 32.4% bf16 MFU | 125069 tok/s step 3640/19560 | loss 3.650103 (-0.10z)| norm 0.2634 (-0.15z)| lr 7.61e-04 | 4177.70 ms | 32.3% bf16 MFU | 125090 tok/s step 3641/19560 | loss 
3.639260 (-0.37z)| norm 0.2707 (-0.05z)| lr 7.61e-04 | 4162.87 ms | 32.4% bf16 MFU | 125133 tok/s step 3642/19560 | loss 3.622991 (-0.77z)| norm 0.2925 (+0.25z)| lr 7.61e-04 | 4163.97 ms | 32.4% bf16 MFU | 125172 tok/s step 3643/19560 | loss 3.643801 (-0.25z)| norm 0.2517 (-0.32z)| lr 7.61e-04 | 4226.55 ms | 31.9% bf16 MFU | 125116 tok/s step 3644/19560 | loss 3.584882 (-1.71z)| norm 0.2594 (-0.21z)| lr 7.61e-04 | 4204.31 ms | 32.1% bf16 MFU | 125095 tok/s step 3645/19560 | loss 3.645159 (-0.20z)| norm 0.2612 (-0.19z)| lr 7.61e-04 | 4169.36 ms | 32.4% bf16 MFU | 125128 tok/s step 3646/19560 | loss 3.656337 (+0.09z)| norm 0.2367 (-0.53z)| lr 7.61e-04 | 4179.99 ms | 32.3% bf16 MFU | 125143 tok/s step 3647/19560 | loss 3.552009 (-2.47z)| norm 0.2561 (-0.26z)| lr 7.61e-04 | 4158.13 ms | 32.5% bf16 MFU | 125190 tok/s step 3648/19560 | loss 3.668268 (+0.42z)| norm 0.2773 (+0.04z)| lr 7.61e-04 | 4172.93 ms | 32.4% bf16 MFU | 125212 tok/s step 3649/19560 | loss 3.631421 (-0.49z)| norm 0.2997 (+0.34z)| lr 7.60e-04 | 4161.10 ms | 32.4% bf16 MFU | 125252 tok/s step 3650/19560 | loss 3.609034 (-1.03z)| norm 0.2916 (+0.23z)| lr 7.60e-04 | 4380.46 ms | 30.8% bf16 MFU | 124973 tok/s step 3651/19560 | loss 3.669730 (+0.48z)| norm 0.2575 (-0.24z)| lr 7.60e-04 | 4164.67 ms | 32.4% bf16 MFU | 125019 tok/s step 3652/19560 | loss 3.653754 (+0.09z)| norm 0.2583 (-0.23z)| lr 7.60e-04 | 4167.63 ms | 32.4% bf16 MFU | 125058 tok/s step 3653/19560 | loss 3.613904 (-0.89z)| norm 0.2840 (+0.13z)| lr 7.60e-04 | 4162.30 ms | 32.4% bf16 MFU | 125103 tok/s step 3654/19560 | loss 3.610949 (-0.96z)| norm 0.2651 (-0.13z)| lr 7.60e-04 | 4165.05 ms | 32.4% bf16 MFU | 125142 tok/s step 3655/19560 | loss 3.649042 (+0.04z)| norm 0.2782 (+0.05z)| lr 7.60e-04 | 4166.41 ms | 32.4% bf16 MFU | 125177 tok/s step 3656/19560 | loss 3.631205 (-0.43z)| norm 0.2609 (-0.18z)| lr 7.60e-04 | 4167.48 ms | 32.4% bf16 MFU | 125208 tok/s step 3657/19560 | loss 3.620862 (-0.69z)| norm 0.2399 (-0.47z)| lr 7.60e-04 | 4160.23 
ms | 32.5% bf16 MFU | 125249 tok/s step 3658/19560 | loss 3.616316 (-0.81z)| norm 0.2446 (-0.40z)| lr 7.60e-04 | 4180.49 ms | 32.3% bf16 MFU | 125257 tok/s step 3659/19560 | loss 3.543459 (-2.69z)| norm 0.2533 (-0.26z)| lr 7.60e-04 | 4159.28 ms | 32.5% bf16 MFU | 125297 tok/s step 3660/19560 | loss 3.647452 (+0.07z)| norm 0.2544 (-0.24z)| lr 7.60e-04 | 4166.74 ms | 32.4% bf16 MFU | 125324 tok/s step 3661/19560 | loss 3.622948 (-0.57z)| norm 0.2675 (-0.06z)| lr 7.60e-04 | 4160.70 ms | 32.5% bf16 MFU | 125358 tok/s step 3662/19560 | loss 3.678394 (+0.92z)| norm 0.2641 (-0.10z)| lr 7.60e-04 | 4169.87 ms | 32.4% bf16 MFU | 125377 tok/s step 3663/19560 | loss 3.686250 (+1.11z)| norm 0.2621 (-0.13z)| lr 7.60e-04 | 4159.71 ms | 32.5% bf16 MFU | 125410 tok/s step 3664/19560 | loss 3.626042 (-0.49z)| norm 0.3011 (+0.42z)| lr 7.60e-04 | 4193.27 ms | 32.2% bf16 MFU | 125391 tok/s step 3665/19560 | loss 3.673704 (+0.79z)| norm 0.3313 (+0.84z)| lr 7.60e-04 | 4164.09 ms | 32.4% bf16 MFU | 125417 tok/s step 3666/19560 | loss 3.725777 (+2.15z)| norm 0.3061 (+0.48z)| lr 7.60e-04 | 4179.03 ms | 32.3% bf16 MFU | 125419 tok/s step 3667/19560 | loss 3.562140 (-2.17z)| norm 0.2779 (+0.08z)| lr 7.60e-04 | 4180.72 ms | 32.3% bf16 MFU | 125418 tok/s step 3668/19560 | loss 3.692367 (+1.25z)| norm 0.3072 (+0.49z)| lr 7.60e-04 | 4164.00 ms | 32.4% bf16 MFU | 125443 tok/s step 3669/19560 | loss 3.586020 (-1.54z)| norm 0.2853 (+0.18z)| lr 7.60e-04 | 4175.43 ms | 32.3% bf16 MFU | 125449 tok/s step 3670/19560 | loss 3.612574 (-0.82z)| norm 0.3090 (+0.51z)| lr 7.60e-04 | 4163.81 ms | 32.4% bf16 MFU | 125472 tok/s step 3671/19560 | loss 3.621348 (-0.58z)| norm 0.2591 (-0.19z)| lr 7.60e-04 | 4173.88 ms | 32.3% bf16 MFU | 125479 tok/s step 3672/19560 | loss 3.581098 (-1.64z)| norm 0.2834 (+0.15z)| lr 7.60e-04 | 4159.15 ms | 32.5% bf16 MFU | 125508 tok/s step 3673/19560 | loss 3.627363 (-0.41z)| norm 0.2978 (+0.34z)| lr 7.60e-04 | 4164.36 ms | 32.4% bf16 MFU | 125527 tok/s step 3674/19560 | loss 
3.694020 (+1.34z)| norm 0.2465 (-0.38z)| lr 7.60e-04 | 4177.26 ms | 32.3% bf16 MFU | 125527 tok/s step 3675/19560 | loss 3.640652 (-0.06z)| norm 0.2777 (+0.06z)| lr 7.60e-04 | 4162.45 ms | 32.4% bf16 MFU | 125548 tok/s step 3676/19560 | loss 3.643657 (+0.01z)| norm 0.2680 (-0.09z)| lr 7.60e-04 | 4233.43 ms | 31.9% bf16 MFU | 125463 tok/s step 3677/19560 | loss 3.691288 (+1.27z)| norm 0.2371 (-0.52z)| lr 7.60e-04 | 4164.80 ms | 32.4% bf16 MFU | 125484 tok/s step 3678/19560 | loss 3.629624 (-0.36z)| norm 0.2461 (-0.40z)| lr 7.60e-04 | 4233.92 ms | 31.9% bf16 MFU | 125401 tok/s step 3679/19560 | loss 3.619347 (-0.63z)| norm 0.2500 (-0.34z)| lr 7.60e-04 | 4159.85 ms | 32.5% bf16 MFU | 125433 tok/s step 3680/19560 | loss 3.649918 (+0.18z)| norm 0.2356 (-0.54z)| lr 7.60e-04 | 4162.87 ms | 32.4% bf16 MFU | 125459 tok/s step 3681/19560 | loss 3.706051 (+1.69z)| norm 0.2351 (-0.55z)| lr 7.60e-04 | 4157.32 ms | 32.5% bf16 MFU | 125491 tok/s step 3682/19560 | loss 3.618602 (-0.64z)| norm 0.2488 (-0.35z)| lr 7.60e-04 | 4228.73 ms | 31.9% bf16 MFU | 125416 tok/s step 3683/19560 | loss 3.653873 (+0.30z)| norm 0.2431 (-0.43z)| lr 7.59e-04 | 4159.06 ms | 32.5% bf16 MFU | 125448 tok/s step 3684/19560 | loss 3.671491 (+0.76z)| norm 0.2206 (-0.74z)| lr 7.59e-04 | 4164.70 ms | 32.4% bf16 MFU | 125470 tok/s step 3685/19560 | loss 3.542552 (-2.61z)| norm 0.2459 (-0.38z)| lr 7.59e-04 | 4177.33 ms | 32.3% bf16 MFU | 125472 tok/s step 3686/19560 | loss 3.678967 (+0.97z)| norm 0.2443 (-0.40z)| lr 7.59e-04 | 4161.94 ms | 32.4% bf16 MFU | 125497 tok/s step 3687/19560 | loss 3.626824 (-0.39z)| norm 0.2388 (-0.47z)| lr 7.59e-04 | 4180.51 ms | 32.3% bf16 MFU | 125493 tok/s step 3688/19560 | loss 3.650692 (+0.23z)| norm 0.2676 (-0.06z)| lr 7.59e-04 | 4174.02 ms | 32.3% bf16 MFU | 125498 tok/s step 3689/19560 | loss 3.635603 (-0.16z)| norm 0.2656 (-0.09z)| lr 7.59e-04 | 4172.98 ms | 32.4% bf16 MFU | 125505 tok/s step 3690/19560 | loss 3.608734 (-0.86z)| norm 0.2386 (-0.46z)| lr 7.59e-04 | 4243.70 
ms | 31.8% bf16 MFU | 125407 tok/s step 3691/19560 | loss 3.642118 (+0.02z)| norm 0.2468 (-0.35z)| lr 7.59e-04 | 4161.54 ms | 32.4% bf16 MFU | 125436 tok/s step 3692/19560 | loss 3.616813 (-0.64z)| norm 0.2514 (-0.28z)| lr 7.59e-04 | 4175.00 ms | 32.3% bf16 MFU | 125443 tok/s step 3693/19560 | loss 3.595358 (-1.18z)| norm 0.2711 (-0.01z)| lr 7.59e-04 | 4171.50 ms | 32.4% bf16 MFU | 125455 tok/s step 3694/19560 | loss 3.706971 (+1.70z)| norm 0.3010 (+0.41z)| lr 7.59e-04 | 4157.61 ms | 32.5% bf16 MFU | 125488 tok/s step 3695/19560 | loss 3.641383 (+0.02z)| norm 0.2747 (+0.04z)| lr 7.59e-04 | 4183.13 ms | 32.3% bf16 MFU | 125480 tok/s step 3696/19560 | loss 3.635019 (-0.15z)| norm 0.2614 (-0.15z)| lr 7.59e-04 | 4166.68 ms | 32.4% bf16 MFU | 125497 tok/s step 3697/19560 | loss 3.642645 (+0.04z)| norm 0.2714 (-0.01z)| lr 7.59e-04 | 4169.25 ms | 32.4% bf16 MFU | 125510 tok/s step 3698/19560 | loss 3.724008 (+2.17z)| norm 0.2806 (+0.12z)| lr 7.59e-04 | 4159.26 ms | 32.5% bf16 MFU | 125537 tok/s step 3699/19560 | loss 3.617709 (-0.62z)| norm 0.2497 (-0.32z)| lr 7.59e-04 | 4199.37 ms | 32.2% bf16 MFU | 125503 tok/s step 3700/19560 | loss 3.625213 (-0.42z)| norm 0.2869 (+0.21z)| lr 7.59e-04 | 4166.52 ms | 32.4% bf16 MFU | 125519 tok/s step 3701/19560 | loss 3.652153 (+0.30z)| norm 0.2473 (-0.35z)| lr 7.59e-04 | 4156.58 ms | 32.5% bf16 MFU | 125550 tok/s step 3702/19560 | loss 3.637192 (-0.12z)| norm 0.2424 (-0.41z)| lr 7.59e-04 | 4163.70 ms | 32.4% bf16 MFU | 125569 tok/s step 3703/19560 | loss 3.693118 (+1.38z)| norm 0.2383 (-0.46z)| lr 7.59e-04 | 4154.79 ms | 32.5% bf16 MFU | 125600 tok/s step 3704/19560 | loss 3.618910 (-0.62z)| norm 0.2309 (-0.56z)| lr 7.59e-04 | 4194.85 ms | 32.2% bf16 MFU | 125569 tok/s step 3705/19560 | loss 3.613868 (-0.75z)| norm 0.2228 (-0.68z)| lr 7.59e-04 | 4154.62 ms | 32.5% bf16 MFU | 125600 tok/s step 3706/19560 | loss 3.669389 (+0.74z)| norm 0.2486 (-0.31z)| lr 7.59e-04 | 4156.55 ms | 32.5% bf16 MFU | 125627 tok/s step 3707/19560 | loss 
3.649584 (+0.20z)| norm 0.2759 (+0.07z)| lr 7.59e-04 | 4229.04 ms | 31.9% bf16 MFU | 125544 tok/s step 3708/19560 | loss 3.657473 (+0.40z)| norm 0.3075 (+0.50z)| lr 7.59e-04 | 4183.67 ms | 32.3% bf16 MFU | 125533 tok/s step 3709/19560 | loss 3.648076 (+0.16z)| norm 0.3243 (+0.73z)| lr 7.59e-04 | 4158.56 ms | 32.5% bf16 MFU | 125560 tok/s step 3710/19560 | loss 3.628234 (-0.37z)| norm 0.3100 (+0.52z)| lr 7.59e-04 | 4165.41 ms | 32.4% bf16 MFU | 125575 tok/s step 3711/19560 | loss 3.609804 (-0.86z)| norm 0.2719 (-0.02z)| lr 7.59e-04 | 4165.47 ms | 32.4% bf16 MFU | 125590 tok/s step 3712/19560 | loss 3.639105 (-0.06z)| norm 0.3025 (+0.41z)| lr 7.59e-04 | 4162.84 ms | 32.4% bf16 MFU | 125607 tok/s step 3713/19560 | loss 3.654555 (+0.36z)| norm 0.3174 (+0.61z)| lr 7.59e-04 | 4183.84 ms | 32.3% bf16 MFU | 125593 tok/s step 3714/19560 | loss 3.618872 (-0.61z)| norm 0.2534 (-0.29z)| lr 7.59e-04 | 4157.92 ms | 32.5% bf16 MFU | 125618 tok/s step 3715/19560 | loss 3.658105 (+0.47z)| norm 0.2981 (+0.33z)| lr 7.59e-04 | 4422.87 ms | 30.5% bf16 MFU | 125264 tok/s step 3716/19560 | loss 3.613228 (-0.78z)| norm 0.2795 (+0.07z)| lr 7.58e-04 | 4173.88 ms | 32.3% bf16 MFU | 125281 tok/s step 3717/19560 | loss 3.602761 (-1.05z)| norm 0.2859 (+0.16z)| lr 7.58e-04 | 4158.55 ms | 32.5% bf16 MFU | 125321 tok/s step 3718/19560 | loss 3.717000 (+2.07z)| norm 0.2688 (-0.08z)| lr 7.58e-04 | 4188.98 ms | 32.2% bf16 MFU | 125313 tok/s step 3719/19560 | loss 3.655835 (+0.39z)| norm 0.2694 (-0.07z)| lr 7.58e-04 | 4156.89 ms | 32.5% bf16 MFU | 125353 tok/s step 3720/19560 | loss 3.714570 (+1.96z)| norm 0.2510 (-0.33z)| lr 7.58e-04 | 4158.58 ms | 32.5% bf16 MFU | 125389 tok/s step 3721/19560 | loss 3.600358 (-1.10z)| norm 0.2453 (-0.40z)| lr 7.58e-04 | 4158.78 ms | 32.5% bf16 MFU | 125423 tok/s step 3722/19560 | loss 3.587057 (-1.43z)| norm 0.2488 (-0.35z)| lr 7.58e-04 | 4178.33 ms | 32.3% bf16 MFU | 125426 tok/s step 3723/19560 | loss 3.535583 (-2.70z)| norm 0.2634 (-0.14z)| lr 7.58e-04 | 4161.18 
ms | 32.4% bf16 MFU | 125455 tok/s step 3724/19560 | loss 3.544432 (-2.40z)| norm 0.2833 (+0.13z)| lr 7.58e-04 | 4153.52 ms | 32.5% bf16 MFU | 125493 tok/s step 3725/19560 | loss 3.616151 (-0.57z)| norm 0.2993 (+0.35z)| lr 7.58e-04 | 4157.57 ms | 32.5% bf16 MFU | 125524 tok/s step 3726/19560 | loss 3.641429 (+0.07z)| norm 0.2558 (-0.26z)| lr 7.58e-04 | 4193.33 ms | 32.2% bf16 MFU | 125499 tok/s step 3727/19560 | loss 3.660861 (+0.57z)| norm 0.2620 (-0.17z)| lr 7.58e-04 | 4243.46 ms | 31.8% bf16 MFU | 125402 tok/s step 3728/19560 | loss 3.671476 (+0.83z)| norm 0.2571 (-0.22z)| lr 7.58e-04 | 4164.15 ms | 32.4% bf16 MFU | 125427 tok/s step 3729/19560 | loss 3.628351 (-0.28z)| norm 0.2287 (-0.92z)| lr 7.58e-04 | 4786.27 ms | 28.2% bf16 MFU | 124633 tok/s step 3730/19560 | loss 3.623503 (-0.39z)| norm 0.2548 (-0.31z)| lr 7.58e-04 | 4659.48 ms | 29.0% bf16 MFU | 124027 tok/s step 3731/19560 | loss 3.566258 (-1.88z)| norm 0.2720 (+0.31z)| lr 7.58e-04 | 4645.65 ms | 29.1% bf16 MFU | 123468 tok/s step 3732/19560 | loss 3.594187 (-1.13z)| norm 0.2661 (+0.12z)| lr 7.58e-04 | 4896.63 ms | 27.6% bf16 MFU | 122649 tok/s step 3733/19560 | loss 3.669167 (+0.85z)| norm 0.2508 (-0.45z)| lr 7.58e-04 | 4256.05 ms | 31.7% bf16 MFU | 122675 tok/s step 3734/19560 | loss 3.687978 (+1.32z)| norm 0.2688 (+0.31z)| lr 7.58e-04 | 4301.89 ms | 31.4% bf16 MFU | 122635 tok/s step 3735/19560 | loss 3.686565 (+1.27z)| norm 0.2724 (+0.46z)| lr 7.58e-04 | 4220.22 ms | 32.0% bf16 MFU | 122715 tok/s step 3736/19560 | loss 3.627956 (-0.27z)| norm 0.2391 (-0.91z)| lr 7.58e-04 | 4165.14 ms | 32.4% bf16 MFU | 122873 tok/s step 3737/19560 | loss 3.658481 (+0.54z)| norm 0.2404 (-0.84z)| lr 7.58e-04 | 4195.98 ms | 32.2% bf16 MFU | 122977 tok/s step 3738/19560 | loss 3.643509 (+0.14z)| norm 0.2662 (+0.24z)| lr 7.58e-04 | 4167.67 ms | 32.4% bf16 MFU | 123118 tok/s step 3739/19560 | loss 3.621713 (-0.42z)| norm 0.2931 (+1.34z)| lr 7.58e-04 | 4234.47 ms | 31.9% bf16 MFU | 123153 tok/s step 3740/19560 | loss 
3.699159 (+1.63z)| norm 0.2709 (+0.40z)| lr 7.58e-04 | 4335.90 ms | 31.1% bf16 MFU | 123041 tok/s step 3741/19560 | loss 3.634329 (-0.10z)| norm 0.2727 (+0.48z)| lr 7.58e-04 | 4238.25 ms | 31.9% bf16 MFU | 123074 tok/s step 3742/19560 | loss 3.592837 (-1.18z)| norm 0.2761 (+0.62z)| lr 7.58e-04 | 4200.00 ms | 32.1% bf16 MFU | 123162 tok/s step 3743/19560 | loss 3.598835 (-1.02z)| norm 0.2477 (-0.57z)| lr 7.58e-04 | 4229.05 ms | 31.9% bf16 MFU | 123203 tok/s step 3744/19560 | loss 3.645473 (+0.22z)| norm 0.2787 (+0.72z)| lr 7.58e-04 | 4305.31 ms | 31.4% bf16 MFU | 123131 tok/s step 3745/19560 | loss 3.634544 (-0.07z)| norm 0.2741 (+0.52z)| lr 7.58e-04 | 4164.19 ms | 32.4% bf16 MFU | 123270 tok/s step 3746/19560 | loss 3.588858 (-1.27z)| norm 0.2860 (+1.00z)| lr 7.58e-04 | 4160.82 ms | 32.4% bf16 MFU | 123407 tok/s step 3747/19560 | loss 3.597966 (-1.02z)| norm 0.2715 (+0.40z)| lr 7.58e-04 | 4245.56 ms | 31.8% bf16 MFU | 123411 tok/s step 3748/19560 | loss 3.687335 (+1.32z)| norm 0.3009 (+1.59z)| lr 7.58e-04 | 4209.41 ms | 32.1% bf16 MFU | 123468 tok/s step 3749/19560 | loss 3.654870 (+0.47z)| norm 0.2427 (-0.82z)| lr 7.58e-04 | 4177.94 ms | 32.3% bf16 MFU | 123569 tok/s step 3750/19560 | loss 3.643928 (+0.17z)| norm 0.2631 (+0.01z)| lr 7.57e-04 | 4206.05 ms | 32.1% bf16 MFU | 123623 tok/s val loss 3.618332
HellaSwag: 2704/10042 = 0.269269 step 3751/19560 | loss 3.641587 (+0.12z)| norm 0.2802 (+0.72z)| lr 7.57e-04 | 4464.45 ms | 30.2% bf16 MFU | 123314 tok/s step 3752/19560 | loss 3.657893 (+0.55z)| norm 0.2507 (-0.54z)| lr 7.57e-04 | 4378.09 ms | 30.8% bf16 MFU | 123136 tok/s step 3753/19560 | loss 3.675269 (+1.01z)| norm 0.2680 (+0.19z)| lr 7.57e-04 | 4226.90 ms | 31.9% bf16 MFU | 123181 tok/s step 3754/19560 | loss 3.576312 (-1.58z)| norm 0.2498 (-0.60z)| lr 7.57e-04 | 4353.74 ms | 31.0% bf16 MFU | 123043 tok/s step 3755/19560 | loss 3.625750 (-0.28z)| norm 0.2649 (+0.04z)| lr 7.57e-04 | 4646.87 ms | 29.1% bf16 MFU | 122532 tok/s step 3756/19560 | loss 3.657473 (+0.55z)| norm 0.2495 (-0.64z)| lr 7.57e-04 | 4196.05 ms | 32.2% bf16 MFU
| 122653 tok/s step 3757/19560 | loss 3.600356 (-0.94z)| norm 0.2565 (-0.34z)| lr 7.57e-04 | 4199.85 ms | 32.1% bf16 MFU | 122762 tok/s step 3758/19560 | loss 3.673976 (+0.97z)| norm 0.2462 (-0.80z)| lr 7.57e-04 | 4285.86 ms | 31.5% bf16 MFU | 122740 tok/s step 3759/19560 | loss 3.664672 (+0.72z)| norm 0.2805 (+0.70z)| lr 7.57e-04 | 4169.00 ms | 32.4% bf16 MFU | 122891 tok/s step 3760/19560 | loss 3.653243 (+0.41z)| norm 0.2641 (-0.03z)| lr 7.57e-04 | 4171.48 ms | 32.4% bf16 MFU | 123031 tok/s step 3761/19560 | loss 3.640926 (+0.10z)| norm 0.2484 (-0.73z)| lr 7.57e-04 | 4166.15 ms | 32.4% bf16 MFU | 123172 tok/s step 3762/19560 | loss 3.685501 (+1.26z)| norm 0.2612 (-0.17z)| lr 7.57e-04 | 4301.43 ms | 31.4% bf16 MFU | 123107 tok/s step 3763/19560 | loss 3.644004 (+0.17z)| norm 0.2793 (+0.63z)| lr 7.57e-04 | 4174.06 ms | 32.3% bf16 MFU | 123232 tok/s step 3764/19560 | loss 3.629272 (-0.22z)| norm 0.3020 (+1.62z)| lr 7.57e-04 | 4193.96 ms | 32.2% bf16 MFU | 123321 tok/s step 3765/19560 | loss 3.718450 (+2.10z)| norm 0.7119 (+9.81z)| lr 7.57e-04 | 4163.22 ms | 32.4% bf16 MFU | 123452 tok/s step 3766/19560 | loss 3.598904 (-1.00z)| norm 0.3763 (+2.30z)| lr 7.57e-04 | 4273.28 ms | 31.6% bf16 MFU | 123414 tok/s step 3767/19560 | loss 3.831583 (+4.56z)| norm 0.4046 (+2.80z)| lr 7.57e-04 | 4172.27 ms | 32.4% bf16 MFU | 123526 tok/s step 3768/19560 | loss 3.689206 (+1.17z)| norm 0.5654 (+5.40z)| lr 7.57e-04 | 4167.85 ms | 32.4% bf16 MFU | 123639 tok/s step 3769/19560 | loss 3.716629 (+1.78z)| norm 0.4384 (+2.92z)| lr 7.57e-04 | 4270.97 ms | 31.6% bf16 MFU | 123595 tok/s step 3770/19560 | loss 3.624875 (-0.36z)| norm 0.3815 (+1.86z)| lr 7.57e-04 | 4189.79 ms | 32.2% bf16 MFU | 123672 tok/s step 3771/19560 | loss 3.636944 (-0.08z)| norm 0.3093 (+0.57z)| lr 7.57e-04 | 4168.12 ms | 32.4% bf16 MFU | 123778 tok/s step 3772/19560 | loss 3.680439 (+0.92z)| norm 0.2890 (+0.21z)| lr 7.57e-04 | 4187.93 ms | 32.2% bf16 MFU | 123848 tok/s step 3773/19560 | loss 3.578210 (-1.44z)| norm 
0.2797 (+0.05z)| lr 7.57e-04 | 4175.47 ms | 32.3% bf16 MFU | 123934 tok/s step 3774/19560 | loss 3.618062 (-0.51z)| norm 0.2768 (-0.01z)| lr 7.57e-04 | 4165.45 ms | 32.4% bf16 MFU | 124031 tok/s step 3775/19560 | loss 3.631929 (-0.21z)| norm 0.2607 (-0.30z)| lr 7.57e-04 | 4193.84 ms | 32.2% bf16 MFU | 124080 tok/s step 3776/19560 | loss 3.669895 (+0.69z)| norm 0.2424 (-0.61z)| lr 7.57e-04 | 4163.78 ms | 32.4% bf16 MFU | 124172 tok/s step 3777/19560 | loss 3.666593 (+0.60z)| norm 0.2556 (-0.37z)| lr 7.57e-04 | 4169.39 ms | 32.4% bf16 MFU | 124251 tok/s step 3778/19560 | loss 3.656074 (+0.35z)| norm 0.2477 (-0.51z)| lr 7.57e-04 | 4174.98 ms | 32.3% bf16 MFU | 124317 tok/s step 3779/19560 | loss 3.735562 (+2.17z)| norm 0.2548 (-0.38z)| lr 7.57e-04 | 4236.44 ms | 31.9% bf16 MFU | 124289 tok/s step 3780/19560 | loss 3.686937 (+1.03z)| norm 0.2730 (-0.06z)| lr 7.57e-04 | 4176.07 ms | 32.3% bf16 MFU | 124352 tok/s step 3781/19560 | loss 3.589331 (-1.21z)| norm 0.2255 (-0.89z)| lr 7.57e-04 | 4165.37 ms | 32.4% bf16 MFU | 124428 tok/s step 3782/19560 | loss 3.706987 (+1.47z)| norm 0.2768 (+0.01z)| lr 7.56e-04 | 4171.54 ms | 32.4% bf16 MFU | 124490 tok/s step 3783/19560 | loss 3.631186 (-0.26z)| norm 0.2787 (+0.04z)| lr 7.56e-04 | 4180.85 ms | 32.3% bf16 MFU | 124536 tok/s step 3784/19560 | loss 3.638551 (-0.09z)| norm 0.2479 (-0.50z)| lr 7.56e-04 | 4173.09 ms | 32.4% bf16 MFU | 124591 tok/s step 3785/19560 | loss 3.604642 (-0.86z)| norm 0.2630 (-0.23z)| lr 7.56e-04 | 4166.51 ms | 32.4% bf16 MFU | 124653 tok/s step 3786/19560 | loss 3.611806 (-0.70z)| norm 0.2578 (-0.33z)| lr 7.56e-04 | 4201.87 ms | 32.1% bf16 MFU | 124659 tok/s step 3787/19560 | loss 3.642534 (-0.02z)| norm 0.2434 (-0.58z)| lr 7.56e-04 | 4174.42 ms | 32.3% bf16 MFU | 124706 tok/s step 3788/19560 | loss 3.642597 (-0.01z)| norm 0.2498 (-0.47z)| lr 7.56e-04 | 4170.05 ms | 32.4% bf16 MFU | 124757 tok/s step 3789/19560 | loss 3.588924 (-1.25z)| norm 0.2286 (-0.83z)| lr 7.56e-04 | 4405.64 ms | 30.6% bf16 MFU | 
124469 tok/s step 3790/19560 | loss 3.603866 (-0.89z)| norm 0.2595 (-0.29z)| lr 7.56e-04 | 4161.01 ms | 32.4% bf16 MFU | 124546 tok/s step 3791/19560 | loss 3.572989 (-1.58z)| norm 0.2493 (-0.47z)| lr 7.56e-04 | 4169.50 ms | 32.4% bf16 MFU | 124606 tok/s step 3792/19560 | loss 3.627974 (-0.31z)| norm 0.2635 (-0.21z)| lr 7.56e-04 | 4170.81 ms | 32.4% bf16 MFU | 124661 tok/s step 3793/19560 | loss 3.683620 (+0.96z)| norm 0.2132 (-1.08z)| lr 7.56e-04 | 4171.59 ms | 32.4% bf16 MFU | 124712 tok/s step 3794/19560 | loss 3.658113 (+0.40z)| norm 0.2585 (-0.28z)| lr 7.56e-04 | 4211.77 ms | 32.1% bf16 MFU | 124700 tok/s step 3795/19560 | loss 3.617849 (-0.56z)| norm 0.2404 (-0.59z)| lr 7.56e-04 | 4161.71 ms | 32.4% bf16 MFU | 124764 tok/s step 3796/19560 | loss 3.575547 (-1.53z)| norm 0.2652 (-0.15z)| lr 7.56e-04 | 4170.87 ms | 32.4% bf16 MFU | 124811 tok/s step 3797/19560 | loss 3.652647 (+0.27z)| norm 0.2515 (-0.38z)| lr 7.56e-04 | 4171.19 ms | 32.4% bf16 MFU | 124855 tok/s step 3798/19560 | loss 3.611290 (-0.70z)| norm 0.2580 (-0.26z)| lr 7.56e-04 | 4172.01 ms | 32.4% bf16 MFU | 124896 tok/s step 3799/19560 | loss 3.675527 (+0.80z)| norm 0.2693 (-0.07z)| lr 7.56e-04 | 4167.81 ms | 32.4% bf16 MFU | 124941 tok/s step 3800/19560 | loss 3.593608 (-1.14z)| norm 0.2676 (-0.09z)| lr 7.56e-04 | 4238.51 ms | 31.9% bf16 MFU | 124878 tok/s step 3801/19560 | loss 3.644119 (+0.06z)| norm 0.2710 (-0.03z)| lr 7.56e-04 | 4179.41 ms | 32.3% bf16 MFU | 124907 tok/s step 3802/19560 | loss 3.728999 (+2.04z)| norm 0.2792 (+0.11z)| lr 7.56e-04 | 4165.14 ms | 32.4% bf16 MFU | 124955 tok/s step 3803/19560 | loss 3.649806 (+0.18z)| norm 0.2648 (-0.14z)| lr 7.56e-04 | 4173.34 ms | 32.4% bf16 MFU | 124989 tok/s step 3804/19560 | loss 3.575207 (-1.54z)| norm 0.2706 (-0.04z)| lr 7.56e-04 | 4168.08 ms | 32.4% bf16 MFU | 125029 tok/s step 3805/19560 | loss 3.685829 (+1.03z)| norm 0.2405 (-0.57z)| lr 7.56e-04 | 4174.85 ms | 32.3% bf16 MFU | 125056 tok/s step 3806/19560 | loss 3.671942 (+0.70z)| norm 
0.2780 (+0.09z)| lr 7.56e-04 | 4178.60 ms | 32.3% bf16 MFU | 125077 tok/s step 3807/19560 | loss 3.642241 (+0.00z)| norm 0.2700 (-0.06z)| lr 7.56e-04 | 4168.56 ms | 32.4% bf16 MFU | 125112 tok/s step 3808/19560 | loss 3.639641 (-0.05z)| norm 0.2298 (-0.77z)| lr 7.56e-04 | 4165.74 ms | 32.4% bf16 MFU | 125149 tok/s step 3809/19560 | loss 3.689992 (+1.12z)| norm 0.2795 (+0.10z)| lr 7.56e-04 | 4175.23 ms | 32.3% bf16 MFU | 125170 tok/s step 3810/19560 | loss 3.640734 (-0.03z)| norm 0.2464 (-0.48z)| lr 7.56e-04 | 4188.06 ms | 32.2% bf16 MFU | 125171 tok/s step 3811/19560 | loss 3.616681 (-0.58z)| norm 0.2632 (-0.19z)| lr 7.56e-04 | 4163.99 ms | 32.4% bf16 MFU | 125208 tok/s step 3812/19560 | loss 3.604774 (-0.85z)| norm 0.2417 (-0.57z)| lr 7.56e-04 | 4166.35 ms | 32.4% bf16 MFU | 125239 tok/s step 3813/19560 | loss 3.602659 (-0.92z)| norm 0.2807 (+0.12z)| lr 7.56e-04 | 4172.92 ms | 32.4% bf16 MFU | 125260 tok/s step 3814/19560 | loss 3.620418 (-0.49z)| norm 0.2891 (+0.26z)| lr 7.56e-04 | 4168.73 ms | 32.4% bf16 MFU | 125285 tok/s step 3815/19560 | loss 3.644053 (+0.06z)| norm 0.2335 (-0.73z)| lr 7.55e-04 | 4208.57 ms | 32.1% bf16 MFU | 125249 tok/s step 3816/19560 | loss 3.625386 (-0.37z)| norm 0.2283 (-0.81z)| lr 7.55e-04 | 4167.01 ms | 32.4% bf16 MFU | 125278 tok/s step 3817/19560 | loss 3.624877 (-0.38z)| norm 0.2273 (-0.82z)| lr 7.55e-04 | 4173.20 ms | 32.4% bf16 MFU | 125296 tok/s step 3818/19560 | loss 3.687585 (+1.09z)| norm 0.2409 (-0.58z)| lr 7.55e-04 | 4228.28 ms | 31.9% bf16 MFU | 125231 tok/s step 3819/19560 | loss 3.648401 (+0.16z)| norm 0.2850 (+0.19z)| lr 7.55e-04 | 4167.85 ms | 32.4% bf16 MFU | 125259 tok/s step 3820/19560 | loss 3.636431 (-0.13z)| norm 0.2923 (+0.31z)| lr 7.55e-04 | 4173.72 ms | 32.3% bf16 MFU | 125277 tok/s step 3821/19560 | loss 3.616826 (-0.60z)| norm 0.3025 (+0.49z)| lr 7.55e-04 | 4195.49 ms | 32.2% bf16 MFU | 125261 tok/s step 3822/19560 | loss 3.617338 (-0.58z)| norm 0.2889 (+0.25z)| lr 7.55e-04 | 4185.25 ms | 32.3% bf16 MFU | 
125262 tok/s step 3823/19560 | loss 3.664130 (+0.54z)| norm 0.3089 (+0.60z)| lr 7.55e-04 | 4168.44 ms | 32.4% bf16 MFU | 125287 tok/s step 3824/19560 | loss 3.635442 (-0.15z)| norm 0.2771 (+0.04z)| lr 7.55e-04 | 4172.37 ms | 32.4% bf16 MFU | 125306 tok/s step 3825/19560 | loss 3.668332 (+0.64z)| norm 0.2559 (-0.34z)| lr 7.55e-04 | 4168.02 ms | 32.4% bf16 MFU | 125330 tok/s step 3826/19560 | loss 3.614740 (-0.63z)| norm 0.2793 (+0.08z)| lr 7.55e-04 | 4173.30 ms | 32.4% bf16 MFU | 125345 tok/s step 3827/19560 | loss 3.670484 (+0.71z)| norm 0.2826 (+0.13z)| lr 7.55e-04 | 4181.86 ms | 32.3% bf16 MFU | 125346 tok/s step 3828/19560 | loss 3.610121 (-0.75z)| norm 0.2965 (+0.37z)| lr 7.55e-04 | 4163.74 ms | 32.4% bf16 MFU | 125375 tok/s step 3829/19560 | loss 3.690812 (+1.19z)| norm 0.2388 (-0.64z)| lr 7.55e-04 | 4156.34 ms | 32.5% bf16 MFU | 125413 tok/s step 3830/19560 | loss 3.587262 (-1.29z)| norm 0.2963 (+0.37z)| lr 7.55e-04 | 4173.83 ms | 32.3% bf16 MFU | 125423 tok/s step 3831/19560 | loss 3.611774 (-0.69z)| norm 0.2504 (-0.45z)| lr 7.55e-04 | 4178.52 ms | 32.3% bf16 MFU | 125426 tok/s step 3832/19560 | loss 3.588031 (-1.25z)| norm 0.2422 (-0.59z)| lr 7.55e-04 | 4167.93 ms | 32.4% bf16 MFU | 125444 tok/s step 3833/19560 | loss 3.596036 (-1.05z)| norm 0.2391 (-0.65z)| lr 7.55e-04 | 4161.78 ms | 32.4% bf16 MFU | 125470 tok/s step 3834/19560 | loss 3.650335 (+0.25z)| norm 0.2620 (-0.25z)| lr 7.55e-04 | 4176.41 ms | 32.3% bf16 MFU | 125474 tok/s step 3835/19560 | loss 3.623390 (-0.39z)| norm 0.2624 (-0.24z)| lr 7.55e-04 | 4168.27 ms | 32.4% bf16 MFU | 125489 tok/s step 3836/19560 | loss 3.576043 (-1.49z)| norm 0.2568 (-0.33z)| lr 7.55e-04 | 4182.44 ms | 32.3% bf16 MFU | 125482 tok/s step 3837/19560 | loss 3.630189 (-0.21z)| norm 0.2384 (-0.65z)| lr 7.55e-04 | 4164.87 ms | 32.4% bf16 MFU | 125502 tok/s step 3838/19560 | loss 3.655658 (+0.39z)| norm 0.2662 (-0.15z)| lr 7.55e-04 | 4169.41 ms | 32.4% bf16 MFU | 125515 tok/s step 3839/19560 | loss 3.615727 (-0.56z)| norm 
0.2421 (-0.57z)| lr 7.55e-04 | 4171.05 ms | 32.4% bf16 MFU | 125524 tok/s step 3840/19560 | loss 3.607034 (-0.75z)| norm 0.2465 (-0.49z)| lr 7.55e-04 | 4174.08 ms | 32.3% bf16 MFU | 125528 tok/s step 3841/19560 | loss 3.661175 (+0.52z)| norm 0.2486 (-0.44z)| lr 7.55e-04 | 4218.25 ms | 32.0% bf16 MFU | 125466 tok/s step 3842/19560 | loss 3.642005 (+0.07z)| norm 0.2710 (-0.04z)| lr 7.55e-04 | 4173.21 ms | 32.4% bf16 MFU | 125474 tok/s step 3843/19560 | loss 3.603115 (-0.84z)| norm 0.2797 (+0.11z)| lr 7.55e-04 | 4167.04 ms | 32.4% bf16 MFU | 125491 tok/s step 3844/19560 | loss 3.646423 (+0.18z)| norm 0.3532 (+1.40z)| lr 7.55e-04 | 4168.90 ms | 32.4% bf16 MFU | 125505 tok/s step 3845/19560 | loss 3.580310 (-1.38z)| norm 0.3301 (+0.99z)| lr 7.55e-04 | 4173.10 ms | 32.4% bf16 MFU | 125511 tok/s step 3846/19560 | loss 3.610502 (-0.65z)| norm 0.2642 (-0.18z)| lr 7.55e-04 | 4175.79 ms | 32.3% bf16 MFU | 125514 tok/s step 3847/19560 | loss 3.610773 (-0.64z)| norm 0.3029 (+0.50z)| lr 7.54e-04 | 4162.65 ms | 32.4% bf16 MFU | 125535 tok/s step 3848/19560 | loss 3.678642 (+0.99z)| norm 0.2697 (-0.09z)| lr 7.54e-04 | 4174.40 ms | 32.3% bf16 MFU | 125538 tok/s step 3849/19560 | loss 3.624581 (-0.31z)| norm 0.4274 (+2.60z)| lr 7.54e-04 | 4168.29 ms | 32.4% bf16 MFU | 125551 tok/s step 3850/19560 | loss 3.624631 (-0.32z)| norm 0.2621 (-0.24z)| lr 7.54e-04 | 4161.36 ms | 32.4% bf16 MFU | 125572 tok/s step 3851/19560 | loss 3.617800 (-0.51z)| norm 0.2653 (-0.19z)| lr 7.54e-04 | 4177.20 ms | 32.3% bf16 MFU | 125569 tok/s step 3852/19560 | loss 3.630080 (-0.23z)| norm 0.2658 (-0.18z)| lr 7.54e-04 | 4165.82 ms | 32.4% bf16 MFU | 125584 tok/s step 3853/19560 | loss 3.599722 (-0.99z)| norm 0.2512 (-0.42z)| lr 7.54e-04 | 4179.30 ms | 32.3% bf16 MFU | 125577 tok/s step 3854/19560 | loss 3.669736 (+0.77z)| norm 0.2914 (+0.27z)| lr 7.54e-04 | 4169.38 ms | 32.4% bf16 MFU | 125585 tok/s step 3855/19560 | loss 3.575095 (-1.59z)| norm 0.2712 (-0.08z)| lr 7.54e-04 | 4164.20 ms | 32.4% bf16 MFU | 
125601 tok/s step 3856/19560 | loss 3.612973 (-0.63z)| norm 0.3017 (+0.44z)| lr 7.54e-04 | 4168.19 ms | 32.4% bf16 MFU | 125610 tok/s step 3857/19560 | loss 3.612801 (-0.63z)| norm 0.2704 (-0.11z)| lr 7.54e-04 | 4171.48 ms | 32.4% bf16 MFU | 125614 tok/s step 3858/19560 | loss 3.611711 (-0.65z)| norm 0.2744 (-0.04z)| lr 7.54e-04 | 4172.23 ms | 32.4% bf16 MFU | 125616 tok/s step 3859/19560 | loss 3.629344 (-0.23z)| norm 0.2709 (-0.10z)| lr 7.54e-04 | 4172.63 ms | 32.4% bf16 MFU | 125618 tok/s step 3860/19560 | loss 3.656672 (+0.45z)| norm 0.2668 (-0.17z)| lr 7.54e-04 | 4162.56 ms | 32.4% bf16 MFU | 125635 tok/s step 3861/19560 | loss 3.618244 (-0.51z)| norm 0.2435 (-0.57z)| lr 7.54e-04 | 4169.07 ms | 32.4% bf16 MFU | 125641 tok/s step 3862/19560 | loss 3.654033 (+0.40z)| norm 0.2544 (-0.38z)| lr 7.54e-04 | 4167.00 ms | 32.4% bf16 MFU | 125650 tok/s step 3863/19560 | loss 3.673151 (+0.90z)| norm 0.2989 (+0.38z)| lr 7.54e-04 | 4164.27 ms | 32.4% bf16 MFU | 125662 tok/s step 3864/19560 | loss 3.636087 (-0.05z)| norm 0.2731 (-0.07z)| lr 7.54e-04 | 4172.07 ms | 32.4% bf16 MFU | 125663 tok/s step 3865/19560 | loss 3.671768 (+0.86z)| norm 0.2516 (-0.44z)| lr 7.54e-04 | 4166.09 ms | 32.4% bf16 MFU | 125672 tok/s step 3866/19560 | loss 3.612106 (-0.66z)| norm 0.2647 (-0.22z)| lr 7.54e-04 | 4173.89 ms | 32.3% bf16 MFU | 125669 tok/s step 3867/19560 | loss 3.712655 (+1.86z)| norm 0.2913 (+0.24z)| lr 7.54e-04 | 4171.99 ms | 32.4% bf16 MFU | 125669 tok/s step 3868/19560 | loss 3.646142 (+0.20z)| norm 0.2786 (+0.02z)| lr 7.54e-04 | 4167.41 ms | 32.4% bf16 MFU | 125676 tok/s step 3869/19560 | loss 3.626257 (-0.31z)| norm 0.2803 (+0.05z)| lr 7.54e-04 | 4165.64 ms | 32.4% bf16 MFU | 125685 tok/s step 3870/19560 | loss 3.630068 (-0.22z)| norm 0.3073 (+0.51z)| lr 7.54e-04 | 4162.01 ms | 32.4% bf16 MFU | 125699 tok/s step 3871/19560 | loss 3.694380 (+1.40z)| norm 0.3164 (+0.66z)| lr 7.54e-04 | 4169.33 ms | 32.4% bf16 MFU | 125702 tok/s step 3872/19560 | loss 3.657109 (+0.45z)| norm 
0.3083 (+0.52z)| lr 7.54e-04 | 4163.93 ms | 32.4% bf16 MFU | 125712 tok/s step 3873/19560 | loss 3.523407 (-2.83z)| norm 0.2826 (+0.07z)| lr 7.54e-04 | 4171.27 ms | 32.4% bf16 MFU | 125711 tok/s step 3874/19560 | loss 3.591715 (-1.15z)| norm 0.2665 (-0.20z)| lr 7.54e-04 | 4170.49 ms | 32.4% bf16 MFU | 125711 tok/s step 3875/19560 | loss 3.651893 (+0.32z)| norm 0.2718 (-0.11z)| lr 7.54e-04 | 4172.82 ms | 32.4% bf16 MFU | 125708 tok/s step 3876/19560 | loss 3.674401 (+0.88z)| norm 0.2632 (-0.25z)| lr 7.54e-04 | 4163.14 ms | 32.4% bf16 MFU | 125719 tok/s step 3877/19560 | loss 3.635771 (-0.07z)| norm 0.2591 (-0.33z)| lr 7.54e-04 | 4169.04 ms | 32.4% bf16 MFU | 125721 tok/s step 3878/19560 | loss 3.601298 (-0.92z)| norm 0.2453 (-0.56z)| lr 7.53e-04 | 4164.26 ms | 32.4% bf16 MFU | 125730 tok/s step 3879/19560 | loss 3.576254 (-1.51z)| norm 0.2410 (-0.63z)| lr 7.53e-04 | 4168.17 ms | 32.4% bf16 MFU | 125733 tok/s step 3880/19560 | loss 3.624865 (-0.31z)| norm 0.2562 (-0.37z)| lr 7.53e-04 | 4164.74 ms | 32.4% bf16 MFU | 125740 tok/s step 3881/19560 | loss 3.619160 (-0.44z)| norm 0.2440 (-0.57z)| lr 7.53e-04 | 4173.81 ms | 32.3% bf16 MFU | 125734 tok/s step 3882/19560 | loss 3.611621 (-0.64z)| norm 0.2455 (-0.55z)| lr 7.53e-04 | 4164.78 ms | 32.4% bf16 MFU | 125742 tok/s step 3883/19560 | loss 3.586643 (-1.24z)| norm 0.2332 (-0.75z)| lr 7.53e-04 | 4162.47 ms | 32.4% bf16 MFU | 125752 tok/s step 3884/19560 | loss 3.597667 (-0.96z)| norm 0.2679 (-0.16z)| lr 7.53e-04 | 4167.55 ms | 32.4% bf16 MFU | 125755 tok/s step 3885/19560 | loss 3.685046 (+1.17z)| norm 0.2801 (+0.04z)| lr 7.53e-04 | 4169.95 ms | 32.4% bf16 MFU | 125754 tok/s step 3886/19560 | loss 3.617408 (-0.48z)| norm 0.3348 (+0.97z)| lr 7.53e-04 | 4179.12 ms | 32.3% bf16 MFU | 125739 tok/s step 3887/19560 | loss 3.590666 (-1.12z)| norm 0.2926 (+0.24z)| lr 7.53e-04 | 4167.11 ms | 32.4% bf16 MFU | 125743 tok/s step 3888/19560 | loss 3.640822 (+0.11z)| norm 0.2685 (-0.17z)| lr 7.53e-04 | 4164.42 ms | 32.4% bf16 MFU | 
125750 tok/s step 3889/19560 | loss 3.656264 (+0.49z)| norm 0.2687 (-0.17z)| lr 7.53e-04 | 4203.06 ms | 32.1% bf16 MFU | 125700 tok/s step 3890/19560 | loss 3.603176 (-0.80z)| norm 0.2480 (-0.52z)| lr 7.53e-04 | 4171.31 ms | 32.4% bf16 MFU | 125699 tok/s step 3891/19560 | loss 3.586709 (-1.19z)| norm 0.2407 (-0.64z)| lr 7.53e-04 | 4169.28 ms | 32.4% bf16 MFU | 125702 tok/s step 3892/19560 | loss 3.692354 (+1.37z)| norm 0.2762 (-0.03z)| lr 7.53e-04 | 4164.06 ms | 32.4% bf16 MFU | 125712 tok/s step 3893/19560 | loss 3.638568 (+0.08z)| norm 0.2557 (-0.42z)| lr 7.53e-04 | 4165.37 ms | 32.4% bf16 MFU | 125720 tok/s step 3894/19560 | loss 3.624585 (-0.27z)| norm 0.2717 (-0.04z)| lr 7.53e-04 | 4168.93 ms | 32.4% bf16 MFU | 125722 tok/s step 3895/19560 | loss 3.630966 (-0.08z)| norm 0.2926 (+0.48z)| lr 7.53e-04 | 4166.17 ms | 32.4% bf16 MFU | 125728 tok/s step 3896/19560 | loss 3.648807 (+0.42z)| norm 0.2444 (-0.79z)| lr 7.53e-04 | 4163.66 ms | 32.4% bf16 MFU | 125738 tok/s step 3897/19560 | loss 3.644820 (+0.34z)| norm 0.2640 (-0.16z)| lr 7.53e-04 | 4176.60 ms | 32.3% bf16 MFU | 125727 tok/s step 3898/19560 | loss 3.640212 (+0.20z)| norm 0.2387 (-1.06z)| lr 7.53e-04 | 4175.90 ms | 32.3% bf16 MFU | 125718 tok/s step 3899/19560 | loss 3.615495 (-0.49z)| norm 0.2489 (-0.67z)| lr 7.53e-04 | 4163.87 ms | 32.4% bf16 MFU | 125728 tok/s step 3900/19560 | loss 3.623387 (-0.26z)| norm 0.2574 (-0.35z)| lr 7.53e-04 | 4181.63 ms | 32.3% bf16 MFU | 125711 tok/s step 3901/19560 | loss 3.635385 (+0.07z)| norm 0.2440 (-0.83z)| lr 7.53e-04 | 4162.06 ms | 32.4% bf16 MFU | 125724 tok/s step 3902/19560 | loss 3.626052 (-0.20z)| norm 0.2489 (-0.65z)| lr 7.53e-04 | 4164.05 ms | 32.4% bf16 MFU | 125733 tok/s step 3903/19560 | loss 3.650856 (+0.51z)| norm 0.2455 (-0.76z)| lr 7.53e-04 | 4221.31 ms | 32.0% bf16 MFU | 125656 tok/s step 3904/19560 | loss 3.632765 (+0.00z)| norm 0.2783 (+0.43z)| lr 7.53e-04 | 4160.44 ms | 32.5% bf16 MFU | 125674 tok/s step 3905/19560 | loss 3.677005 (+1.26z)| norm 
0.2677 (+0.04z)| lr 7.53e-04 | 4166.40 ms | 32.4% bf16 MFU | 125682 tok/s step 3906/19560 | loss 3.592238 (-1.14z)| norm 0.2799 (+0.48z)| lr 7.53e-04 | 4167.43 ms | 32.4% bf16 MFU | 125689 tok/s step 3907/19560 | loss 3.575538 (-1.63z)| norm 0.3025 (+1.29z)| lr 7.53e-04 | 4170.62 ms | 32.4% bf16 MFU | 125690 tok/s step 3908/19560 | loss 3.655798 (+0.74z)| norm 0.2565 (-0.39z)| lr 7.53e-04 | 4175.65 ms | 32.3% bf16 MFU | 125683 tok/s step 3909/19560 | loss 3.648170 (+0.50z)| norm 0.2364 (-1.14z)| lr 7.53e-04 | 4166.37 ms | 32.4% bf16 MFU | 125691 tok/s step 3910/19560 | loss 3.641232 (+0.32z)| norm 0.2830 (+0.58z)| lr 7.52e-04 | 4162.37 ms | 32.4% bf16 MFU | 125704 tok/s step 3911/19560 | loss 3.581047 (-1.48z)| norm 0.2977 (+1.11z)| lr 7.52e-04 | 4164.66 ms | 32.4% bf16 MFU | 125714 tok/s step 3912/19560 | loss 3.634220 (+0.12z)| norm 0.2821 (+0.53z)| lr 7.52e-04 | 4173.59 ms | 32.4% bf16 MFU | 125709 tok/s step 3913/19560 | loss 3.605744 (-0.74z)| norm 0.2948 (+0.98z)| lr 7.52e-04 | 4161.29 ms | 32.4% bf16 MFU | 125723 tok/s step 3914/19560 | loss 3.627080 (-0.10z)| norm 0.2811 (+0.47z)| lr 7.52e-04 | 4163.86 ms | 32.4% bf16 MFU | 125733 tok/s step 3915/19560 | loss 3.580207 (-1.48z)| norm 0.2714 (+0.11z)| lr 7.52e-04 | 4163.27 ms | 32.4% bf16 MFU | 125743 tok/s step 3916/19560 | loss 3.748035 (+3.34z)| norm 0.2521 (-0.60z)| lr 7.52e-04 | 4171.49 ms | 32.4% bf16 MFU | 125740 tok/s step 3917/19560 | loss 3.640604 (+0.27z)| norm 0.2643 (-0.16z)| lr 7.52e-04 | 4173.37 ms | 32.4% bf16 MFU | 125734 tok/s step 3918/19560 | loss 3.604654 (-0.76z)| norm 0.2658 (-0.11z)| lr 7.52e-04 | 4171.63 ms | 32.4% bf16 MFU | 125731 tok/s step 3919/19560 | loss 3.572310 (-1.69z)| norm 0.2581 (-0.40z)| lr 7.52e-04 | 4166.06 ms | 32.4% bf16 MFU | 125737 tok/s step 3920/19560 | loss 3.600923 (-0.86z)| norm 0.2443 (-0.90z)| lr 7.52e-04 | 4642.80 ms | 29.1% bf16 MFU | 125096 tok/s step 3921/19560 | loss 3.597975 (-0.93z)| norm 0.2829 (+0.51z)| lr 7.52e-04 | 4437.19 ms | 30.4% bf16 MFU | 
124749 tok/s step 3922/19560 | loss 3.633435 (+0.09z)| norm 0.2769 (+0.28z)| lr 7.52e-04 | 4464.65 ms | 30.2% bf16 MFU | 124384 tok/s step 3923/19560 | loss 3.643216 (+0.37z)| norm 0.2803 (+0.40z)| lr 7.52e-04 | 4302.76 ms | 31.4% bf16 MFU | 124257 tok/s step 3924/19560 | loss 3.577528 (-1.52z)| norm 0.2603 (-0.35z)| lr 7.52e-04 | 4209.63 ms | 32.1% bf16 MFU | 124271 tok/s step 3925/19560 | loss 3.620373 (-0.28z)| norm 0.2899 (+0.75z)| lr 7.52e-04 | 4238.68 ms | 31.9% bf16 MFU | 124242 tok/s step 3926/19560 | loss 3.819009 (+4.87z)| norm 0.3396 (+2.53z)| lr 7.52e-04 | 4194.46 ms | 32.2% bf16 MFU | 124280 tok/s step 3927/19560 | loss 3.669175 (+0.98z)| norm 0.2825 (+0.43z)| lr 7.52e-04 | 4318.69 ms | 31.3% bf16 MFU | 124136 tok/s step 3928/19560 | loss 3.659055 (+0.70z)| norm 0.2506 (-0.73z)| lr 7.52e-04 | 4196.81 ms | 32.2% bf16 MFU | 124175 tok/s step 3929/19560 | loss 3.631196 (-0.02z)| norm 0.2465 (-0.87z)| lr 7.52e-04 | 4168.40 ms | 32.4% bf16 MFU | 124255 tok/s step 3930/19560 | loss 3.628063 (-0.09z)| norm 0.2803 (+0.36z)| lr 7.52e-04 | 4163.41 ms | 32.4% bf16 MFU | 124339 tok/s step 3931/19560 | loss 3.593668 (-0.99z)| norm 0.2876 (+0.62z)| lr 7.52e-04 | 4171.72 ms | 32.4% bf16 MFU | 124406 tok/s step 3932/19560 | loss 3.614072 (-0.46z)| norm 0.2771 (+0.24z)| lr 7.52e-04 | 4216.17 ms | 32.0% bf16 MFU | 124403 tok/s step 3933/19560 | loss 3.660135 (+0.79z)| norm 0.2519 (-0.69z)| lr 7.52e-04 | 4221.73 ms | 32.0% bf16 MFU | 124392 tok/s step 3934/19560 | loss 3.649519 (+0.51z)| norm 0.2179 (-1.88z)| lr 7.52e-04 | 4168.15 ms | 32.4% bf16 MFU | 124462 tok/s step 3935/19560 | loss 3.633306 (+0.07z)| norm 0.2401 (-1.07z)| lr 7.52e-04 | 4257.51 ms | 31.7% bf16 MFU | 124396 tok/s step 3936/19560 | loss 3.594735 (-0.97z)| norm 0.2706 (+0.01z)| lr 7.52e-04 | 4164.88 ms | 32.4% bf16 MFU | 124470 tok/s step 3937/19560 | loss 3.607976 (-0.60z)| norm 0.2682 (-0.07z)| lr 7.52e-04 | 4168.52 ms | 32.4% bf16 MFU | 124536 tok/s step 3938/19560 | loss 3.637010 (+0.20z)| norm 
0.2919 (+0.77z)| lr 7.52e-04 | 4207.04 ms | 32.1% bf16 MFU | 124540 tok/s step 3939/19560 | loss 3.568818 (-1.64z)| norm 0.2857 (+0.54z)| lr 7.52e-04 | 4153.83 ms | 32.5% bf16 MFU | 124624 tok/s step 3940/19560 | loss 3.650956 (+0.58z)| norm 0.3007 (+1.07z)| lr 7.52e-04 | 4171.15 ms | 32.4% bf16 MFU | 124677 tok/s step 3941/19560 | loss 3.612909 (-0.46z)| norm 0.2559 (-0.55z)| lr 7.51e-04 | 4161.23 ms | 32.4% bf16 MFU | 124743 tok/s step 3942/19560 | loss 3.576778 (-1.42z)| norm 0.2512 (-0.70z)| lr 7.51e-04 | 4166.99 ms | 32.4% bf16 MFU | 124797 tok/s step 3943/19560 | loss 3.577808 (-1.37z)| norm 0.2695 (-0.05z)| lr 7.51e-04 | 4173.05 ms | 32.4% bf16 MFU | 124839 tok/s step 3944/19560 | loss 3.613177 (-0.42z)| norm 0.2751 (+0.14z)| lr 7.51e-04 | 4158.58 ms | 32.5% bf16 MFU | 124901 tok/s step 3945/19560 | loss 3.552160 (-2.00z)| norm 0.2665 (-0.19z)| lr 7.51e-04 | 4163.44 ms | 32.4% bf16 MFU | 124952 tok/s step 3946/19560 | loss 3.557504 (-1.83z)| norm 0.2763 (+0.16z)| lr 7.51e-04 | 4206.60 ms | 32.1% bf16 MFU | 124936 tok/s step 3947/19560 | loss 3.587124 (-1.04z)| norm 0.2547 (-0.63z)| lr 7.51e-04 | 4164.77 ms | 32.4% bf16 MFU | 124984 tok/s step 3948/19560 | loss 3.615286 (-0.30z)| norm 0.2626 (-0.33z)| lr 7.51e-04 | 4175.48 ms | 32.3% bf16 MFU | 125013 tok/s step 3949/19560 | loss 3.605164 (-0.56z)| norm 0.2466 (-0.91z)| lr 7.51e-04 | 4166.60 ms | 32.4% bf16 MFU | 125054 tok/s step 3950/19560 | loss 3.634441 (+0.20z)| norm 0.2380 (-1.22z)| lr 7.51e-04 | 4311.58 ms | 31.3% bf16 MFU | 124881 tok/s step 3951/19560 | loss 3.599173 (-0.71z)| norm 0.2724 (+0.08z)| lr 7.51e-04 | 4160.11 ms | 32.5% bf16 MFU | 124938 tok/s step 3952/19560 | loss 3.549155 (-1.97z)| norm 0.2762 (+0.22z)| lr 7.51e-04 | 4356.00 ms | 31.0% bf16 MFU | 124709 tok/s step 3953/19560 | loss 3.611063 (-0.36z)| norm 0.2663 (-0.15z)| lr 7.51e-04 | 4159.53 ms | 32.5% bf16 MFU | 124776 tok/s step 3954/19560 | loss 3.627149 (+0.05z)| norm 0.2312 (-1.45z)| lr 7.51e-04 | 4162.23 ms | 32.4% bf16 MFU | 
124835 tok/s step 3955/19560 | loss 3.624631 (-0.00z)| norm 0.2384 (-1.16z)| lr 7.51e-04 | 4172.99 ms | 32.4% bf16 MFU | 124876 tok/s step 3956/19560 | loss 3.621868 (-0.08z)| norm 0.2450 (-0.90z)| lr 7.51e-04 | 4166.72 ms | 32.4% bf16 MFU | 124923 tok/s step 3957/19560 | loss 3.594993 (-0.76z)| norm 0.2473 (-0.82z)| lr 7.51e-04 | 4239.12 ms | 31.9% bf16 MFU | 124861 tok/s step 3958/19560 | loss 3.588554 (-0.93z)| norm 0.2451 (-0.89z)| lr 7.51e-04 | 4157.69 ms | 32.5% bf16 MFU | 124923 tok/s step 3959/19560 | loss 3.591331 (-0.86z)| norm 0.2475 (-0.80z)| lr 7.51e-04 | 4178.96 ms | 32.3% bf16 MFU | 124950 tok/s step 3960/19560 | loss 3.620407 (-0.10z)| norm 0.2504 (-0.69z)| lr 7.51e-04 | 4174.06 ms | 32.3% bf16 MFU | 124983 tok/s step 3961/19560 | loss 3.612948 (-0.30z)| norm 0.2427 (-0.98z)| lr 7.51e-04 | 4169.68 ms | 32.4% bf16 MFU | 125020 tok/s step 3962/19560 | loss 3.599333 (-0.65z)| norm 0.2823 (+0.49z)| lr 7.51e-04 | 4165.37 ms | 32.4% bf16 MFU | 125063 tok/s step 3963/19560 | loss 3.604619 (-0.50z)| norm 0.3098 (+1.49z)| lr 7.51e-04 | 4157.86 ms | 32.5% bf16 MFU | 125114 tok/s step 3964/19560 | loss 3.571301 (-1.38z)| norm 0.3215 (+1.88z)| lr 7.51e-04 | 4171.14 ms | 32.4% bf16 MFU | 125143 tok/s step 3965/19560 | loss 3.633073 (+0.24z)| norm 0.2813 (+0.40z)| lr 7.51e-04 | 4156.81 ms | 32.5% bf16 MFU | 125193 tok/s step 3966/19560 | loss 3.586282 (-0.97z)| norm 0.2754 (+0.18z)| lr 7.51e-04 | 4365.82 ms | 30.9% bf16 MFU | 124937 tok/s step 3967/19560 | loss 3.637565 (+0.37z)| norm 0.2370 (-1.22z)| lr 7.51e-04 | 4170.09 ms | 32.4% bf16 MFU | 124977 tok/s step 3968/19560 | loss 3.596170 (-0.71z)| norm 0.2428 (-1.01z)| lr 7.51e-04 | 4166.79 ms | 32.4% bf16 MFU | 125019 tok/s step 3969/19560 | loss 3.603796 (-0.50z)| norm 0.2548 (-0.57z)| lr 7.51e-04 | 4172.21 ms | 32.4% bf16 MFU | 125051 tok/s step 3970/19560 | loss 3.611018 (-0.31z)| norm 0.2140 (-2.02z)| lr 7.51e-04 | 4154.83 ms | 32.5% bf16 MFU | 125108 tok/s step 3971/19560 | loss 3.640139 (+0.45z)| norm 
0.2746 (+0.17z)| lr 7.51e-04 | 4178.54 ms | 32.3% bf16 MFU | 125126 tok/s step 3972/19560 | loss 3.630776 (+0.21z)| norm 0.2943 (+0.93z)| lr 7.50e-04 | 4155.27 ms | 32.5% bf16 MFU | 125179 tok/s step 3973/19560 | loss 3.614648 (-0.22z)| norm 0.2670 (-0.08z)| lr 7.50e-04 | 4199.61 ms | 32.1% bf16 MFU | 125162 tok/s step 3974/19560 | loss 3.590773 (-0.85z)| norm 0.2732 (+0.16z)| lr 7.50e-04 | 4162.56 ms | 32.4% bf16 MFU | 125201 tok/s step 3975/19560 | loss 3.652113 (+0.76z)| norm 0.2626 (-0.24z)| lr 7.50e-04 | 4166.94 ms | 32.4% bf16 MFU | 125232 tok/s step 3976/19560 | loss 3.570311 (-1.37z)| norm 0.2614 (-0.28z)| lr 7.50e-04 | 4165.96 ms | 32.4% bf16 MFU | 125263 tok/s step 3977/19560 | loss 3.625540 (+0.08z)| norm 0.2843 (+0.76z)| lr 7.50e-04 | 4172.13 ms | 32.4% bf16 MFU | 125283 tok/s step 3978/19560 | loss 3.596588 (-0.67z)| norm 0.2794 (+0.53z)| lr 7.50e-04 | 4161.85 ms | 32.4% bf16 MFU | 125318 tok/s step 3979/19560 | loss 3.572036 (-1.30z)| norm 0.2585 (-0.42z)| lr 7.50e-04 | 4156.51 ms | 32.5% bf16 MFU | 125359 tok/s step 3980/19560 | loss 3.676397 (+1.41z)| norm 0.2713 (+0.16z)| lr 7.50e-04 | 4157.59 ms | 32.5% bf16 MFU | 125396 tok/s step 3981/19560 | loss 3.529224 (-2.35z)| norm 0.2494 (-0.83z)| lr 7.50e-04 | 4186.27 ms | 32.3% bf16 MFU | 125388 tok/s step 3982/19560 | loss 3.595576 (-0.65z)| norm 0.2715 (+0.18z)| lr 7.50e-04 | 4163.49 ms | 32.4% bf16 MFU | 125415 tok/s step 3983/19560 | loss 3.633931 (+0.32z)| norm 0.2554 (-0.55z)| lr 7.50e-04 | 4159.40 ms | 32.5% bf16 MFU | 125447 tok/s step 3984/19560 | loss 3.577757 (-1.11z)| norm 0.2848 (+0.80z)| lr 7.50e-04 | 4171.86 ms | 32.4% bf16 MFU | 125458 tok/s step 3985/19560 | loss 3.608177 (-0.33z)| norm 0.2816 (+0.65z)| lr 7.50e-04 | 4174.41 ms | 32.3% bf16 MFU | 125465 tok/s step 3986/19560 | loss 3.609737 (-0.29z)| norm 0.2758 (+0.39z)| lr 7.50e-04 | 4170.25 ms | 32.4% bf16 MFU | 125478 tok/s step 3987/19560 | loss 3.619194 (-0.05z)| norm 0.2539 (-0.61z)| lr 7.50e-04 | 4159.93 ms | 32.5% bf16 MFU | 
125506 tok/s step 3988/19560 | loss 3.596332 (-0.62z)| norm 0.2758 (+0.39z)| lr 7.50e-04 | 4169.00 ms | 32.4% bf16 MFU | 125518 tok/s step 3989/19560 | loss 3.584691 (-0.91z)| norm 0.2498 (-0.80z)| lr 7.50e-04 | 4187.55 ms | 32.2% bf16 MFU | 125502 tok/s step 3990/19560 | loss 3.630005 (+0.25z)| norm 0.2530 (-0.66z)| lr 7.50e-04 | 4169.81 ms | 32.4% bf16 MFU | 125514 tok/s step 3991/19560 | loss 3.523912 (-2.39z)| norm 0.2554 (-0.53z)| lr 7.50e-04 | 4164.79 ms | 32.4% bf16 MFU | 125533 tok/s step 3992/19560 | loss 3.585942 (-0.82z)| norm 0.2520 (-0.68z)| lr 7.50e-04 | 4176.30 ms | 32.3% bf16 MFU | 125533 tok/s step 3993/19560 | loss 3.581684 (-0.91z)| norm 0.2450 (-1.00z)| lr 7.50e-04 | 4324.77 ms | 31.2% bf16 MFU | 125318 tok/s step 3994/19560 | loss 3.604827 (-0.33z)| norm 0.2586 (-0.37z)| lr 7.50e-04 | 4157.65 ms | 32.5% bf16 MFU | 125357 tok/s step 3995/19560 | loss 3.603501 (-0.35z)| norm 0.2674 (+0.04z)| lr 7.50e-04 | 4170.51 ms | 32.4% bf16 MFU | 125375 tok/s step 3996/19560 | loss 3.628310 (+0.30z)| norm 0.2493 (-0.78z)| lr 7.50e-04 | 4174.17 ms | 32.3% bf16 MFU | 125386 tok/s step 3997/19560 | loss 3.615948 (-0.02z)| norm 0.2557 (-0.48z)| lr 7.50e-04 | 4159.14 ms | 32.5% bf16 MFU | 125420 tok/s step 3998/19560 | loss 3.588545 (-0.72z)| norm 0.2653 (-0.03z)| lr 7.50e-04 | 4158.02 ms | 32.5% bf16 MFU | 125453 tok/s step 3999/19560 | loss 3.591556 (-0.63z)| norm 0.2573 (-0.39z)| lr 7.50e-04 | 4160.78 ms | 32.4% bf16 MFU | 125481 tok/s step 4000/19560 | loss 3.620850 (+0.14z)| norm 0.2575 (-0.36z)| lr 7.50e-04 | 4358.69 ms | 31.0% bf16 MFU | 125221 tok/s val loss 3.596193 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 
evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating 
HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2735/10042 = 0.272356 step 4001/19560 | loss 3.591045 (-0.67z)| norm 0.2459 (-0.91z)| lr 7.50e-04 | 4260.17 ms | 31.7% bf16 MFU | 125113 tok/s step 4002/19560 | loss 3.571930 (-1.17z)| norm 0.2363 (-1.36z)| lr 7.49e-04 | 4357.53 ms | 31.0% bf16 MFU | 124874 tok/s step 4003/19560 | loss 3.640035 (+0.65z)| norm 0.2253 (-1.85z)| lr 7.49e-04 | 4286.21 ms | 31.5% bf16 MFU | 124746 tok/s step 4004/19560 | loss 3.561911 (-1.42z)| norm 0.2377 (-1.24z)| lr 
7.49e-04 | 4290.10 ms | 31.5% bf16 MFU | 124619 tok/s step 4005/19560 | loss 3.636038 (+0.57z)| norm 0.2380 (-1.21z)| lr 7.49e-04 | 4162.46 ms | 32.4% bf16 MFU | 124686 tok/s step 4006/19560 | loss 3.633191 (+0.49z)| norm 0.2449 (-0.88z)| lr 7.49e-04 | 4156.60 ms | 32.5% bf16 MFU | 124758 tok/s step 4007/19560 | loss 3.543732 (-1.89z)| norm 0.2366 (-1.27z)| lr 7.49e-04 | 4169.51 ms | 32.4% bf16 MFU | 124808 tok/s step 4008/19560 | loss 3.585472 (-0.77z)| norm 0.2610 (-0.13z)| lr 7.49e-04 | 4156.77 ms | 32.5% bf16 MFU | 124874 tok/s step 4009/19560 | loss 3.634454 (+0.53z)| norm 0.3112 (+2.17z)| lr 7.49e-04 | 4157.81 ms | 32.5% bf16 MFU | 124935 tok/s step 4010/19560 | loss 3.663652 (+1.28z)| norm 0.3344 (+3.10z)| lr 7.49e-04 | 4267.44 ms | 31.6% bf16 MFU | 124831 tok/s step 4011/19560 | loss 3.638510 (+0.61z)| norm 0.2498 (-0.69z)| lr 7.49e-04 | 4160.00 ms | 32.5% bf16 MFU | 124891 tok/s step 4012/19560 | loss 3.599740 (-0.41z)| norm 0.2896 (+1.09z)| lr 7.49e-04 | 4181.46 ms | 32.3% bf16 MFU | 124916 tok/s step 4013/19560 | loss 3.636235 (+0.57z)| norm 0.2808 (+0.69z)| lr 7.49e-04 | 4159.53 ms | 32.5% bf16 MFU | 124972 tok/s step 4014/19560 | loss 3.630332 (+0.41z)| norm 0.2705 (+0.27z)| lr 7.49e-04 | 4161.62 ms | 32.4% bf16 MFU | 125023 tok/s step 4015/19560 | loss 3.593641 (-0.57z)| norm 0.3159 (+2.33z)| lr 7.49e-04 | 4166.17 ms | 32.4% bf16 MFU | 125064 tok/s step 4016/19560 | loss 3.624442 (+0.25z)| norm 0.2809 (+0.72z)| lr 7.49e-04 | 4159.79 ms | 32.5% bf16 MFU | 125112 tok/s step 4017/19560 | loss 3.547480 (-1.77z)| norm 0.3153 (+2.23z)| lr 7.49e-04 | 4419.11 ms | 30.6% bf16 MFU | 124789 tok/s step 4018/19560 | loss 3.571187 (-1.13z)| norm 0.2810 (+0.69z)| lr 7.49e-04 | 4349.90 ms | 31.0% bf16 MFU | 124576 tok/s step 4019/19560 | loss 3.601986 (-0.32z)| norm 0.2713 (+0.24z)| lr 7.49e-04 | 4155.25 ms | 32.5% bf16 MFU | 124656 tok/s step 4020/19560 | loss 3.615608 (+0.06z)| norm 0.2472 (-0.83z)| lr 7.49e-04 | 4163.15 ms | 32.4% bf16 MFU | 124720 tok/s step 
4021/19560 | loss 3.571453 (-1.11z)| norm 0.2982 (+1.43z)| lr 7.49e-04 | 4161.12 ms | 32.4% bf16 MFU | 124784 tok/s step 4022/19560 | loss 3.664122 (+1.36z)| norm 0.2774 (+0.51z)| lr 7.49e-04 | 4269.94 ms | 31.6% bf16 MFU | 124684 tok/s step 4023/19560 | loss 3.588986 (-0.64z)| norm 0.2852 (+0.86z)| lr 7.49e-04 | 4159.97 ms | 32.5% bf16 MFU | 124751 tok/s step 4024/19560 | loss 3.592111 (-0.54z)| norm 0.2877 (+0.96z)| lr 7.49e-04 | 4169.03 ms | 32.4% bf16 MFU | 124801 tok/s step 4025/19560 | loss 3.675133 (+1.65z)| norm 0.2601 (-0.28z)| lr 7.49e-04 | 4156.18 ms | 32.5% bf16 MFU | 124869 tok/s step 4026/19560 | loss 3.602503 (-0.26z)| norm 0.2646 (-0.08z)| lr 7.49e-04 | 4168.45 ms | 32.4% bf16 MFU | 124914 tok/s step 4027/19560 | loss 3.514404 (-2.51z)| norm 0.2395 (-1.20z)| lr 7.49e-04 | 4159.43 ms | 32.5% bf16 MFU | 124971 tok/s step 4028/19560 | loss 3.568181 (-1.11z)| norm 0.2615 (-0.22z)| lr 7.49e-04 | 4170.75 ms | 32.4% bf16 MFU | 125007 tok/s step 4029/19560 | loss 3.538160 (-1.84z)| norm 0.2816 (+0.67z)| lr 7.49e-04 | 4312.78 ms | 31.3% bf16 MFU | 124835 tok/s step 4030/19560 | loss 3.595268 (-0.38z)| norm 0.3097 (+1.89z)| lr 7.49e-04 | 4172.37 ms | 32.4% bf16 MFU | 124876 tok/s step 4031/19560 | loss 3.609250 (-0.02z)| norm 0.3086 (+1.80z)| lr 7.49e-04 | 4167.71 ms | 32.4% bf16 MFU | 124923 tok/s step 4032/19560 | loss 3.610239 (+0.01z)| norm 0.3029 (+1.53z)| lr 7.49e-04 | 4175.77 ms | 32.3% bf16 MFU | 124954 tok/s step 4033/19560 | loss 3.570236 (-1.00z)| norm 0.2752 (+0.32z)| lr 7.48e-04 | 4171.72 ms | 32.4% bf16 MFU | 124990 tok/s step 4034/19560 | loss 3.654595 (+1.16z)| norm 0.2534 (-0.62z)| lr 7.48e-04 | 4163.88 ms | 32.4% bf16 MFU | 125036 tok/s step 4035/19560 | loss 3.619644 (+0.26z)| norm 0.2590 (-0.37z)| lr 7.48e-04 | 4161.59 ms | 32.4% bf16 MFU | 125084 tok/s step 4036/19560 | loss 3.573025 (-0.93z)| norm 0.2660 (-0.06z)| lr 7.48e-04 | 4156.60 ms | 32.5% bf16 MFU | 125136 tok/s step 4037/19560 | loss 3.644888 (+0.93z)| norm 0.2561 (-0.51z)| lr 
7.48e-04 | 4187.72 ms | 32.2% bf16 MFU | 125139 tok/s step 4038/19560 | loss 3.534544 (-1.88z)| norm 0.2294 (-1.66z)| lr 7.48e-04 | 4162.52 ms | 32.4% bf16 MFU | 125180 tok/s step 4039/19560 | loss 3.615917 (+0.19z)| norm 0.2413 (-1.12z)| lr 7.48e-04 | 4157.31 ms | 32.5% bf16 MFU | 125227 tok/s step 4040/19560 | loss 3.607738 (-0.01z)| norm 0.2494 (-0.75z)| lr 7.48e-04 | 4176.05 ms | 32.3% bf16 MFU | 125243 tok/s step 4041/19560 | loss 3.673264 (+1.63z)| norm 0.2586 (-0.34z)| lr 7.48e-04 | 4159.41 ms | 32.5% bf16 MFU | 125283 tok/s step 4042/19560 | loss 3.538080 (-1.75z)| norm 0.2280 (-1.65z)| lr 7.48e-04 | 4172.70 ms | 32.4% bf16 MFU | 125301 tok/s step 4043/19560 | loss 3.628053 (+0.49z)| norm 0.2498 (-0.69z)| lr 7.48e-04 | 4172.89 ms | 32.4% bf16 MFU | 125318 tok/s step 4044/19560 | loss 3.596479 (-0.28z)| norm 0.3025 (+1.58z)| lr 7.48e-04 | 4156.91 ms | 32.5% bf16 MFU | 125358 tok/s step 4045/19560 | loss 3.670518 (+1.65z)| norm 0.3038 (+1.60z)| lr 7.48e-04 | 4172.15 ms | 32.4% bf16 MFU | 125374 tok/s step 4046/19560 | loss 3.605062 (-0.06z)| norm 0.2619 (-0.19z)| lr 7.48e-04 | 4157.44 ms | 32.5% bf16 MFU | 125410 tok/s step 4047/19560 | loss 3.661186 (+1.38z)| norm 0.2866 (+0.86z)| lr 7.48e-04 | 4176.59 ms | 32.3% bf16 MFU | 125416 tok/s step 4048/19560 | loss 3.622035 (+0.36z)| norm 0.2669 (+0.01z)| lr 7.48e-04 | 4222.35 ms | 32.0% bf16 MFU | 125354 tok/s step 4049/19560 | loss 3.572233 (-0.93z)| norm 0.2616 (-0.21z)| lr 7.48e-04 | 4170.47 ms | 32.4% bf16 MFU | 125372 tok/s step 4050/19560 | loss 3.530104 (-1.98z)| norm 0.2816 (+0.64z)| lr 7.48e-04 | 4158.05 ms | 32.5% bf16 MFU | 125408 tok/s step 4051/19560 | loss 3.587985 (-0.49z)| norm 0.2650 (-0.06z)| lr 7.48e-04 | 4207.86 ms | 32.1% bf16 MFU | 125367 tok/s step 4052/19560 | loss 3.566540 (-1.03z)| norm 0.2421 (-1.04z)| lr 7.48e-04 | 4156.46 ms | 32.5% bf16 MFU | 125406 tok/s step 4053/19560 | loss 3.609776 (+0.08z)| norm 0.2499 (-0.69z)| lr 7.48e-04 | 4161.08 ms | 32.4% bf16 MFU | 125436 tok/s step 
4054/19560 | loss 3.628471 (+0.68z)| norm 0.3005 (+1.54z)| lr 7.48e-04 | 4157.03 ms | 32.5% bf16 MFU | 125470 tok/s step 4055/19560 | loss 3.604285 (-0.01z)| norm 0.3056 (+1.74z)| lr 7.48e-04 | 4397.09 ms | 30.7% bf16 MFU | 125158 tok/s step 4056/19560 | loss 3.592901 (-0.34z)| norm 0.3134 (+2.03z)| lr 7.48e-04 | 4262.63 ms | 31.7% bf16 MFU | 125050 tok/s step 4057/19560 | loss 3.596545 (-0.22z)| norm 0.2899 (+1.00z)| lr 7.48e-04 | 4156.40 ms | 32.5% bf16 MFU | 125105 tok/s step 4058/19560 | loss 3.668083 (+1.89z)| norm 0.2655 (-0.05z)| lr 7.48e-04 | 4161.72 ms | 32.4% bf16 MFU | 125148 tok/s step 4059/19560 | loss 3.566112 (-1.12z)| norm 0.2819 (+0.66z)| lr 7.48e-04 | 4167.60 ms | 32.4% bf16 MFU | 125181 tok/s step 4060/19560 | loss 3.642348 (+1.12z)| norm 0.2788 (+0.53z)| lr 7.48e-04 | 4156.75 ms | 32.5% bf16 MFU | 125228 tok/s step 4061/19560 | loss 3.626507 (+0.67z)| norm 0.2698 (+0.14z)| lr 7.48e-04 | 4175.02 ms | 32.3% bf16 MFU | 125246 tok/s step 4062/19560 | loss 3.710246 (+3.04z)| norm 0.2725 (+0.24z)| lr 7.47e-04 | 4156.18 ms | 32.5% bf16 MFU | 125291 tok/s step 4063/19560 | loss 3.573854 (-0.87z)| norm 0.2410 (-1.15z)| lr 7.47e-04 | 4157.72 ms | 32.5% bf16 MFU | 125331 tok/s step 4064/19560 | loss 3.587983 (-0.46z)| norm 0.2606 (-0.28z)| lr 7.47e-04 | 4172.64 ms | 32.4% bf16 MFU | 125347 tok/s step 4065/19560 | loss 3.634049 (+0.85z)| norm 0.2412 (-1.13z)| lr 7.47e-04 | 4166.72 ms | 32.4% bf16 MFU | 125371 tok/s step 4066/19560 | loss 3.629418 (+0.72z)| norm 0.2523 (-0.63z)| lr 7.47e-04 | 4155.88 ms | 32.5% bf16 MFU | 125410 tok/s step 4067/19560 | loss 3.610877 (+0.18z)| norm 0.2445 (-0.95z)| lr 7.47e-04 | 4172.70 ms | 32.4% bf16 MFU | 125422 tok/s step 4068/19560 | loss 3.557604 (-1.33z)| norm 0.2800 (+0.62z)| lr 7.47e-04 | 4206.91 ms | 32.1% bf16 MFU | 125382 tok/s step 4069/19560 | loss 3.587950 (-0.45z)| norm 0.2593 (-0.30z)| lr 7.47e-04 | 4179.94 ms | 32.3% bf16 MFU | 125385 tok/s step 4070/19560 | loss 3.597392 (-0.18z)| norm 0.2468 (-0.85z)| lr 
7.47e-04 | 4155.68 ms | 32.5% bf16 MFU | 125424 tok/s step 4071/19560 | loss 3.567012 (-1.05z)| norm 0.2440 (-0.97z)| lr 7.47e-04 | 4157.79 ms | 32.5% bf16 MFU | 125457 tok/s step 4072/19560 | loss 3.662633 (+1.67z)| norm 0.2637 (-0.09z)| lr 7.47e-04 | 4365.43 ms | 30.9% bf16 MFU | 125189 tok/s step 4073/19560 | loss 3.645802 (+1.17z)| norm 0.2559 (-0.43z)| lr 7.47e-04 | 4154.74 ms | 32.5% bf16 MFU | 125239 tok/s step 4074/19560 | loss 3.594121 (-0.31z)| norm 0.2507 (-0.65z)| lr 7.47e-04 | 4165.54 ms | 32.4% bf16 MFU | 125271 tok/s step 4075/19560 | loss 3.566431 (-1.10z)| norm 0.2394 (-1.14z)| lr 7.47e-04 | 4235.38 ms | 31.9% bf16 MFU | 125197 tok/s step 4076/19560 | loss 3.605427 (+0.02z)| norm 0.2525 (-0.56z)| lr 7.47e-04 | 4163.64 ms | 32.4% bf16 MFU | 125233 tok/s step 4077/19560 | loss 3.628051 (+0.66z)| norm 0.2691 (+0.16z)| lr 7.47e-04 | 4179.17 ms | 32.3% bf16 MFU | 125244 tok/s step 4078/19560 | loss 3.615383 (+0.30z)| norm 0.2337 (-1.39z)| lr 7.47e-04 | 4159.60 ms | 32.5% bf16 MFU | 125284 tok/s step 4079/19560 | loss 3.660681 (+1.57z)| norm 0.3556 (+3.71z)| lr 7.47e-04 | 4153.40 ms | 32.5% bf16 MFU | 125331 tok/s step 4080/19560 | loss 3.588901 (-0.48z)| norm 0.2956 (+1.21z)| lr 7.47e-04 | 4219.65 ms | 32.0% bf16 MFU | 125277 tok/s step 4081/19560 | loss 3.588640 (-0.48z)| norm 0.3189 (+2.12z)| lr 7.47e-04 | 4159.25 ms | 32.5% bf16 MFU | 125316 tok/s step 4082/19560 | loss 3.573997 (-0.89z)| norm 0.2709 (+0.16z)| lr 7.47e-04 | 4163.73 ms | 32.4% bf16 MFU | 125346 tok/s step 4083/19560 | loss 3.621096 (+0.46z)| norm 0.2705 (+0.13z)| lr 7.47e-04 | 4168.44 ms | 32.4% bf16 MFU | 125367 tok/s step 4084/19560 | loss 3.633832 (+0.82z)| norm 0.2629 (-0.18z)| lr 7.47e-04 | 4151.44 ms | 32.5% bf16 MFU | 125414 tok/s step 4085/19560 | loss 3.582397 (-0.64z)| norm 0.2367 (-1.26z)| lr 7.47e-04 | 4164.38 ms | 32.4% bf16 MFU | 125438 tok/s step 4086/19560 | loss 3.592240 (-0.36z)| norm 0.2766 (+0.38z)| lr 7.47e-04 | 4153.92 ms | 32.5% bf16 MFU | 125477 tok/s step 
4087/19560 | loss 3.687698 (+2.29z)| norm 0.2509 (-0.69z)| lr 7.47e-04 | 4155.56 ms | 32.5% bf16 MFU | 125511 tok/s step 4088/19560 | loss 3.558326 (-1.30z)| norm 0.2428 (-1.01z)| lr 7.47e-04 | 4164.08 ms | 32.4% bf16 MFU | 125531 tok/s step 4089/19560 | loss 3.586470 (-0.52z)| norm 0.2531 (-0.60z)| lr 7.47e-04 | 4155.47 ms | 32.5% bf16 MFU | 125563 tok/s step 4090/19560 | loss 3.610345 (+0.14z)| norm 0.2617 (-0.23z)| lr 7.47e-04 | 4163.47 ms | 32.4% bf16 MFU | 125581 tok/s step 4091/19560 | loss 3.586956 (-0.50z)| norm 0.2700 (+0.12z)| lr 7.47e-04 | 4286.48 ms | 31.5% bf16 MFU | 125417 tok/s step 4092/19560 | loss 3.610265 (+0.14z)| norm 0.2537 (-0.55z)| lr 7.46e-04 | 4197.07 ms | 32.2% bf16 MFU | 125392 tok/s step 4093/19560 | loss 3.621701 (+0.46z)| norm 0.2587 (-0.33z)| lr 7.46e-04 | 4203.49 ms | 32.1% bf16 MFU | 125359 tok/s step 4094/19560 | loss 3.567278 (-1.05z)| norm 0.2398 (-1.12z)| lr 7.46e-04 | 4152.67 ms | 32.5% bf16 MFU | 125404 tok/s step 4095/19560 | loss 3.638095 (+0.92z)| norm 0.2424 (-1.01z)| lr 7.46e-04 | 4361.44 ms | 31.0% bf16 MFU | 125144 tok/s step 4096/19560 | loss 3.634218 (+0.80z)| norm 0.2524 (-0.59z)| lr 7.46e-04 | 4207.02 ms | 32.1% bf16 MFU | 125118 tok/s step 4097/19560 | loss 3.597606 (-0.21z)| norm 0.2600 (-0.27z)| lr 7.46e-04 | 4223.78 ms | 32.0% bf16 MFU | 125069 tok/s step 4098/19560 | loss 3.590927 (-0.39z)| norm 0.2695 (+0.12z)| lr 7.46e-04 | 4434.11 ms | 30.4% bf16 MFU | 124727 tok/s step 4099/19560 | loss 3.597819 (-0.20z)| norm 0.2583 (-0.36z)| lr 7.46e-04 | 4683.26 ms | 28.8% bf16 MFU | 124088 tok/s step 4100/19560 | loss 3.538386 (-1.81z)| norm 0.2492 (-0.74z)| lr 7.46e-04 | 4149.23 ms | 32.5% bf16 MFU | 124202 tok/s step 4101/19560 | loss 3.567792 (-0.98z)| norm 0.2979 (+1.36z)| lr 7.46e-04 | 4166.01 ms | 32.4% bf16 MFU | 124284 tok/s step 4102/19560 | loss 3.574943 (-0.78z)| norm 0.3209 (+2.30z)| lr 7.46e-04 | 4150.57 ms | 32.5% bf16 MFU | 124386 tok/s step 4103/19560 | loss 3.590471 (-0.35z)| norm 0.2947 (+1.17z)| lr 
7.46e-04 | 4158.61 ms | 32.5% bf16 MFU | 124470 tok/s step 4104/19560 | loss 3.616713 (+0.36z)| norm 0.2965 (+1.23z)| lr 7.46e-04 | 4152.17 ms | 32.5% bf16 MFU | 124560 tok/s step 4105/19560 | loss 3.658967 (+1.51z)| norm 0.2536 (-0.57z)| lr 7.46e-04 | 4156.81 ms | 32.5% bf16 MFU | 124638 tok/s step 4106/19560 | loss 3.623812 (+0.54z)| norm 0.2438 (-0.97z)| lr 7.46e-04 | 4159.90 ms | 32.5% bf16 MFU | 124708 tok/s step 4107/19560 | loss 3.572833 (-0.85z)| norm 0.2544 (-0.52z)| lr 7.46e-04 | 4158.28 ms | 32.5% bf16 MFU | 124777 tok/s step 4108/19560 | loss 3.598382 (-0.14z)| norm 0.2343 (-1.34z)| lr 7.46e-04 | 4168.14 ms | 32.4% bf16 MFU | 124827 tok/s step 4109/19560 | loss 3.603740 (-0.01z)| norm 0.2561 (-0.44z)| lr 7.46e-04 | 4227.21 ms | 31.9% bf16 MFU | 124787 tok/s step 4110/19560 | loss 3.550100 (-1.50z)| norm 0.2407 (-1.06z)| lr 7.46e-04 | 4326.43 ms | 31.2% bf16 MFU | 124607 tok/s step 4111/19560 | loss 3.594383 (-0.25z)| norm 0.2344 (-1.31z)| lr 7.46e-04 | 5154.95 ms | 26.2% bf16 MFU | 123462 tok/s step 4112/19560 | loss 3.655509 (+1.44z)| norm 0.3144 (+1.96z)| lr 7.46e-04 | 4558.34 ms | 29.6% bf16 MFU | 123040 tok/s step 4113/19560 | loss 3.577498 (-0.73z)| norm 0.3540 (+3.39z)| lr 7.46e-04 | 4998.94 ms | 27.0% bf16 MFU | 122132 tok/s step 4114/19560 | loss 3.543572 (-1.64z)| norm 0.4149 (+5.10z)| lr 7.46e-04 | 4372.90 ms | 30.9% bf16 MFU | 122020 tok/s step 4115/19560 | loss 3.708464 (+2.79z)| norm 0.3835 (+3.75z)| lr 7.46e-04 | 4378.87 ms | 30.8% bf16 MFU | 121905 tok/s step 4116/19560 | loss 3.571319 (-0.86z)| norm 0.2532 (-0.51z)| lr 7.46e-04 | 4478.56 ms | 30.1% bf16 MFU | 121663 tok/s step 4117/19560 | loss 3.596264 (-0.20z)| norm 0.2885 (+0.63z)| lr 7.46e-04 | 4364.70 ms | 30.9% bf16 MFU | 121586 tok/s step 4118/19560 | loss 3.706290 (+2.65z)| norm 0.3383 (+2.20z)| lr 7.46e-04 | 4335.00 ms | 31.1% bf16 MFU | 121554 tok/s step 4119/19560 | loss 3.552339 (-1.37z)| norm 0.3242 (+1.71z)| lr 7.46e-04 | 4284.99 ms | 31.5% bf16 MFU | 121594 tok/s step 
4120/19560 | loss 3.581037 (-0.62z)| norm 0.2937 (+0.73z)| lr 7.46e-04 | 4201.07 ms | 32.1% bf16 MFU | 121754 tok/s step 4121/19560 | loss 3.576051 (-0.74z)| norm 0.2514 (-0.62z)| lr 7.46e-04 | 4542.76 ms | 29.7% bf16 MFU | 121437 tok/s step 4122/19560 | loss 3.556786 (-1.23z)| norm 0.2852 (+0.45z)| lr 7.45e-04 | 4174.33 ms | 32.3% bf16 MFU | 121645 tok/s step 4123/19560 | loss 3.557371 (-1.20z)| norm 0.2695 (-0.05z)| lr 7.45e-04 | 4186.00 ms | 32.3% bf16 MFU | 121825 tok/s step 4124/19560 | loss 3.595339 (-0.21z)| norm 0.2621 (-0.29z)| lr 7.45e-04 | 4214.22 ms | 32.0% bf16 MFU | 121955 tok/s step 4125/19560 | loss 3.549231 (-1.38z)| norm 0.2533 (-0.56z)| lr 7.45e-04 | 4344.74 ms | 31.1% bf16 MFU | 121890 tok/s step 4126/19560 | loss 3.605007 (+0.05z)| norm 0.3079 (+1.16z)| lr 7.45e-04 | 4456.26 ms | 30.3% bf16 MFU | 121679 tok/s step 4127/19560 | loss 3.517852 (-2.14z)| norm 0.3102 (+1.21z)| lr 7.45e-04 | 4180.82 ms | 32.3% bf16 MFU | 121865 tok/s step 4128/19560 | loss 3.574554 (-0.70z)| norm 0.2873 (+0.48z)| lr 7.45e-04 | 4256.55 ms | 31.7% bf16 MFU | 121930 tok/s step 4129/19560 | loss 3.567391 (-0.87z)| norm 0.2700 (-0.07z)| lr 7.45e-04 | 4206.97 ms | 32.1% bf16 MFU | 122065 tok/s step 4130/19560 | loss 3.580682 (-0.54z)| norm 0.2485 (-0.76z)| lr 7.45e-04 | 4196.60 ms | 32.2% bf16 MFU | 122208 tok/s step 4131/19560 | loss 3.644619 (+1.07z)| norm 0.2502 (-0.71z)| lr 7.45e-04 | 4177.54 ms | 32.3% bf16 MFU | 122373 tok/s step 4132/19560 | loss 3.615674 (+0.33z)| norm 0.2573 (-0.49z)| lr 7.45e-04 | 4171.08 ms | 32.4% bf16 MFU | 122539 tok/s step 4133/19560 | loss 3.623074 (+0.52z)| norm 0.2312 (-1.32z)| lr 7.45e-04 | 4167.48 ms | 32.4% bf16 MFU | 122702 tok/s step 4134/19560 | loss 3.582173 (-0.50z)| norm 0.2333 (-1.25z)| lr 7.45e-04 | 4240.31 ms | 31.8% bf16 MFU | 122749 tok/s step 4135/19560 | loss 3.529016 (-1.84z)| norm 0.2561 (-0.53z)| lr 7.45e-04 | 4252.90 ms | 31.7% bf16 MFU | 122776 tok/s step 4136/19560 | loss 3.627454 (+0.64z)| norm 0.2395 (-1.05z)| lr 
7.45e-04 | 4239.45 ms | 31.8% bf16 MFU | 122820 tok/s step 4137/19560 | loss 3.613200 (+0.28z)| norm 0.2415 (-0.97z)| lr 7.45e-04 | 4165.97 ms | 32.4% bf16 MFU | 122972 tok/s step 4138/19560 | loss 3.630718 (+0.74z)| norm 0.2517 (-0.64z)| lr 7.45e-04 | 4174.67 ms | 32.3% bf16 MFU | 123103 tok/s step 4139/19560 | loss 3.523229 (-1.95z)| norm 0.2339 (-1.20z)| lr 7.45e-04 | 4172.27 ms | 32.4% bf16 MFU | 123231 tok/s step 4140/19560 | loss 3.572340 (-0.71z)| norm 0.2391 (-1.02z)| lr 7.45e-04 | 4177.21 ms | 32.3% bf16 MFU | 123345 tok/s step 4141/19560 | loss 3.573456 (-0.67z)| norm 0.2359 (-1.11z)| lr 7.45e-04 | 7282.03 ms | 18.5% bf16 MFU | 120777 tok/s step 4142/19560 | loss 3.601851 (+0.05z)| norm 0.2676 (-0.09z)| lr 7.45e-04 | 4158.23 ms | 32.5% bf16 MFU | 121043 tok/s step 4143/19560 | loss 3.585844 (-0.35z)| norm 0.2830 (+0.41z)| lr 7.45e-04 | 4177.92 ms | 32.3% bf16 MFU | 121265 tok/s step 4144/19560 | loss 3.578393 (-0.53z)| norm 0.2889 (+0.60z)| lr 7.45e-04 | 4169.03 ms | 32.4% bf16 MFU | 121490 tok/s step 4145/19560 | loss 3.588867 (-0.28z)| norm 0.2969 (+0.87z)| lr 7.45e-04 | 4168.68 ms | 32.4% bf16 MFU | 121704 tok/s step 4146/19560 | loss 3.593872 (-0.16z)| norm 0.2834 (+0.43z)| lr 7.45e-04 | 4158.59 ms | 32.5% bf16 MFU | 121922 tok/s step 4147/19560 | loss 3.591853 (-0.21z)| norm 0.2562 (-0.45z)| lr 7.45e-04 | 4162.32 ms | 32.4% bf16 MFU | 122124 tok/s step 4148/19560 | loss 3.562413 (-0.94z)| norm 0.2322 (-1.22z)| lr 7.45e-04 | 4163.55 ms | 32.4% bf16 MFU | 122314 tok/s step 4149/19560 | loss 3.628177 (+0.71z)| norm 0.2193 (-1.60z)| lr 7.45e-04 | 4177.10 ms | 32.3% bf16 MFU | 122474 tok/s step 4150/19560 | loss 3.562580 (-0.93z)| norm 0.2503 (-0.60z)| lr 7.45e-04 | 4171.26 ms | 32.4% bf16 MFU | 122635 tok/s step 4151/19560 | loss 3.638052 (+0.98z)| norm 0.2910 (+0.70z)| lr 7.44e-04 | 4166.41 ms | 32.4% bf16 MFU | 122795 tok/s step 4152/19560 | loss 3.577267 (-0.56z)| norm 0.2404 (-0.90z)| lr 7.44e-04 | 4180.04 ms | 32.3% bf16 MFU | 122926 tok/s step 
4153/19560 | loss 3.576934 (-0.56z)| norm 0.2491 (-0.62z)| lr 7.44e-04 | 4169.54 ms | 32.4% bf16 MFU | 123067 tok/s step 4154/19560 | loss 3.563659 (-0.89z)| norm 0.2433 (-0.80z)| lr 7.44e-04 | 4162.54 ms | 32.4% bf16 MFU | 123212 tok/s step 4155/19560 | loss 3.588936 (-0.26z)| norm 0.2928 (+0.76z)| lr 7.44e-04 | 4174.72 ms | 32.3% bf16 MFU | 123330 tok/s step 4156/19560 | loss 3.638100 (+1.01z)| norm 0.3016 (+1.03z)| lr 7.44e-04 | 4166.53 ms | 32.4% bf16 MFU | 123455 tok/s step 4157/19560 | loss 3.601636 (+0.04z)| norm 0.2704 (+0.04z)| lr 7.44e-04 | 4169.31 ms | 32.4% bf16 MFU | 123570 tok/s step 4158/19560 | loss 3.626348 (+0.69z)| norm 0.3320 (+1.97z)| lr 7.44e-04 | 4165.90 ms | 32.4% bf16 MFU | 123684 tok/s step 4159/19560 | loss 3.572884 (-0.71z)| norm 0.2897 (+0.65z)| lr 7.44e-04 | 4178.53 ms | 32.3% bf16 MFU | 123774 tok/s step 4160/19560 | loss 3.631652 (+0.83z)| norm 0.3246 (+1.74z)| lr 7.44e-04 | 4206.68 ms | 32.1% bf16 MFU | 123817 tok/s step 4161/19560 | loss 3.560889 (-1.03z)| norm 0.3317 (+1.91z)| lr 7.44e-04 | 4187.96 ms | 32.2% bf16 MFU | 123885 tok/s step 4162/19560 | loss 3.595507 (-0.11z)| norm 0.2911 (+0.65z)| lr 7.44e-04 | 4167.65 ms | 32.4% bf16 MFU | 123981 tok/s step 4163/19560 | loss 3.551936 (-1.24z)| norm 0.2715 (+0.04z)| lr 7.44e-04 | 4185.25 ms | 32.3% bf16 MFU | 124045 tok/s step 4164/19560 | loss 3.560471 (-1.01z)| norm 0.3004 (+0.92z)| lr 7.44e-04 | 4261.40 ms | 31.7% bf16 MFU | 123995 tok/s step 4165/19560 | loss 3.583349 (-0.40z)| norm 0.3343 (+1.92z)| lr 7.44e-04 | 4174.04 ms | 32.3% bf16 MFU | 124075 tok/s step 4166/19560 | loss 3.583013 (-0.42z)| norm 0.2871 (+0.48z)| lr 7.44e-04 | 4449.37 ms | 30.3% bf16 MFU | 123763 tok/s step 4167/19560 | loss 3.649158 (+1.32z)| norm 0.2949 (+0.70z)| lr 7.44e-04 | 4179.40 ms | 32.3% bf16 MFU | 123847 tok/s step 4168/19560 | loss 3.552995 (-1.20z)| norm 0.2881 (+0.49z)| lr 7.44e-04 | 4239.48 ms | 31.8% bf16 MFU | 123838 tok/s step 4169/19560 | loss 3.570120 (-0.74z)| norm 0.2590 (-0.40z)| lr 
7.44e-04 | 4239.16 ms | 31.9% bf16 MFU | 123830 tok/s step 4170/19560 | loss 3.536287 (-1.64z)| norm 0.3230 (+1.53z)| lr 7.44e-04 | 4161.36 ms | 32.4% bf16 MFU | 123938 tok/s step 4171/19560 | loss 3.591624 (-0.16z)| norm 0.2788 (+0.17z)| lr 7.44e-04 | 4173.46 ms | 32.4% bf16 MFU | 124023 tok/s step 4172/19560 | loss 3.566108 (-0.83z)| norm 0.3027 (+0.91z)| lr 7.44e-04 | 4169.52 ms | 32.4% bf16 MFU | 124109 tok/s step 4173/19560 | loss 3.567365 (-0.79z)| norm 0.2985 (+0.78z)| lr 7.44e-04 | 4397.35 ms | 30.7% bf16 MFU | 123865 tok/s step 4174/19560 | loss 3.529076 (-1.78z)| norm 0.2359 (-1.13z)| lr 7.44e-04 | 4175.66 ms | 32.3% bf16 MFU | 123949 tok/s step 4175/19560 | loss 3.622994 (+0.74z)| norm 0.3009 (+0.85z)| lr 7.44e-04 | 4174.35 ms | 32.3% bf16 MFU | 124032 tok/s step 4176/19560 | loss 3.561647 (-0.90z)| norm 0.2447 (-0.86z)| lr 7.44e-04 | 4172.24 ms | 32.4% bf16 MFU | 124113 tok/s step 4177/19560 | loss 3.674248 (+2.07z)| norm 0.2648 (-0.25z)| lr 7.44e-04 | 4175.63 ms | 32.3% bf16 MFU | 124185 tok/s step 4178/19560 | loss 3.647911 (+1.36z)| norm 0.2696 (-0.10z)| lr 7.44e-04 | 4180.07 ms | 32.3% bf16 MFU | 124247 tok/s step 4179/19560 | loss 3.624861 (+0.73z)| norm 0.2935 (+0.62z)| lr 7.44e-04 | 4174.12 ms | 32.3% bf16 MFU | 124315 tok/s step 4180/19560 | loss 3.557497 (-1.05z)| norm 0.2719 (-0.04z)| lr 7.43e-04 | 4176.47 ms | 32.3% bf16 MFU | 124376 tok/s step 4181/19560 | loss 3.584115 (-0.34z)| norm 0.2437 (-0.90z)| lr 7.43e-04 | 4171.13 ms | 32.4% bf16 MFU | 124442 tok/s step 4182/19560 | loss 3.556962 (-1.04z)| norm 0.2354 (-1.13z)| lr 7.43e-04 | 4182.55 ms | 32.3% bf16 MFU | 124488 tok/s step 4183/19560 | loss 3.578861 (-0.46z)| norm 0.2403 (-0.97z)| lr 7.43e-04 | 4184.93 ms | 32.3% bf16 MFU | 124527 tok/s step 4184/19560 | loss 3.614334 (+0.48z)| norm 0.2484 (-0.71z)| lr 7.43e-04 | 4173.33 ms | 32.4% bf16 MFU | 124582 tok/s step 4185/19560 | loss 3.616935 (+0.54z)| norm 0.2389 (-0.98z)| lr 7.43e-04 | 4173.81 ms | 32.3% bf16 MFU | 124634 tok/s step 
4186/19560 | loss 3.580979 (-0.40z)| norm 0.2495 (-0.66z)| lr 7.43e-04 | 4184.00 ms | 32.3% bf16 MFU | 124668 tok/s step 4187/19560 | loss 3.596041 (-0.00z)| norm 0.2218 (-1.47z)| lr 7.43e-04 | 4169.72 ms | 32.4% bf16 MFU | 124721 tok/s step 4188/19560 | loss 3.503283 (-2.42z)| norm 0.2413 (-0.87z)| lr 7.43e-04 | 4174.43 ms | 32.3% bf16 MFU | 124765 tok/s step 4189/19560 | loss 3.577128 (-0.46z)| norm 0.2563 (-0.42z)| lr 7.43e-04 | 4178.93 ms | 32.3% bf16 MFU | 124799 tok/s step 4190/19560 | loss 3.580657 (-0.35z)| norm 0.2606 (-0.29z)| lr 7.43e-04 | 4169.70 ms | 32.4% bf16 MFU | 124846 tok/s step 4191/19560 | loss 3.615383 (+0.59z)| norm 0.2441 (-0.78z)| lr 7.43e-04 | 4211.59 ms | 32.1% bf16 MFU | 124828 tok/s step 4192/19560 | loss 3.590692 (-0.09z)| norm 0.2204 (-1.47z)| lr 7.43e-04 | 4173.12 ms | 32.4% bf16 MFU | 124869 tok/s step 4193/19560 | loss 3.532739 (-1.64z)| norm 0.2545 (-0.46z)| lr 7.43e-04 | 4204.41 ms | 32.1% bf16 MFU | 124860 tok/s step 4194/19560 | loss 3.572127 (-0.56z)| norm 0.2593 (-0.32z)| lr 7.43e-04 | 8927.48 ms | 15.1% bf16 MFU | 121554 tok/s step 4195/19560 | loss 3.566747 (-0.70z)| norm 0.2449 (-0.75z)| lr 7.43e-04 | 4155.43 ms | 32.5% bf16 MFU | 121784 tok/s step 4196/19560 | loss 3.575861 (-0.45z)| norm 0.2507 (-0.57z)| lr 7.43e-04 | 4208.53 ms | 32.1% bf16 MFU | 121924 tok/s step 4197/19560 | loss 3.632840 (+1.09z)| norm 0.2402 (-0.88z)| lr 7.43e-04 | 4197.31 ms | 32.2% bf16 MFU | 122073 tok/s step 4198/19560 | loss 3.563418 (-0.79z)| norm 0.2727 (+0.08z)| lr 7.43e-04 | 4194.74 ms | 32.2% bf16 MFU | 122219 tok/s step 4199/19560 | loss 3.512558 (-2.13z)| norm 0.3112 (+1.21z)| lr 7.43e-04 | 4169.70 ms | 32.4% bf16 MFU | 122395 tok/s step 4200/19560 | loss 3.579076 (-0.34z)| norm 0.2728 (+0.07z)| lr 7.43e-04 | 4164.30 ms | 32.4% bf16 MFU | 122570 tok/s step 4201/19560 | loss 3.569905 (-0.57z)| norm 0.3039 (+0.98z)| lr 7.43e-04 | 4157.97 ms | 32.5% bf16 MFU | 122746 tok/s step 4202/19560 | loss 3.619965 (+0.79z)| norm 0.2971 (+0.77z)| lr 
7.43e-04 | 4162.41 ms | 32.4% bf16 MFU | 122907 tok/s step 4203/19560 | loss 3.604085 (+0.35z)| norm 0.2863 (+0.44z)| lr 7.43e-04 | 4189.97 ms | 32.2% bf16 MFU | 123018 tok/s step 4204/19560 | loss 3.576706 (-0.39z)| norm 0.2607 (-0.33z)| lr 7.43e-04 | 4164.85 ms | 32.4% bf16 MFU | 123161 tok/s step 4205/19560 | loss 3.522167 (-1.84z)| norm 0.2531 (-0.54z)| lr 7.43e-04 | 4156.95 ms | 32.5% bf16 MFU | 123309 tok/s step 4206/19560 | loss 3.535498 (-1.45z)| norm 0.2728 (+0.03z)| lr 7.43e-04 | 4167.43 ms | 32.4% bf16 MFU | 123434 tok/s step 4207/19560 | loss 3.560174 (-0.78z)| norm 0.2466 (-0.74z)| lr 7.43e-04 | 4155.90 ms | 32.5% bf16 MFU | 123570 tok/s step 4208/19560 | loss 3.562259 (-0.72z)| norm 0.2559 (-0.45z)| lr 7.42e-04 | 4193.39 ms | 32.2% bf16 MFU | 123643 tok/s step 4209/19560 | loss 3.555215 (-0.90z)| norm 0.2307 (-1.20z)| lr 7.42e-04 | 4344.29 ms | 31.1% bf16 MFU | 123495 tok/s step 4210/19560 | loss 3.621987 (+0.89z)| norm 0.2415 (-0.86z)| lr 7.42e-04 | 4160.13 ms | 32.5% bf16 MFU | 123622 tok/s step 4211/19560 | loss 3.574991 (-0.36z)| norm 0.2539 (-0.48z)| lr 7.42e-04 | 4172.03 ms | 32.4% bf16 MFU | 123724 tok/s step 4212/19560 | loss 3.591263 (+0.08z)| norm 0.2631 (-0.20z)| lr 7.42e-04 | 4156.29 ms | 32.5% bf16 MFU | 123845 tok/s step 4213/19560 | loss 3.557960 (-0.81z)| norm 0.2457 (-0.73z)| lr 7.42e-04 | 4162.27 ms | 32.4% bf16 MFU | 123951 tok/s step 4214/19560 | loss 3.581863 (-0.16z)| norm 0.2450 (-0.74z)| lr 7.42e-04 | 4175.76 ms | 32.3% bf16 MFU | 124031 tok/s step 4215/19560 | loss 3.690178 (+2.77z)| norm 0.2709 (+0.04z)| lr 7.42e-04 | 4235.46 ms | 31.9% bf16 MFU | 124019 tok/s step 4216/19560 | loss 3.623865 (+0.96z)| norm 0.3173 (+1.43z)| lr 7.42e-04 | 4167.22 ms | 32.4% bf16 MFU | 124108 tok/s step 4217/19560 | loss 3.544439 (-1.17z)| norm 0.3424 (+2.14z)| lr 7.42e-04 | 4167.99 ms | 32.4% bf16 MFU | 124193 tok/s step 4218/19560 | loss 3.580567 (-0.20z)| norm 0.2900 (+0.56z)| lr 7.42e-04 | 4285.80 ms | 31.5% bf16 MFU | 124099 tok/s step 
4219/19560 | loss 3.605784 (+0.48z)| norm 0.2514 (-0.58z)| lr 7.42e-04 | 4170.61 ms | 32.4% bf16 MFU | 124180 tok/s step 4220/19560 | loss 3.571498 (-0.44z)| norm 0.2546 (-0.49z)| lr 7.42e-04 | 4165.83 ms | 32.4% bf16 MFU | 124264 tok/s step 4221/19560 | loss 3.645137 (+1.53z)| norm 0.4321 (+4.38z)| lr 7.42e-04 | 4164.27 ms | 32.4% bf16 MFU | 124346 tok/s step 4222/19560 | loss 3.619225 (+0.83z)| norm 0.2859 (+0.36z)| lr 7.42e-04 | 4163.44 ms | 32.4% bf16 MFU | 124425 tok/s step 4223/19560 | loss 3.611476 (+0.63z)| norm 0.2541 (-0.51z)| lr 7.42e-04 | 4174.68 ms | 32.3% bf16 MFU | 124483 tok/s step 4224/19560 | loss 3.559606 (-0.75z)| norm 0.2427 (-0.82z)| lr 7.42e-04 | 4162.42 ms | 32.4% bf16 MFU | 124557 tok/s step 4225/19560 | loss 3.589097 (+0.05z)| norm 0.2723 (-0.01z)| lr 7.42e-04 | 4278.50 ms | 31.6% bf16 MFU | 124456 tok/s step 4226/19560 | loss 3.569024 (-0.49z)| norm 0.2591 (-0.38z)| lr 7.42e-04 | 4166.77 ms | 32.4% bf16 MFU | 124524 tok/s step 4227/19560 | loss 3.609272 (+0.59z)| norm 0.2664 (-0.17z)| lr 7.42e-04 | 4171.99 ms | 32.4% bf16 MFU | 124581 tok/s step 4228/19560 | loss 3.517201 (-1.87z)| norm 0.2600 (-0.35z)| lr 7.42e-04 | 4169.34 ms | 32.4% bf16 MFU | 124640 tok/s step 4229/19560 | loss 3.578122 (-0.24z)| norm 0.2712 (-0.04z)| lr 7.42e-04 | 4180.02 ms | 32.3% bf16 MFU | 124679 tok/s step 4230/19560 | loss 3.615962 (+0.76z)| norm 0.3175 (+1.24z)| lr 7.42e-04 | 4162.88 ms | 32.4% bf16 MFU | 124742 tok/s step 4231/19560 | loss 3.726933 (+3.51z)| norm 0.3335 (+1.66z)| lr 7.42e-04 | 4162.75 ms | 32.4% bf16 MFU | 124803 tok/s step 4232/19560 | loss 3.538802 (-1.24z)| norm 0.3993 (+3.29z)| lr 7.42e-04 | 4155.22 ms | 32.5% bf16 MFU | 124871 tok/s step 4233/19560 | loss 3.596641 (+0.23z)| norm 0.2810 (+0.18z)| lr 7.42e-04 | 4179.06 ms | 32.3% bf16 MFU | 124901 tok/s step 4234/19560 | loss 3.573438 (-0.35z)| norm 0.3047 (+0.80z)| lr 7.42e-04 | 4181.98 ms | 32.3% bf16 MFU | 124924 tok/s step 4235/19560 | loss 3.568249 (-0.48z)| norm 0.2861 (+0.30z)| lr 
7.42e-04 | 4206.62 ms | 32.1% bf16 MFU | 124909 tok/s step 4236/19560 | loss 3.576936 (-0.26z)| norm 0.2953 (+0.53z)| lr 7.42e-04 | 4188.52 ms | 32.2% bf16 MFU | 124923 tok/s step 4237/19560 | loss 3.640502 (+1.36z)| norm 0.2696 (-0.15z)| lr 7.41e-04 | 4157.86 ms | 32.5% bf16 MFU | 124981 tok/s step 4238/19560 | loss 3.699982 (+2.77z)| norm 0.3071 (+0.83z)| lr 7.41e-04 | 4173.03 ms | 32.4% bf16 MFU | 125014 tok/s step 4239/19560 | loss 3.598489 (+0.25z)| norm 0.2750 (-0.03z)| lr 7.41e-04 | 4154.39 ms | 32.5% bf16 MFU | 125073 tok/s step 4240/19560 | loss 3.546077 (-1.04z)| norm 0.2627 (-0.34z)| lr 7.41e-04 | 4171.10 ms | 32.4% bf16 MFU | 125104 tok/s step 4241/19560 | loss 3.573596 (-0.35z)| norm 0.2809 (+0.16z)| lr 7.41e-04 | 4160.17 ms | 32.5% bf16 MFU | 125151 tok/s step 4242/19560 | loss 3.560099 (-0.69z)| norm 0.2781 (+0.12z)| lr 7.41e-04 | 4162.44 ms | 32.4% bf16 MFU | 125191 tok/s step 4243/19560 | loss 3.596440 (+0.25z)| norm 0.2618 (-0.34z)| lr 7.41e-04 | 4194.00 ms | 32.2% bf16 MFU | 125182 tok/s step 4244/19560 | loss 3.679663 (+2.34z)| norm 0.2691 (-0.12z)| lr 7.41e-04 | 4169.99 ms | 32.4% bf16 MFU | 125209 tok/s step 4245/19560 | loss 3.625973 (+0.97z)| norm 0.2462 (-0.80z)| lr 7.41e-04 | 4169.09 ms | 32.4% bf16 MFU | 125236 tok/s step 4246/19560 | loss 3.599411 (+0.32z)| norm 0.2601 (-0.37z)| lr 7.41e-04 | 4166.33 ms | 32.4% bf16 MFU | 125267 tok/s step 4247/19560 | loss 3.578923 (-0.22z)| norm 0.2610 (-0.33z)| lr 7.41e-04 | 4189.16 ms | 32.2% bf16 MFU | 125261 tok/s step 4248/19560 | loss 3.583409 (-0.10z)| norm 0.2680 (-0.11z)| lr 7.41e-04 | 4165.20 ms | 32.4% bf16 MFU | 125292 tok/s step 4249/19560 | loss 3.613229 (+0.68z)| norm 0.2592 (-0.38z)| lr 7.41e-04 | 4163.85 ms | 32.4% bf16 MFU | 125323 tok/s step 4250/19560 | loss 3.521714 (-1.71z)| norm 0.2524 (-0.58z)| lr 7.41e-04 | 4173.45 ms | 32.4% bf16 MFU | 125338 tok/s val loss 3.588039 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 
30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating 
HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2753/10042 = 0.274149 step 4251/19560 | loss 3.481844 (-2.66z)| norm 0.2531 (-0.56z)| lr 7.41e-04 | 4579.69 ms | 29.5% bf16 MFU | 124795 tok/s step 4252/19560 | loss 3.606588 
(+0.50z)| norm 0.2413 (-0.91z)| lr 7.41e-04 | 4237.32 ms | 31.9% bf16 MFU | 124742 tok/s step 4253/19560 | loss 3.549381 (-0.95z)| norm 0.2725 (+0.04z)| lr 7.41e-04 | 4251.68 ms | 31.8% bf16 MFU | 124670 tok/s step 4254/19560 | loss 3.550546 (-0.91z)| norm 0.2545 (-0.50z)| lr 7.41e-04 | 4468.62 ms | 30.2% bf16 MFU | 124303 tok/s step 4255/19560 | loss 3.608225 (+0.54z)| norm 0.2544 (-0.49z)| lr 7.41e-04 | 4206.40 ms | 32.1% bf16 MFU | 124320 tok/s step 4256/19560 | loss 3.551460 (-0.90z)| norm 0.2413 (-0.89z)| lr 7.41e-04 | 4160.05 ms | 32.5% bf16 MFU | 124405 tok/s step 4257/19560 | loss 3.606353 (+0.49z)| norm 0.2373 (-1.00z)| lr 7.41e-04 | 4162.63 ms | 32.4% bf16 MFU | 124483 tok/s step 4258/19560 | loss 3.599011 (+0.30z)| norm 0.2318 (-1.16z)| lr 7.41e-04 | 4197.06 ms | 32.2% bf16 MFU | 124504 tok/s step 4259/19560 | loss 3.669606 (+2.08z)| norm 0.2282 (-1.26z)| lr 7.41e-04 | 4235.83 ms | 31.9% bf16 MFU | 124468 tok/s step 4260/19560 | loss 3.608754 (+0.54z)| norm 0.2406 (-0.87z)| lr 7.41e-04 | 4160.01 ms | 32.5% bf16 MFU | 124546 tok/s step 4261/19560 | loss 3.643765 (+1.41z)| norm 0.2866 (+0.51z)| lr 7.41e-04 | 4157.47 ms | 32.5% bf16 MFU | 124624 tok/s step 4262/19560 | loss 3.572584 (-0.38z)| norm 0.2887 (+0.57z)| lr 7.41e-04 | 4169.12 ms | 32.4% bf16 MFU | 124681 tok/s step 4263/19560 | loss 3.617223 (+0.73z)| norm 0.2985 (+0.86z)| lr 7.41e-04 | 4270.41 ms | 31.6% bf16 MFU | 124585 tok/s step 4264/19560 | loss 3.584581 (-0.08z)| norm 0.2914 (+0.63z)| lr 7.41e-04 | 4164.81 ms | 32.4% bf16 MFU | 124650 tok/s step 4265/19560 | loss 3.588544 (+0.02z)| norm 0.3463 (+2.24z)| lr 7.40e-04 | 4168.66 ms | 32.4% bf16 MFU | 124706 tok/s step 4266/19560 | loss 3.623177 (+0.91z)| norm 0.3166 (+1.33z)| lr 7.40e-04 | 4166.49 ms | 32.4% bf16 MFU | 124763 tok/s step 4267/19560 | loss 3.597201 (+0.23z)| norm 0.2985 (+0.77z)| lr 7.40e-04 | 4156.98 ms | 32.5% bf16 MFU | 124831 tok/s step 4268/19560 | loss 3.573245 (-0.39z)| norm 0.2565 (-0.49z)| lr 7.40e-04 | 4156.80 ms | 
32.5% bf16 MFU | 124895 tok/s step 4269/19560 | loss 3.603335 (+0.38z)| norm 0.2530 (-0.61z)| lr 7.40e-04 | 4156.92 ms | 32.5% bf16 MFU | 124957 tok/s step 4270/19560 | loss 3.574060 (-0.37z)| norm 0.2514 (-0.65z)| lr 7.40e-04 | 4204.90 ms | 32.1% bf16 MFU | 124943 tok/s step 4271/19560 | loss 3.648631 (+1.53z)| norm 0.2561 (-0.50z)| lr 7.40e-04 | 4157.34 ms | 32.5% bf16 MFU | 125002 tok/s step 4272/19560 | loss 3.555904 (-0.83z)| norm 0.2379 (-1.04z)| lr 7.40e-04 | 4162.83 ms | 32.4% bf16 MFU | 125049 tok/s step 4273/19560 | loss 3.583732 (-0.12z)| norm 0.2498 (-0.67z)| lr 7.40e-04 | 4160.53 ms | 32.5% bf16 MFU | 125097 tok/s step 4274/19560 | loss 3.605901 (+0.44z)| norm 0.2510 (-0.62z)| lr 7.40e-04 | 4163.36 ms | 32.4% bf16 MFU | 125139 tok/s step 4275/19560 | loss 3.584893 (-0.09z)| norm 0.2604 (-0.34z)| lr 7.40e-04 | 4163.12 ms | 32.4% bf16 MFU | 125179 tok/s step 4276/19560 | loss 3.595963 (+0.18z)| norm 0.2522 (-0.59z)| lr 7.40e-04 | 4159.47 ms | 32.5% bf16 MFU | 125222 tok/s step 4277/19560 | loss 3.642702 (+1.37z)| norm 0.6209 (+7.70z)| lr 7.40e-04 | 4160.42 ms | 32.5% bf16 MFU | 125262 tok/s step 4278/19560 | loss 3.594850 (+0.14z)| norm 0.3209 (+1.01z)| lr 7.40e-04 | 4149.74 ms | 32.5% bf16 MFU | 125316 tok/s step 4279/19560 | loss 3.631929 (+1.09z)| norm 0.3170 (+0.91z)| lr 7.40e-04 | 4187.54 ms | 32.2% bf16 MFU | 125310 tok/s step 4280/19560 | loss 3.588121 (-0.03z)| norm 0.2817 (+0.12z)| lr 7.40e-04 | 4187.02 ms | 32.2% bf16 MFU | 125306 tok/s step 4281/19560 | loss 3.637371 (+1.21z)| norm 0.3092 (+0.72z)| lr 7.40e-04 | 4155.97 ms | 32.5% bf16 MFU | 125348 tok/s step 4282/19560 | loss 3.560405 (-0.74z)| norm 0.3039 (+0.60z)| lr 7.40e-04 | 4171.12 ms | 32.4% bf16 MFU | 125365 tok/s step 4283/19560 | loss 3.602158 (+0.31z)| norm 0.2759 (-0.02z)| lr 7.40e-04 | 4168.13 ms | 32.4% bf16 MFU | 125386 tok/s step 4284/19560 | loss 3.648692 (+1.49z)| norm 0.3299 (+1.17z)| lr 7.40e-04 | 4166.12 ms | 32.4% bf16 MFU | 125409 tok/s step 4285/19560 | loss 3.565123 
(-0.62z)| norm 0.2779 (+0.02z)| lr 7.40e-04 | 4214.59 ms | 32.0% bf16 MFU | 125359 tok/s step 4286/19560 | loss 3.583293 (-0.15z)| norm 0.2650 (-0.26z)| lr 7.40e-04 | 4154.81 ms | 32.5% bf16 MFU | 125400 tok/s step 4287/19560 | loss 3.459881 (-3.13z)| norm 0.2809 (+0.10z)| lr 7.40e-04 | 4171.14 ms | 32.4% bf16 MFU | 125415 tok/s step 4288/19560 | loss 3.605461 (+0.42z)| norm 0.2579 (-0.41z)| lr 7.40e-04 | 4192.74 ms | 32.2% bf16 MFU | 125396 tok/s step 4289/19560 | loss 3.602563 (+0.34z)| norm 0.2736 (-0.04z)| lr 7.40e-04 | 4231.21 ms | 31.9% bf16 MFU | 125322 tok/s step 4290/19560 | loss 3.615095 (+0.65z)| norm 0.2685 (-0.16z)| lr 7.40e-04 | 4312.52 ms | 31.3% bf16 MFU | 125135 tok/s step 4291/19560 | loss 3.581379 (-0.18z)| norm 0.2665 (-0.20z)| lr 7.40e-04 | 4207.19 ms | 32.1% bf16 MFU | 125109 tok/s step 4292/19560 | loss 3.622432 (+0.81z)| norm 0.2777 (+0.06z)| lr 7.40e-04 | 4155.17 ms | 32.5% bf16 MFU | 125162 tok/s step 4293/19560 | loss 3.608127 (+0.45z)| norm 0.2431 (-0.71z)| lr 7.39e-04 | 4163.79 ms | 32.4% bf16 MFU | 125200 tok/s step 4294/19560 | loss 3.554379 (-0.85z)| norm 0.2533 (-0.47z)| lr 7.39e-04 | 4221.39 ms | 32.0% bf16 MFU | 125150 tok/s step 4295/19560 | loss 3.563325 (-0.62z)| norm 0.2397 (-0.77z)| lr 7.39e-04 | 4166.13 ms | 32.4% bf16 MFU | 125185 tok/s step 4296/19560 | loss 3.592946 (+0.10z)| norm 0.2349 (-0.86z)| lr 7.39e-04 | 4153.02 ms | 32.5% bf16 MFU | 125237 tok/s step 4297/19560 | loss 3.576494 (-0.31z)| norm 0.2491 (-0.54z)| lr 7.39e-04 | 4163.71 ms | 32.4% bf16 MFU | 125272 tok/s step 4298/19560 | loss 3.626395 (+0.91z)| norm 0.2757 (+0.06z)| lr 7.39e-04 | 4178.43 ms | 32.3% bf16 MFU | 125282 tok/s step 4299/19560 | loss 3.591916 (+0.05z)| norm 0.2761 (+0.07z)| lr 7.39e-04 | 4163.64 ms | 32.4% bf16 MFU | 125314 tok/s step 4300/19560 | loss 3.572950 (-0.42z)| norm 0.2469 (-0.58z)| lr 7.39e-04 | 4160.51 ms | 32.5% bf16 MFU | 125349 tok/s step 4301/19560 | loss 3.606226 (+0.40z)| norm 0.2513 (-0.47z)| lr 7.39e-04 | 4422.79 ms | 
30.5% bf16 MFU | 125008 tok/s step 4302/19560 | loss 3.638143 (+1.17z)| norm 0.2693 (-0.07z)| lr 7.39e-04 | 4487.84 ms | 30.1% bf16 MFU | 124599 tok/s step 4303/19560 | loss 3.645152 (+1.34z)| norm 0.2674 (-0.11z)| lr 7.39e-04 | 4479.62 ms | 30.1% bf16 MFU | 124221 tok/s step 4304/19560 | loss 3.536357 (-1.34z)| norm 0.2577 (-0.33z)| lr 7.39e-04 | 4267.05 ms | 31.6% bf16 MFU | 124154 tok/s step 4305/19560 | loss 3.643728 (+1.32z)| norm 0.2297 (-0.96z)| lr 7.39e-04 | 4342.58 ms | 31.1% bf16 MFU | 123982 tok/s step 4306/19560 | loss 3.569611 (-0.51z)| norm 0.2545 (-0.39z)| lr 7.39e-04 | 4246.72 ms | 31.8% bf16 MFU | 123956 tok/s step 4307/19560 | loss 3.584892 (-0.12z)| norm 0.2297 (-0.94z)| lr 7.39e-04 | 4148.21 ms | 32.5% bf16 MFU | 124078 tok/s step 4308/19560 | loss 3.529344 (-1.50z)| norm 0.2429 (-0.64z)| lr 7.39e-04 | 4162.70 ms | 32.4% bf16 MFU | 124171 tok/s step 4309/19560 | loss 3.617680 (+0.70z)| norm 0.2502 (-0.47z)| lr 7.39e-04 | 4402.40 ms | 30.7% bf16 MFU | 123917 tok/s step 4310/19560 | loss 3.550063 (-0.99z)| norm 0.2813 (+0.22z)| lr 7.39e-04 | 4146.36 ms | 32.6% bf16 MFU | 124044 tok/s step 4311/19560 | loss 3.543998 (-1.13z)| norm 0.2568 (-0.34z)| lr 7.39e-04 | 4269.33 ms | 31.6% bf16 MFU | 123982 tok/s step 4312/19560 | loss 3.582763 (-0.16z)| norm 0.2382 (-0.76z)| lr 7.39e-04 | 4144.97 ms | 32.6% bf16 MFU | 124107 tok/s step 4313/19560 | loss 3.526100 (-1.54z)| norm 0.2674 (-0.10z)| lr 7.39e-04 | 4394.74 ms | 30.7% bf16 MFU | 123867 tok/s step 4314/19560 | loss 3.569715 (-0.46z)| norm 0.2450 (-0.61z)| lr 7.39e-04 | 4290.10 ms | 31.5% bf16 MFU | 123784 tok/s step 4315/19560 | loss 3.576759 (-0.28z)| norm 0.2548 (-0.39z)| lr 7.39e-04 | 4183.38 ms | 32.3% bf16 MFU | 123861 tok/s step 4316/19560 | loss 3.599537 (+0.26z)| norm 0.2551 (-0.39z)| lr 7.39e-04 | 4200.08 ms | 32.1% bf16 MFU | 123909 tok/s step 4317/19560 | loss 3.591305 (+0.05z)| norm 0.2328 (-0.89z)| lr 7.39e-04 | 4194.63 ms | 32.2% bf16 MFU | 123963 tok/s step 4318/19560 | loss 3.631453 
(+1.05z)| norm 0.2690 (-0.07z)| lr 7.39e-04 | 4177.65 ms | 32.3% bf16 MFU | 124040 tok/s step 4319/19560 | loss 3.650069 (+1.49z)| norm 0.2820 (+0.22z)| lr 7.39e-04 | 4153.14 ms | 32.5% bf16 MFU | 124150 tok/s step 4320/19560 | loss 3.623228 (+0.82z)| norm 0.2643 (-0.19z)| lr 7.39e-04 | 4215.08 ms | 32.0% bf16 MFU | 124162 tok/s step 4321/19560 | loss 3.573232 (-0.43z)| norm 0.2420 (-0.70z)| lr 7.38e-04 | 4270.59 ms | 31.6% bf16 MFU | 124092 tok/s step 4322/19560 | loss 3.615756 (+0.62z)| norm 0.2395 (-0.75z)| lr 7.38e-04 | 4145.67 ms | 32.6% bf16 MFU | 124211 tok/s step 4323/19560 | loss 3.589881 (-0.03z)| norm 0.2517 (-0.47z)| lr 7.38e-04 | 4270.43 ms | 31.6% bf16 MFU | 124139 tok/s step 4324/19560 | loss 3.602527 (+0.28z)| norm 0.2511 (-0.49z)| lr 7.38e-04 | 4171.46 ms | 32.4% bf16 MFU | 124216 tok/s step 4325/19560 | loss 3.570774 (-0.50z)| norm 0.2548 (-0.41z)| lr 7.38e-04 | 4472.22 ms | 30.2% bf16 MFU | 123867 tok/s step 4326/19560 | loss 3.577123 (-0.34z)| norm 0.2820 (+0.21z)| lr 7.38e-04 | 4154.51 ms | 32.5% bf16 MFU | 123983 tok/s step 4327/19560 | loss 3.603200 (+0.30z)| norm 0.2652 (-0.16z)| lr 7.38e-04 | 4342.54 ms | 31.1% bf16 MFU | 123821 tok/s step 4328/19560 | loss 3.619606 (+0.71z)| norm 0.2274 (-1.02z)| lr 7.38e-04 | 4156.25 ms | 32.5% bf16 MFU | 123937 tok/s step 4329/19560 | loss 3.557980 (-0.85z)| norm 0.2301 (-0.94z)| lr 7.38e-04 | 4309.58 ms | 31.3% bf16 MFU | 123823 tok/s step 4330/19560 | loss 3.611991 (+0.52z)| norm 0.2478 (-0.53z)| lr 7.38e-04 | 4558.67 ms | 29.6% bf16 MFU | 123382 tok/s step 4331/19560 | loss 3.557583 (-0.85z)| norm 0.2577 (-0.30z)| lr 7.38e-04 | 4148.28 ms | 32.5% bf16 MFU | 123532 tok/s step 4332/19560 | loss 3.576045 (-0.38z)| norm 0.2585 (-0.28z)| lr 7.38e-04 | 4154.19 ms | 32.5% bf16 MFU | 123666 tok/s step 4333/19560 | loss 3.673610 (+2.04z)| norm 0.2559 (-0.34z)| lr 7.38e-04 | 4304.08 ms | 31.4% bf16 MFU | 123573 tok/s step 4334/19560 | loss 3.650091 (+1.43z)| norm 0.2903 (+0.44z)| lr 7.38e-04 | 4148.32 ms | 
32.5% bf16 MFU | 123714 tok/s step 4335/19560 | loss 3.599030 (+0.14z)| norm 0.2798 (+0.20z)| lr 7.38e-04 | 4265.01 ms | 31.7% bf16 MFU | 123675 tok/s step 4336/19560 | loss 3.567165 (-0.67z)| norm 0.2791 (+0.18z)| lr 7.38e-04 | 4242.83 ms | 31.8% bf16 MFU | 123670 tok/s step 4337/19560 | loss 3.575695 (-0.46z)| norm 0.3119 (+0.91z)| lr 7.38e-04 | 4191.61 ms | 32.2% bf16 MFU | 123740 tok/s step 4338/19560 | loss 3.585598 (-0.20z)| norm 0.3193 (+1.07z)| lr 7.38e-04 | 4218.04 ms | 32.0% bf16 MFU | 123768 tok/s step 4339/19560 | loss 3.559699 (-0.85z)| norm 0.3067 (+0.77z)| lr 7.38e-04 | 4277.13 ms | 31.6% bf16 MFU | 123708 tok/s step 4340/19560 | loss 3.606848 (+0.34z)| norm 0.2594 (-0.31z)| lr 7.38e-04 | 4150.53 ms | 32.5% bf16 MFU | 123839 tok/s step 4341/19560 | loss 3.677294 (+2.07z)| norm 0.2877 (+0.33z)| lr 7.38e-04 | 4152.92 ms | 32.5% bf16 MFU | 123959 tok/s step 4342/19560 | loss 3.578077 (-0.41z)| norm 0.2512 (-0.50z)| lr 7.38e-04 | 4157.71 ms | 32.5% bf16 MFU | 124066 tok/s step 4343/19560 | loss 3.629112 (+0.89z)| norm 0.2902 (+0.38z)| lr 7.38e-04 | 4159.74 ms | 32.5% bf16 MFU | 124165 tok/s step 4344/19560 | loss 3.619022 (+0.64z)| norm 0.3032 (+0.68z)| lr 7.38e-04 | 4471.90 ms | 30.2% bf16 MFU | 123819 tok/s step 4345/19560 | loss 3.607035 (+0.32z)| norm 0.3193 (+1.06z)| lr 7.38e-04 | 4280.36 ms | 31.5% bf16 MFU | 123752 tok/s step 4346/19560 | loss 3.535915 (-1.48z)| norm 0.2862 (+0.30z)| lr 7.38e-04 | 4223.06 ms | 32.0% bf16 MFU | 123772 tok/s step 4347/19560 | loss 3.585673 (-0.21z)| norm 0.2649 (-0.19z)| lr 7.38e-04 | 4152.00 ms | 32.5% bf16 MFU | 123897 tok/s step 4348/19560 | loss 3.589040 (-0.13z)| norm 0.2497 (-0.54z)| lr 7.38e-04 | 4159.04 ms | 32.5% bf16 MFU | 124005 tok/s step 4349/19560 | loss 3.559474 (-0.87z)| norm 0.2777 (+0.14z)| lr 7.37e-04 | 4160.76 ms | 32.5% bf16 MFU | 124105 tok/s step 4350/19560 | loss 3.682016 (+2.21z)| norm 0.2982 (+0.63z)| lr 7.37e-04 | 4356.38 ms | 31.0% bf16 MFU | 123918 tok/s step 4351/19560 | loss 3.706708 
(+2.73z)| norm 0.2591 (-0.32z)| lr 7.37e-04 | 4431.57 ms | 30.5% bf16 MFU | 123637 tok/s step 4352/19560 | loss 3.557688 (-0.90z)| norm 0.2591 (-0.32z)| lr 7.37e-04 | 4151.72 ms | 32.5% bf16 MFU | 123769 tok/s step 4353/19560 | loss 3.490171 (-2.46z)| norm 0.2690 (-0.08z)| lr 7.37e-04 | 4249.26 ms | 31.8% bf16 MFU | 123750 tok/s step 4354/19560 | loss 3.665504 (+1.67z)| norm 0.3011 (+0.69z)| lr 7.37e-04 | 4152.02 ms | 32.5% bf16 MFU | 123876 tok/s step 4355/19560 | loss 3.605198 (+0.25z)| norm 0.2911 (+0.44z)| lr 7.37e-04 | 4150.30 ms | 32.5% bf16 MFU | 123999 tok/s step 4356/19560 | loss 3.532754 (-1.46z)| norm 0.2595 (-0.32z)| lr 7.37e-04 | 4153.96 ms | 32.5% bf16 MFU | 124109 tok/s step 4357/19560 | loss 3.697098 (+2.35z)| norm 0.2757 (+0.07z)| lr 7.37e-04 | 4201.88 ms | 32.1% bf16 MFU | 124143 tok/s step 4358/19560 | loss 3.565846 (-0.68z)| norm 0.2840 (+0.28z)| lr 7.37e-04 | 4154.68 ms | 32.5% bf16 MFU | 124245 tok/s step 4359/19560 | loss 3.633795 (+0.94z)| norm 0.4283 (+3.61z)| lr 7.37e-04 | 4160.32 ms | 32.5% bf16 MFU | 124334 tok/s step 4360/19560 | loss 3.628212 (+0.79z)| norm 0.3134 (+0.98z)| lr 7.37e-04 | 4159.67 ms | 32.5% bf16 MFU | 124419 tok/s step 4361/19560 | loss 3.649453 (+1.29z)| norm 0.2596 (-0.31z)| lr 7.37e-04 | 4365.87 ms | 30.9% bf16 MFU | 124203 tok/s step 4362/19560 | loss 3.597236 (+0.04z)| norm 0.3081 (+0.85z)| lr 7.37e-04 | 4183.04 ms | 32.3% bf16 MFU | 124259 tok/s step 4363/19560 | loss 3.576464 (-0.46z)| norm 0.3046 (+0.77z)| lr 7.37e-04 | 4166.18 ms | 32.4% bf16 MFU | 124339 tok/s step 4364/19560 | loss 3.577227 (-0.45z)| norm 0.2898 (+0.41z)| lr 7.37e-04 | 4171.99 ms | 32.4% bf16 MFU | 124405 tok/s step 4365/19560 | loss 3.593091 (-0.06z)| norm 0.2573 (-0.36z)| lr 7.37e-04 | 6374.58 ms | 21.2% bf16 MFU | 122297 tok/s step 4366/19560 | loss 3.558773 (-0.87z)| norm 0.2809 (+0.21z)| lr 7.37e-04 | 4157.14 ms | 32.5% bf16 MFU | 122488 tok/s step 4367/19560 | loss 3.616732 (+0.55z)| norm 0.2774 (+0.12z)| lr 7.37e-04 | 4208.86 ms | 
32.1% bf16 MFU | 122592 tok/s step 4368/19560 | loss 3.580023 (-0.36z)| norm 0.2591 (-0.32z)| lr 7.37e-04 | 4154.50 ms | 32.5% bf16 MFU | 122772 tok/s step 4369/19560 | loss 3.559142 (-0.87z)| norm 0.2402 (-0.76z)| lr 7.37e-04 | 4154.56 ms | 32.5% bf16 MFU | 122944 tok/s step 4370/19560 | loss 3.628688 (+0.83z)| norm 0.2658 (-0.15z)| lr 7.37e-04 | 4150.26 ms | 32.5% bf16 MFU | 123113 tok/s step 4371/19560 | loss 3.556898 (-0.93z)| norm 0.2658 (-0.14z)| lr 7.37e-04 | 4144.87 ms | 32.6% bf16 MFU | 123282 tok/s step 4372/19560 | loss 3.611068 (+0.42z)| norm 0.2359 (-0.85z)| lr 7.37e-04 | 4159.56 ms | 32.5% bf16 MFU | 123420 tok/s step 4373/19560 | loss 3.659831 (+1.62z)| norm 0.2784 (+0.16z)| lr 7.37e-04 | 4186.96 ms | 32.2% bf16 MFU | 123510 tok/s step 4374/19560 | loss 3.570429 (-0.59z)| norm 0.2584 (-0.32z)| lr 7.37e-04 | 4156.14 ms | 32.5% bf16 MFU | 123642 tok/s step 4375/19560 | loss 3.564164 (-0.74z)| norm 0.2568 (-0.36z)| lr 7.37e-04 | 4148.98 ms | 32.5% bf16 MFU | 123778 tok/s step 4376/19560 | loss 3.589682 (-0.11z)| norm 0.2864 (+0.35z)| lr 7.36e-04 | 4155.95 ms | 32.5% bf16 MFU | 123897 tok/s step 4377/19560 | loss 3.492844 (-2.43z)| norm 0.2384 (-0.80z)| lr 7.36e-04 | 4163.11 ms | 32.4% bf16 MFU | 123999 tok/s step 4378/19560 | loss 3.646701 (+1.27z)| norm 0.2903 (+0.43z)| lr 7.36e-04 | 4162.77 ms | 32.4% bf16 MFU | 124096 tok/s step 4379/19560 | loss 3.571763 (-0.58z)| norm 0.2564 (-0.38z)| lr 7.36e-04 | 4168.26 ms | 32.4% bf16 MFU | 124180 tok/s step 4380/19560 | loss 3.607148 (+0.30z)| norm 0.2442 (-0.67z)| lr 7.36e-04 | 4164.78 ms | 32.4% bf16 MFU | 124266 tok/s step 4381/19560 | loss 3.685237 (+2.20z)| norm 0.2433 (-0.68z)| lr 7.36e-04 | 4169.80 ms | 32.4% bf16 MFU | 124339 tok/s step 4382/19560 | loss 3.550612 (-1.12z)| norm 0.2324 (-0.93z)| lr 7.36e-04 | 4235.67 ms | 31.9% bf16 MFU | 124311 tok/s step 4383/19560 | loss 3.622133 (+0.64z)| norm 0.2719 (+0.00z)| lr 7.36e-04 | 4154.49 ms | 32.5% bf16 MFU | 124405 tok/s step 4384/19560 | loss 3.534507 
(-1.51z)| norm 0.2263 (-1.08z)| lr 7.36e-04 | 4174.44 ms | 32.3% bf16 MFU | 124465 tok/s step 4385/19560 | loss 3.595533 (-0.01z)| norm 0.2621 (-0.24z)| lr 7.36e-04 | 4163.27 ms | 32.4% bf16 MFU | 124538 tok/s step 4386/19560 | loss 3.625265 (+0.71z)| norm 0.2618 (-0.25z)| lr 7.36e-04 | 4150.43 ms | 32.5% bf16 MFU | 124627 tok/s step 4387/19560 | loss 3.585988 (-0.24z)| norm 0.2625 (-0.24z)| lr 7.36e-04 | 4149.75 ms | 32.5% bf16 MFU | 124713 tok/s step 4388/19560 | loss 3.600959 (+0.14z)| norm 0.2586 (-0.33z)| lr 7.36e-04 | 4163.90 ms | 32.4% bf16 MFU | 124773 tok/s step 4389/19560 | loss 3.599085 (+0.10z)| norm 0.2838 (+0.27z)| lr 7.36e-04 | 4153.95 ms | 32.5% bf16 MFU | 124845 tok/s step 4390/19560 | loss 3.706710 (+2.68z)| norm 0.2750 (+0.06z)| lr 7.36e-04 | 4155.27 ms | 32.5% bf16 MFU | 124912 tok/s step 4391/19560 | loss 3.606329 (+0.25z)| norm 0.2495 (-0.54z)| lr 7.36e-04 | 4150.34 ms | 32.5% bf16 MFU | 124982 tok/s step 4392/19560 | loss 3.659725 (+1.52z)| norm 0.2584 (-0.32z)| lr 7.36e-04 | 4169.96 ms | 32.4% bf16 MFU | 125020 tok/s step 4393/19560 | loss 3.521513 (-1.77z)| norm 0.2528 (-0.44z)| lr 7.36e-04 | 4151.61 ms | 32.5% bf16 MFU | 125083 tok/s step 4394/19560 | loss 3.711179 (+2.64z)| norm 0.2557 (-0.36z)| lr 7.36e-04 | 4179.53 ms | 32.3% bf16 MFU | 125101 tok/s step 4395/19560 | loss 3.570347 (-0.61z)| norm 0.2471 (-0.57z)| lr 7.36e-04 | 4188.53 ms | 32.2% bf16 MFU | 125104 tok/s step 4396/19560 | loss 3.585977 (-0.25z)| norm 0.2139 (-1.36z)| lr 7.36e-04 | 4186.26 ms | 32.3% bf16 MFU | 125111 tok/s step 4397/19560 | loss 3.584799 (-0.27z)| norm 0.2357 (-0.82z)| lr 7.36e-04 | 4164.61 ms | 32.4% bf16 MFU | 125150 tok/s step 4398/19560 | loss 3.613274 (+0.38z)| norm 0.2426 (-0.65z)| lr 7.36e-04 | 4268.44 ms | 31.6% bf16 MFU | 125034 tok/s step 4399/19560 | loss 3.532589 (-1.46z)| norm 0.2518 (-0.43z)| lr 7.36e-04 | 4164.27 ms | 32.4% bf16 MFU | 125078 tok/s step 4400/19560 | loss 3.560421 (-0.82z)| norm 0.2590 (-0.26z)| lr 7.36e-04 | 4155.23 ms | 
32.5% bf16 MFU | 125132 tok/s step 4401/19560 | loss 3.544328 (-1.18z)| norm 0.2517 (-0.44z)| lr 7.36e-04 | 4178.25 ms | 32.3% bf16 MFU | 125150 tok/s step 4402/19560 | loss 3.573884 (-0.50z)| norm 0.2232 (-1.12z)| lr 7.36e-04 | 4159.80 ms | 32.5% bf16 MFU | 125194 tok/s step 4403/19560 | loss 3.583488 (-0.27z)| norm 0.2608 (-0.21z)| lr 7.35e-04 | 4363.80 ms | 30.9% bf16 MFU | 124942 tok/s step 4404/19560 | loss 3.569439 (-0.59z)| norm 0.2791 (+0.22z)| lr 7.35e-04 | 4152.16 ms | 32.5% bf16 MFU | 125008 tok/s step 4405/19560 | loss 3.566188 (-0.65z)| norm 0.2614 (-0.21z)| lr 7.35e-04 | 4153.87 ms | 32.5% bf16 MFU | 125068 tok/s step 4406/19560 | loss 3.566294 (-0.65z)| norm 0.2347 (-1.17z)| lr 7.35e-04 | 4168.38 ms | 32.4% bf16 MFU | 125104 tok/s step 4407/19560 | loss 3.575931 (-0.42z)| norm 0.2610 (-0.19z)| lr 7.35e-04 | 4219.19 ms | 32.0% bf16 MFU | 125062 tok/s step 4408/19560 | loss 3.622320 (+0.64z)| norm 0.2622 (-0.13z)| lr 7.35e-04 | 4172.21 ms | 32.4% bf16 MFU | 125092 tok/s step 4409/19560 | loss 3.529869 (-1.45z)| norm 0.2686 (+0.12z)| lr 7.35e-04 | 4335.54 ms | 31.1% bf16 MFU | 124884 tok/s step 4410/19560 | loss 3.623117 (+0.67z)| norm 0.2754 (+0.39z)| lr 7.35e-04 | 4164.44 ms | 32.4% bf16 MFU | 124934 tok/s step 4411/19560 | loss 3.558080 (-0.81z)| norm 0.2452 (-0.76z)| lr 7.35e-04 | 4168.13 ms | 32.4% bf16 MFU | 124977 tok/s step 4412/19560 | loss 3.587279 (-0.13z)| norm 0.2537 (-0.42z)| lr 7.35e-04 | 4202.81 ms | 32.1% bf16 MFU | 124965 tok/s step 4413/19560 | loss 3.673816 (+1.81z)| norm 0.2484 (-0.62z)| lr 7.35e-04 | 4154.95 ms | 32.5% bf16 MFU | 125026 tok/s step 4414/19560 | loss 3.596216 (+0.05z)| norm 0.2714 (+0.28z)| lr 7.35e-04 | 4151.81 ms | 32.5% bf16 MFU | 125089 tok/s step 4415/19560 | loss 3.577641 (-0.41z)| norm 0.2822 (+0.70z)| lr 7.35e-04 | 4156.92 ms | 32.5% bf16 MFU | 125141 tok/s step 4416/19560 | loss 3.537245 (-1.34z)| norm 0.2788 (+0.56z)| lr 7.35e-04 | 4160.54 ms | 32.5% bf16 MFU | 125184 tok/s step 4417/19560 | loss 3.596568 
(+0.05z)| norm 0.2557 (-0.33z)| lr 7.35e-04 | 4157.61 ms | 32.5% bf16 MFU | 125230 tok/s step 4418/19560 | loss 3.571416 (-0.53z)| norm 0.2786 (+0.56z)| lr 7.35e-04 | 4423.74 ms | 30.5% bf16 MFU | 124895 tok/s step 4419/19560 | loss 3.579416 (-0.34z)| norm 0.2540 (-0.40z)| lr 7.35e-04 | 4156.27 ms | 32.5% bf16 MFU | 124957 tok/s step 4420/19560 | loss 3.585983 (-0.18z)| norm 0.2261 (-1.46z)| lr 7.35e-04 | 4246.72 ms | 31.8% bf16 MFU | 124882 tok/s step 4421/19560 | loss 3.569570 (-0.56z)| norm 0.2426 (-0.82z)| lr 7.35e-04 | 4153.50 ms | 32.5% bf16 MFU | 124949 tok/s step 4422/19560 | loss 3.586528 (-0.17z)| norm 0.2619 (-0.08z)| lr 7.35e-04 | 4156.62 ms | 32.5% bf16 MFU | 125009 tok/s step 4423/19560 | loss 3.612380 (+0.43z)| norm 0.2416 (-0.86z)| lr 7.35e-04 | 4161.42 ms | 32.4% bf16 MFU | 125058 tok/s step 4424/19560 | loss 3.602482 (+0.20z)| norm 0.2792 (+0.58z)| lr 7.35e-04 | 4161.15 ms | 32.4% bf16 MFU | 125104 tok/s step 4425/19560 | loss 3.605800 (+0.27z)| norm 0.2881 (+0.91z)| lr 7.35e-04 | 4154.19 ms | 32.5% bf16 MFU | 125160 tok/s step 4426/19560 | loss 3.585467 (-0.20z)| norm 0.2613 (-0.12z)| lr 7.35e-04 | 4159.83 ms | 32.5% bf16 MFU | 125203 tok/s step 4427/19560 | loss 3.582972 (-0.26z)| norm 0.2470 (-0.67z)| lr 7.35e-04 | 4167.24 ms | 32.4% bf16 MFU | 125234 tok/s step 4428/19560 | loss 3.519362 (-1.73z)| norm 0.2274 (-1.41z)| lr 7.35e-04 | 4152.28 ms | 32.5% bf16 MFU | 125285 tok/s step 4429/19560 | loss 3.572193 (-0.49z)| norm 0.2398 (-0.93z)| lr 7.35e-04 | 4154.60 ms | 32.5% bf16 MFU | 125331 tok/s step 4430/19560 | loss 3.629426 (+0.84z)| norm 0.2343 (-1.12z)| lr 7.34e-04 | 4165.77 ms | 32.4% bf16 MFU | 125357 tok/s step 4431/19560 | loss 3.548904 (-1.02z)| norm 0.2448 (-0.71z)| lr 7.34e-04 | 4168.59 ms | 32.4% bf16 MFU | 125378 tok/s step 4432/19560 | loss 3.638274 (+1.05z)| norm 0.2437 (-0.75z)| lr 7.34e-04 | 4164.59 ms | 32.4% bf16 MFU | 125404 tok/s step 4433/19560 | loss 3.656252 (+1.47z)| norm 0.2685 (+0.19z)| lr 7.34e-04 | 4152.98 ms | 
32.5% bf16 MFU | 125446 tok/s step 4434/19560 | loss 3.582381 (-0.26z)| norm 0.3218 (+2.17z)| lr 7.34e-04 | 4172.89 ms | 32.4% bf16 MFU | 125455 tok/s step 4435/19560 | loss 3.598465 (+0.11z)| norm 0.3085 (+1.64z)| lr 7.34e-04 | 4157.46 ms | 32.5% bf16 MFU | 125488 tok/s step 4436/19560 | loss 3.543297 (-1.18z)| norm 0.2649 (-0.00z)| lr 7.34e-04 | 4160.66 ms | 32.5% bf16 MFU | 125514 tok/s step 4437/19560 | loss 3.540335 (-1.23z)| norm 0.2584 (-0.25z)| lr 7.34e-04 | 4152.15 ms | 32.5% bf16 MFU | 125552 tok/s step 4438/19560 | loss 3.562028 (-0.73z)| norm 0.2944 (+1.10z)| lr 7.34e-04 | 4154.10 ms | 32.5% bf16 MFU | 125585 tok/s step 4439/19560 | loss 3.611438 (+0.42z)| norm 0.2774 (+0.45z)| lr 7.34e-04 | 4154.48 ms | 32.5% bf16 MFU | 125615 tok/s step 4440/19560 | loss 3.650841 (+1.32z)| norm 0.2714 (+0.22z)| lr 7.34e-04 | 4370.30 ms | 30.9% bf16 MFU | 125333 tok/s step 4441/19560 | loss 3.581353 (-0.31z)| norm 0.2929 (+1.02z)| lr 7.34e-04 | 4156.16 ms | 32.5% bf16 MFU | 125374 tok/s step 4442/19560 | loss 3.639448 (+1.04z)| norm 0.2751 (+0.34z)| lr 7.34e-04 | 4158.13 ms | 32.5% bf16 MFU | 125409 tok/s step 4443/19560 | loss 3.616185 (+0.48z)| norm 0.2749 (+0.33z)| lr 7.34e-04 | 4164.58 ms | 32.4% bf16 MFU | 125433 tok/s step 4444/19560 | loss 3.560717 (-0.81z)| norm 0.2819 (+0.58z)| lr 7.34e-04 | 4199.95 ms | 32.1% bf16 MFU | 125403 tok/s step 4445/19560 | loss 3.671855 (+1.76z)| norm 0.2435 (-0.87z)| lr 7.34e-04 | 4156.48 ms | 32.5% bf16 MFU | 125440 tok/s step 4446/19560 | loss 3.566205 (-0.67z)| norm 0.2496 (-0.63z)| lr 7.34e-04 | 4165.14 ms | 32.4% bf16 MFU | 125462 tok/s step 4447/19560 | loss 3.557618 (-0.86z)| norm 0.2391 (-1.01z)| lr 7.34e-04 | 4154.36 ms | 32.5% bf16 MFU | 125499 tok/s step 4448/19560 | loss 3.584656 (-0.23z)| norm 0.2269 (-1.44z)| lr 7.34e-04 | 4164.53 ms | 32.4% bf16 MFU | 125519 tok/s step 4449/19560 | loss 3.573395 (-0.49z)| norm 0.2538 (-0.45z)| lr 7.34e-04 | 4169.66 ms | 32.4% bf16 MFU | 125530 tok/s step 4450/19560 | loss 3.528046 
(-1.51z)| norm 0.2669 (+0.03z)| lr 7.34e-04 | 4158.90 ms | 32.5% bf16 MFU | 125556 tok/s step 4451/19560 | loss 3.622115 (+0.65z)| norm 0.2990 (+1.22z)| lr 7.34e-04 | 4169.49 ms | 32.4% bf16 MFU | 125566 tok/s step 4452/19560 | loss 3.593779 (-0.00z)| norm 0.2837 (+0.64z)| lr 7.34e-04 | 4271.02 ms | 31.6% bf16 MFU | 125425 tok/s step 4453/19560 | loss 3.540612 (-1.22z)| norm 0.2549 (-0.44z)| lr 7.34e-04 | 4153.38 ms | 32.5% bf16 MFU | 125465 tok/s step 4454/19560 | loss 3.540993 (-1.19z)| norm 0.2804 (+0.52z)| lr 7.34e-04 | 4165.87 ms | 32.4% bf16 MFU | 125485 tok/s step 4455/19560 | loss 3.619934 (+0.60z)| norm 0.2732 (+0.24z)| lr 7.34e-04 | 4160.78 ms | 32.5% bf16 MFU | 125511 tok/s step 4456/19560 | loss 3.591643 (-0.04z)| norm 0.2756 (+0.32z)| lr 7.34e-04 | 4159.99 ms | 32.5% bf16 MFU | 125537 tok/s step 4457/19560 | loss 3.630046 (+0.83z)| norm 0.2470 (-0.76z)| lr 7.33e-04 | 4153.32 ms | 32.5% bf16 MFU | 125572 tok/s step 4458/19560 | loss 3.599964 (+0.14z)| norm 0.2712 (+0.15z)| lr 7.33e-04 | 4154.54 ms | 32.5% bf16 MFU | 125603 tok/s step 4459/19560 | loss 3.607427 (+0.31z)| norm 0.2674 (-0.00z)| lr 7.33e-04 | 4162.51 ms | 32.4% bf16 MFU | 125621 tok/s step 4460/19560 | loss 3.581388 (-0.29z)| norm 0.2987 (+1.17z)| lr 7.33e-04 | 4153.42 ms | 32.5% bf16 MFU | 125651 tok/s step 4461/19560 | loss 3.605223 (+0.27z)| norm 0.2813 (+0.50z)| lr 7.33e-04 | 4172.49 ms | 32.4% bf16 MFU | 125651 tok/s step 4462/19560 | loss 3.600898 (+0.18z)| norm 0.3266 (+2.16z)| lr 7.33e-04 | 4169.16 ms | 32.4% bf16 MFU | 125656 tok/s step 4463/19560 | loss 3.581430 (-0.27z)| norm 0.2673 (-0.03z)| lr 7.33e-04 | 4249.17 ms | 31.8% bf16 MFU | 125543 tok/s step 4464/19560 | loss 3.599688 (+0.15z)| norm 0.2703 (+0.08z)| lr 7.33e-04 | 4150.86 ms | 32.5% bf16 MFU | 125581 tok/s step 4465/19560 | loss 3.630317 (+0.85z)| norm 0.2590 (-0.33z)| lr 7.33e-04 | 4173.70 ms | 32.3% bf16 MFU | 125583 tok/s step 4466/19560 | loss 3.554570 (-0.91z)| norm 0.3076 (+1.51z)| lr 7.33e-04 | 4201.38 ms | 
32.1% bf16 MFU | 125543 tok/s step 4467/19560 | loss 3.636871 (+0.99z)| norm 0.3463 (+2.89z)| lr 7.33e-04 | 4148.41 ms | 32.5% bf16 MFU | 125585 tok/s step 4468/19560 | loss 3.608476 (+0.33z)| norm 0.2629 (-0.18z)| lr 7.33e-04 | 4161.69 ms | 32.4% bf16 MFU | 125605 tok/s step 4469/19560 | loss 3.636714 (+1.01z)| norm 0.3514 (+2.95z)| lr 7.33e-04 | 4153.44 ms | 32.5% bf16 MFU | 125636 tok/s step 4470/19560 | loss 3.568746 (-0.59z)| norm 0.3486 (+2.75z)| lr 7.33e-04 | 4166.01 ms | 32.4% bf16 MFU | 125647 tok/s step 4471/19560 | loss 3.606100 (+0.30z)| norm 0.2882 (+0.66z)| lr 7.33e-04 | 4165.16 ms | 32.4% bf16 MFU | 125658 tok/s step 4472/19560 | loss 3.603572 (+0.24z)| norm 0.2655 (-0.12z)| lr 7.33e-04 | 4161.01 ms | 32.4% bf16 MFU | 125675 tok/s step 4473/19560 | loss 3.640041 (+1.09z)| norm 0.2913 (+0.80z)| lr 7.33e-04 | 4157.91 ms | 32.5% bf16 MFU | 125696 tok/s step 4474/19560 | loss 3.633548 (+0.92z)| norm 0.3128 (+1.53z)| lr 7.33e-04 | 4195.89 ms | 32.2% bf16 MFU | 125659 tok/s step 4475/19560 | loss 3.618716 (+0.56z)| norm 0.2787 (+0.34z)| lr 7.33e-04 | 4152.17 ms | 32.5% bf16 MFU | 125690 tok/s step 4476/19560 | loss 3.497998 (-2.21z)| norm 0.2356 (-1.15z)| lr 7.33e-04 | 4153.80 ms | 32.5% bf16 MFU | 125716 tok/s step 4477/19560 | loss 3.538954 (-1.26z)| norm 0.2801 (+0.39z)| lr 7.33e-04 | 4168.10 ms | 32.4% bf16 MFU | 125719 tok/s step 4478/19560 | loss 3.619759 (+0.62z)| norm 0.2751 (+0.22z)| lr 7.33e-04 | 4170.35 ms | 32.4% bf16 MFU | 125719 tok/s step 4479/19560 | loss 3.583577 (-0.21z)| norm 0.2660 (-0.10z)| lr 7.33e-04 | 4161.08 ms | 32.4% bf16 MFU | 125733 tok/s step 4480/19560 | loss 3.592142 (-0.01z)| norm 0.2565 (-0.43z)| lr 7.33e-04 | 4157.57 ms | 32.5% bf16 MFU | 125752 tok/s step 4481/19560 | loss 3.521801 (-1.73z)| norm 0.2406 (-0.97z)| lr 7.33e-04 | 4153.33 ms | 32.5% bf16 MFU | 125776 tok/s step 4482/19560 | loss 3.569775 (-0.55z)| norm 0.2637 (-0.16z)| lr 7.33e-04 | 4172.04 ms | 32.4% bf16 MFU | 125771 tok/s step 4483/19560 | loss 3.507133 
(-2.04z)| norm 0.2388 (-1.01z)| lr 7.33e-04 | 4167.90 ms | 32.4% bf16 MFU | 125772 tok/s step 4484/19560 | loss 3.562367 (-0.71z)| norm 0.2370 (-1.06z)| lr 7.32e-04 | 4150.28 ms | 32.5% bf16 MFU | 125799 tok/s step 4485/19560 | loss 3.568577 (-0.55z)| norm 0.2409 (-0.91z)| lr 7.32e-04 | 4436.70 ms | 30.4% bf16 MFU | 125418 tok/s step 4486/19560 | loss 3.642799 (+1.28z)| norm 0.2173 (-1.70z)| lr 7.32e-04 | 4147.83 ms | 32.6% bf16 MFU | 125467 tok/s step 4487/19560 | loss 3.556735 (-0.84z)| norm 0.2302 (-1.37z)| lr 7.32e-04 | 4158.57 ms | 32.5% bf16 MFU | 125497 tok/s step 4488/19560 | loss 3.633985 (+1.08z)| norm 0.2720 (+0.28z)| lr 7.32e-04 | 4172.77 ms | 32.4% bf16 MFU | 125505 tok/s step 4489/19560 | loss 3.572169 (-0.45z)| norm 0.2816 (+0.65z)| lr 7.32e-04 | 4151.19 ms | 32.5% bf16 MFU | 125544 tok/s step 4490/19560 | loss 3.622982 (+0.82z)| norm 0.2661 (+0.05z)| lr 7.32e-04 | 4157.32 ms | 32.5% bf16 MFU | 125573 tok/s step 4491/19560 | loss 3.527815 (-1.54z)| norm 0.2446 (-0.80z)| lr 7.32e-04 | 4166.05 ms | 32.4% bf16 MFU | 125587 tok/s step 4492/19560 | loss 3.561796 (-0.69z)| norm 0.2246 (-1.57z)| lr 7.32e-04 | 4893.07 ms | 27.6% bf16 MFU | 124665 tok/s step 4493/19560 | loss 3.618546 (+0.71z)| norm 0.2348 (-1.15z)| lr 7.32e-04 | 4470.56 ms | 30.2% bf16 MFU | 124295 tok/s step 4494/19560 | loss 3.545911 (-1.08z)| norm 0.2546 (-0.35z)| lr 7.32e-04 | 4588.15 ms | 29.4% bf16 MFU | 123794 tok/s step 4495/19560 | loss 3.588059 (-0.04z)| norm 0.2720 (+0.34z)| lr 7.32e-04 | 4341.60 ms | 31.1% bf16 MFU | 123642 tok/s step 4496/19560 | loss 3.557250 (-0.79z)| norm 0.2689 (+0.22z)| lr 7.32e-04 | 4226.22 ms | 31.9% bf16 MFU | 123663 tok/s step 4497/19560 | loss 3.542490 (-1.15z)| norm 0.2490 (-0.58z)| lr 7.32e-04 | 4341.13 ms | 31.1% bf16 MFU | 123518 tok/s step 4498/19560 | loss 3.614108 (+0.61z)| norm 0.2816 (+0.72z)| lr 7.32e-04 | 4329.03 ms | 31.2% bf16 MFU | 123398 tok/s step 4499/19560 | loss 3.603626 (+0.35z)| norm 0.2838 (+0.80z)| lr 7.32e-04 | 4265.61 ms | 
31.7% bf16 MFU | 123374 tok/s step 4500/19560 | loss 3.628299 (+0.95z)| norm 0.2722 (+0.32z)| lr 7.32e-04 | 4187.16 ms | 32.2% bf16 MFU | 123466 tok/s val loss 3.573333 
HellaSwag: 2771/10042 = 0.275941 step 4501/19560 | loss 3.584048 (-0.13z)| norm 0.2464 (-0.70z)| lr 7.32e-04 | 4442.83 ms | 30.4% bf16 MFU | 123193 tok/s step 4502/19560 | loss 3.664247 (+1.83z)| norm 0.2729 (+0.36z)| lr 7.32e-04 | 4217.01 ms | 32.0% bf16 MFU | 123249 tok/s step 4503/19560 | loss 3.570505 (-0.48z)| norm 0.2218 (-1.66z)| lr 7.32e-04 | 4356.70 ms | 31.0% bf16 MFU | 123104 tok/s step 4504/19560 | loss 3.511708 (-1.88z)| norm 0.2510 (-0.49z)| lr 7.32e-04 | 4162.77 ms | 32.4% bf16 MFU | 123246 tok/s step 4505/19560 | loss 3.546671 (-1.06z)| norm 0.2420 (-0.85z)| lr 7.32e-04 | 4226.63 ms | 31.9% bf16 MFU | 123286 tok/s step 4506/19560 | loss 3.721754 (+3.14z)| norm 0.2726 (+0.37z)| lr 7.32e-04 | 4167.14 ms | 32.4% bf16 MFU | 123412 tok/s step 4507/19560 | loss 3.614199 (+0.56z)| norm 0.2559 (-0.30z)| lr 7.32e-04 | 4293.85 ms | 31.4% bf16 MFU | 123347 tok/s step 4508/19560 | loss 3.581974 (-0.20z)| norm 0.2966 (+1.30z)| lr 7.32e-04 | 4159.22 ms | 32.5% bf16 MFU | 123482 tok/s step 4509/19560 | loss 3.604097 (+0.35z)| norm 0.2945 (+1.20z)| lr 7.32e-04 | 4273.63 ms | 31.6% bf16 MFU | 123442 tok/s step 4510/19560 | loss 3.550958 (-0.95z)| norm 0.2785 (+0.56z)| lr 7.31e-04 | 4187.65 ms | 32.2% bf16 MFU | 123530 tok/s step 4511/19560 | loss 3.533863 (-1.34z)| norm 0.2674 (+0.12z)| lr 7.31e-04 | 4262.76 ms | 31.7% bf16 MFU | 123503 tok/s step 4512/19560 | loss 3.576277 (-0.32z)| norm 0.2634 (-0.05z)| lr 7.31e-04 | 4162.60 ms | 32.4% bf16 MFU | 123626 tok/s step 4513/19560 | loss 3.556638 (-0.79z)| norm 0.2698 (+0.20z)| lr 7.31e-04 | 4220.62 ms | 32.0% bf16 MFU | 123655 tok/s step 4514/19560 | loss 3.569272 (-0.47z)| norm 0.2510 (-0.55z)| lr 7.31e-04 | 4211.39 ms | 32.1% bf16 MFU | 123697 tok/s step 4515/19560 | loss 3.560449 (-0.68z)| norm 0.2548 (-0.39z)| lr 7.31e-04 | 4262.14 ms | 31.7% bf16 MFU | 123663 tok/s step 4516/19560 | loss 3.622260 (+0.82z)| norm 0.2790 
(+0.57z)| lr 7.31e-04 | 4357.69 ms | 31.0% bf16 MFU | 123495 tok/s step 4517/19560 | loss 3.582955 (-0.14z)| norm 0.3004 (+1.41z)| lr 7.31e-04 | 4324.47 ms | 31.2% bf16 MFU | 123383 tok/s step 4518/19560 | loss 3.578265 (-0.23z)| norm 0.2751 (+0.40z)| lr 7.31e-04 | 4355.54 ms | 31.0% bf16 MFU | 123232 tok/s step 4519/19560 | loss 3.567144 (-0.51z)| norm 0.2853 (+0.80z)| lr 7.31e-04 | 4162.47 ms | 32.4% bf16 MFU | 123368 tok/s step 4520/19560 | loss 3.539667 (-1.18z)| norm 0.2772 (+0.47z)| lr 7.31e-04 | 4240.77 ms | 31.8% bf16 MFU | 123381 tok/s step 4521/19560 | loss 3.522332 (-1.62z)| norm 0.2429 (-0.88z)| lr 7.31e-04 | 4178.73 ms | 32.3% bf16 MFU | 123486 tok/s step 4522/19560 | loss 3.537043 (-1.26z)| norm 0.2780 (+0.50z)| lr 7.31e-04 | 4164.50 ms | 32.4% bf16 MFU | 123606 tok/s step 4523/19560 | loss 3.557897 (-0.71z)| norm 0.2572 (-0.33z)| lr 7.31e-04 | 4174.52 ms | 32.3% bf16 MFU | 123705 tok/s step 4524/19560 | loss 3.584910 (+0.00z)| norm 0.2601 (-0.23z)| lr 7.31e-04 | 4171.53 ms | 32.4% bf16 MFU | 123804 tok/s step 4525/19560 | loss 3.567075 (-0.46z)| norm 0.2497 (-0.66z)| lr 7.31e-04 | 4165.46 ms | 32.4% bf16 MFU | 123907 tok/s step 4526/19560 | loss 3.629432 (+1.16z)| norm 0.2465 (-0.79z)| lr 7.31e-04 | 4228.56 ms | 31.9% bf16 MFU | 123911 tok/s step 4527/19560 | loss 3.514639 (-1.82z)| norm 0.2530 (-0.53z)| lr 7.31e-04 | 4166.43 ms | 32.4% bf16 MFU | 124008 tok/s step 4528/19560 | loss 3.552426 (-0.83z)| norm 0.2222 (-1.74z)| lr 7.31e-04 | 4175.32 ms | 32.3% bf16 MFU | 124086 tok/s step 4529/19560 | loss 3.566053 (-0.49z)| norm 0.2527 (-0.52z)| lr 7.31e-04 | 4181.91 ms | 32.3% bf16 MFU | 124150 tok/s step 4530/19560 | loss 3.700498 (+2.89z)| norm 0.2366 (-1.18z)| lr 7.31e-04 | 4276.44 ms | 31.6% bf16 MFU | 124072 tok/s step 4531/19560 | loss 3.525736 (-1.49z)| norm 0.2808 (+0.59z)| lr 7.31e-04 | 4162.56 ms | 32.4% bf16 MFU | 124166 tok/s step 4532/19560 | loss 3.546179 (-0.97z)| norm 0.2836 (+0.71z)| lr 7.31e-04 | 4184.43 ms | 32.3% bf16 MFU | 124223 
tok/s step 4533/19560 | loss 3.584287 (-0.02z)| norm 0.2753 (+0.37z)| lr 7.31e-04 | 4169.67 ms | 32.4% bf16 MFU | 124299 tok/s step 4534/19560 | loss 3.501574 (-2.04z)| norm 0.2655 (-0.04z)| lr 7.31e-04 | 4188.64 ms | 32.2% bf16 MFU | 124342 tok/s step 4535/19560 | loss 3.754419 (+3.87z)| norm 0.2596 (-0.27z)| lr 7.31e-04 | 4167.96 ms | 32.4% bf16 MFU | 124414 tok/s step 4536/19560 | loss 3.611767 (+0.59z)| norm 0.2685 (+0.08z)| lr 7.31e-04 | 4168.07 ms | 32.4% bf16 MFU | 124483 tok/s step 4537/19560 | loss 3.560071 (-0.61z)| norm 0.2597 (-0.27z)| lr 7.30e-04 | 4182.86 ms | 32.3% bf16 MFU | 124526 tok/s step 4538/19560 | loss 3.595383 (+0.22z)| norm 0.2826 (+0.65z)| lr 7.30e-04 | 4245.91 ms | 31.8% bf16 MFU | 124474 tok/s step 4539/19560 | loss 3.547332 (-0.90z)| norm 0.2822 (+0.63z)| lr 7.30e-04 | 4176.22 ms | 32.3% bf16 MFU | 124527 tok/s step 4540/19560 | loss 3.575628 (-0.24z)| norm 0.2544 (-0.50z)| lr 7.30e-04 | 4163.81 ms | 32.4% bf16 MFU | 124597 tok/s step 4541/19560 | loss 3.554809 (-0.71z)| norm 0.2335 (-1.33z)| lr 7.30e-04 | 4173.68 ms | 32.3% bf16 MFU | 124648 tok/s step 4542/19560 | loss 3.633176 (+1.12z)| norm 0.2504 (-0.65z)| lr 7.30e-04 | 4160.07 ms | 32.5% bf16 MFU | 124717 tok/s step 4543/19560 | loss 3.514028 (-1.64z)| norm 0.2498 (-0.66z)| lr 7.30e-04 | 4188.05 ms | 32.2% bf16 MFU | 124740 tok/s step 4544/19560 | loss 3.587314 (+0.05z)| norm 0.2479 (-0.73z)| lr 7.30e-04 | 4273.43 ms | 31.6% bf16 MFU | 124637 tok/s step 4545/19560 | loss 3.550512 (-0.80z)| norm 0.2382 (-1.10z)| lr 7.30e-04 | 4162.10 ms | 32.4% bf16 MFU | 124704 tok/s step 4546/19560 | loss 3.560632 (-0.56z)| norm 0.2425 (-0.92z)| lr 7.30e-04 | 4243.83 ms | 31.8% bf16 MFU | 124646 tok/s step 4547/19560 | loss 3.559963 (-0.57z)| norm 0.2653 (-0.01z)| lr 7.30e-04 | 4163.67 ms | 32.4% bf16 MFU | 124709 tok/s step 4548/19560 | loss 3.627583 (+0.99z)| norm 0.2595 (-0.26z)| lr 7.30e-04 | 4162.99 ms | 32.4% bf16 MFU | 124771 tok/s step 4549/19560 | loss 3.560940 (-0.55z)| norm 0.2464 
(-0.79z)| lr 7.30e-04 | 4175.96 ms | 32.3% bf16 MFU | 124810 tok/s step 4550/19560 | loss 3.584454 (-0.01z)| norm 0.2482 (-0.71z)| lr 7.30e-04 | 4181.79 ms | 32.3% bf16 MFU | 124838 tok/s step 4551/19560 | loss 3.582174 (-0.06z)| norm 0.2635 (-0.10z)| lr 7.30e-04 | 4160.64 ms | 32.5% bf16 MFU | 124897 tok/s step 4552/19560 | loss 3.524751 (-1.36z)| norm 0.2583 (-0.30z)| lr 7.30e-04 | 4166.00 ms | 32.4% bf16 MFU | 124944 tok/s step 4553/19560 | loss 3.581866 (-0.04z)| norm 0.2574 (-0.33z)| lr 7.30e-04 | 4175.19 ms | 32.3% bf16 MFU | 124976 tok/s step 4554/19560 | loss 3.595958 (+0.28z)| norm 0.2706 (+0.20z)| lr 7.30e-04 | 4172.00 ms | 32.4% bf16 MFU | 125010 tok/s step 4555/19560 | loss 3.571309 (-0.29z)| norm 0.2509 (-0.60z)| lr 7.30e-04 | 4269.34 ms | 31.6% bf16 MFU | 124900 tok/s step 4556/19560 | loss 3.565068 (-0.44z)| norm 0.2423 (-0.96z)| lr 7.30e-04 | 4167.49 ms | 32.4% bf16 MFU | 124945 tok/s step 4557/19560 | loss 3.528973 (-1.26z)| norm 0.2363 (-1.20z)| lr 7.30e-04 | 4242.80 ms | 31.8% bf16 MFU | 124877 tok/s step 4558/19560 | loss 3.575108 (-0.19z)| norm 0.2347 (-1.27z)| lr 7.30e-04 | 4175.20 ms | 32.3% bf16 MFU | 124911 tok/s step 4559/19560 | loss 3.582383 (-0.03z)| norm 0.2509 (-0.61z)| lr 7.30e-04 | 4178.37 ms | 32.3% bf16 MFU | 124940 tok/s step 4560/19560 | loss 3.520379 (-1.44z)| norm 0.2925 (+1.07z)| lr 7.30e-04 | 4168.39 ms | 32.4% bf16 MFU | 124981 tok/s step 4561/19560 | loss 3.559023 (-0.54z)| norm 0.2459 (-0.82z)| lr 7.30e-04 | 4174.44 ms | 32.3% bf16 MFU | 125012 tok/s step 4562/19560 | loss 3.522153 (-1.38z)| norm 0.2625 (-0.13z)| lr 7.30e-04 | 4161.60 ms | 32.4% bf16 MFU | 125061 tok/s step 4563/19560 | loss 3.550381 (-0.71z)| norm 0.2366 (-1.19z)| lr 7.29e-04 | 4170.65 ms | 32.4% bf16 MFU | 125093 tok/s step 4564/19560 | loss 3.678725 (+2.20z)| norm 0.2455 (-0.81z)| lr 7.29e-04 | 4256.90 ms | 31.7% bf16 MFU | 124996 tok/s step 4565/19560 | loss 3.557688 (-0.56z)| norm 0.2840 (+0.79z)| lr 7.29e-04 | 4176.47 ms | 32.3% bf16 MFU | 125023 
tok/s step 4566/19560 | loss 3.577160 (-0.12z)| norm 0.2592 (-0.23z)| lr 7.29e-04 | 4164.68 ms | 32.4% bf16 MFU | 125067 tok/s step 4567/19560 | loss 3.577266 (-0.11z)| norm 0.2645 (-0.01z)| lr 7.29e-04 | 4171.82 ms | 32.4% bf16 MFU | 125097 tok/s step 4568/19560 | loss 3.676619 (+2.14z)| norm 0.2536 (-0.46z)| lr 7.29e-04 | 4172.73 ms | 32.4% bf16 MFU | 125124 tok/s step 4569/19560 | loss 3.541245 (-0.92z)| norm 0.2532 (-0.46z)| lr 7.29e-04 | 4169.13 ms | 32.4% bf16 MFU | 125156 tok/s step 4570/19560 | loss 3.598050 (+0.37z)| norm 0.2478 (-0.68z)| lr 7.29e-04 | 4163.73 ms | 32.4% bf16 MFU | 125194 tok/s step 4571/19560 | loss 3.568054 (-0.30z)| norm 0.2721 (+0.34z)| lr 7.29e-04 | 4232.27 ms | 31.9% bf16 MFU | 125128 tok/s step 4572/19560 | loss 3.600227 (+0.42z)| norm 0.2570 (-0.28z)| lr 7.29e-04 | 4215.17 ms | 32.0% bf16 MFU | 125091 tok/s step 4573/19560 | loss 3.560913 (-0.46z)| norm 0.2837 (+0.83z)| lr 7.29e-04 | 4179.37 ms | 32.3% bf16 MFU | 125109 tok/s step 4574/19560 | loss 3.606855 (+0.60z)| norm 0.2496 (-0.61z)| lr 7.29e-04 | 4174.81 ms | 32.3% bf16 MFU | 125132 tok/s step 4575/19560 | loss 3.554410 (-0.62z)| norm 0.2556 (-0.36z)| lr 7.29e-04 | 4184.44 ms | 32.3% bf16 MFU | 125141 tok/s step 4576/19560 | loss 3.596962 (+0.36z)| norm 0.2575 (-0.30z)| lr 7.29e-04 | 4164.78 ms | 32.4% bf16 MFU | 125178 tok/s step 4577/19560 | loss 3.631279 (+1.14z)| norm 0.2497 (-0.63z)| lr 7.29e-04 | 4180.56 ms | 32.3% bf16 MFU | 125189 tok/s step 4578/19560 | loss 3.531133 (-1.16z)| norm 0.2603 (-0.18z)| lr 7.29e-04 | 4176.07 ms | 32.3% bf16 MFU | 125207 tok/s step 4579/19560 | loss 3.555249 (-0.60z)| norm 0.2692 (+0.22z)| lr 7.29e-04 | 4185.96 ms | 32.3% bf16 MFU | 125209 tok/s step 4580/19560 | loss 3.685559 (+2.34z)| norm 0.2514 (-0.54z)| lr 7.29e-04 | 4155.00 ms | 32.5% bf16 MFU | 125258 tok/s step 4581/19560 | loss 3.548236 (-0.76z)| norm 0.2853 (+0.91z)| lr 7.29e-04 | 4176.04 ms | 32.3% bf16 MFU | 125272 tok/s step 4582/19560 | loss 3.622686 (+0.91z)| norm 0.2742 
(+0.43z)| lr 7.29e-04 | 4172.75 ms | 32.4% bf16 MFU | 125291 tok/s step 4583/19560 | loss 3.586045 (+0.09z)| norm 0.2599 (-0.18z)| lr 7.29e-04 | 4171.34 ms | 32.4% bf16 MFU | 125311 tok/s step 4584/19560 | loss 3.534393 (-1.07z)| norm 0.2562 (-0.33z)| lr 7.29e-04 | 4198.27 ms | 32.2% bf16 MFU | 125290 tok/s step 4585/19560 | loss 3.544815 (-0.82z)| norm 0.2486 (-0.66z)| lr 7.29e-04 | 4206.92 ms | 32.1% bf16 MFU | 125256 tok/s step 4586/19560 | loss 3.526941 (-1.21z)| norm 0.2295 (-1.45z)| lr 7.29e-04 | 4177.46 ms | 32.3% bf16 MFU | 125269 tok/s step 4587/19560 | loss 3.587205 (+0.15z)| norm 0.2453 (-0.77z)| lr 7.29e-04 | 4164.86 ms | 32.4% bf16 MFU | 125299 tok/s step 4588/19560 | loss 3.588327 (+0.18z)| norm 0.2861 (+0.98z)| lr 7.29e-04 | 4187.24 ms | 32.2% bf16 MFU | 125295 tok/s step 4589/19560 | loss 3.574883 (-0.12z)| norm 0.2729 (+0.42z)| lr 7.28e-04 | 4164.26 ms | 32.4% bf16 MFU | 125325 tok/s step 4590/19560 | loss 3.526419 (-1.20z)| norm 0.2550 (-0.34z)| lr 7.28e-04 | 4182.06 ms | 32.3% bf16 MFU | 125327 tok/s step 4591/19560 | loss 3.555845 (-0.53z)| norm 0.2518 (-0.47z)| lr 7.28e-04 | 4182.54 ms | 32.3% bf16 MFU | 125329 tok/s step 4592/19560 | loss 3.599604 (+0.45z)| norm 0.2669 (+0.20z)| lr 7.28e-04 | 4178.84 ms | 32.3% bf16 MFU | 125335 tok/s step 4593/19560 | loss 3.622022 (+0.96z)| norm 0.2915 (+1.27z)| lr 7.28e-04 | 4183.51 ms | 32.3% bf16 MFU | 125335 tok/s step 4594/19560 | loss 3.613896 (+0.76z)| norm 0.2556 (-0.30z)| lr 7.28e-04 | 4164.50 ms | 32.4% bf16 MFU | 125363 tok/s step 4595/19560 | loss 3.598120 (+0.42z)| norm 0.2596 (-0.10z)| lr 7.28e-04 | 4165.07 ms | 32.4% bf16 MFU | 125388 tok/s step 4596/19560 | loss 3.526590 (-1.18z)| norm 0.2493 (-0.58z)| lr 7.28e-04 | 4168.36 ms | 32.4% bf16 MFU | 125408 tok/s step 4597/19560 | loss 3.543812 (-0.78z)| norm 0.2244 (-1.82z)| lr 7.28e-04 | 4163.60 ms | 32.4% bf16 MFU | 125434 tok/s step 4598/19560 | loss 3.558259 (-0.45z)| norm 0.2326 (-1.47z)| lr 7.28e-04 | 4175.24 ms | 32.3% bf16 MFU | 125440 
tok/s step 4599/19560 | loss 3.542277 (-0.80z)| norm 0.2237 (-1.92z)| lr 7.28e-04 | 4367.10 ms | 30.9% bf16 MFU | 125171 tok/s step 4600/19560 | loss 3.537358 (-0.89z)| norm 0.2292 (-1.59z)| lr 7.28e-04 | 4167.31 ms | 32.4% bf16 MFU | 125203 tok/s step 4601/19560 | loss 3.712219 (+2.95z)| norm 0.8880 (+10.69z)| lr 7.28e-04 | 4165.69 ms | 32.4% bf16 MFU | 125236 tok/s step 4602/19560 | loss 3.569778 (-0.16z)| norm 0.3368 (+1.25z)| lr 7.28e-04 | 4284.45 ms | 31.5% bf16 MFU | 125093 tok/s step 4603/19560 | loss 3.678445 (+2.18z)| norm 0.3390 (+1.27z)| lr 7.28e-04 | 4178.73 ms | 32.3% bf16 MFU | 125111 tok/s step 4604/19560 | loss 3.607826 (+0.64z)| norm 0.2647 (+0.01z)| lr 7.28e-04 | 4175.90 ms | 32.3% bf16 MFU | 125133 tok/s step 4605/19560 | loss 3.518698 (-1.30z)| norm 0.2741 (+0.17z)| lr 7.28e-04 | 4166.12 ms | 32.4% bf16 MFU | 125169 tok/s step 4606/19560 | loss 3.574983 (-0.07z)| norm 0.2713 (+0.12z)| lr 7.28e-04 | 4174.91 ms | 32.3% bf16 MFU | 125189 tok/s step 4607/19560 | loss 3.570205 (-0.17z)| norm 0.2580 (-0.11z)| lr 7.28e-04 | 4181.18 ms | 32.3% bf16 MFU | 125200 tok/s step 4608/19560 | loss 3.526428 (-1.11z)| norm 0.2274 (-0.62z)| lr 7.28e-04 | 4173.29 ms | 32.4% bf16 MFU | 125221 tok/s step 4609/19560 | loss 3.610109 (+0.70z)| norm 0.2468 (-0.29z)| lr 7.28e-04 | 4169.04 ms | 32.4% bf16 MFU | 125248 tok/s step 4610/19560 | loss 3.568774 (-0.20z)| norm 0.2320 (-0.54z)| lr 7.28e-04 | 4153.87 ms | 32.5% bf16 MFU | 125296 tok/s step 4611/19560 | loss 3.550962 (-0.61z)| norm 0.2284 (-0.60z)| lr 7.28e-04 | 4168.95 ms | 32.4% bf16 MFU | 125320 tok/s step 4612/19560 | loss 3.538949 (-0.86z)| norm 0.2516 (-0.21z)| lr 7.28e-04 | 4162.48 ms | 32.4% bf16 MFU | 125351 tok/s step 4613/19560 | loss 3.579867 (+0.03z)| norm 0.2345 (-0.50z)| lr 7.28e-04 | 4159.08 ms | 32.5% bf16 MFU | 125387 tok/s step 4614/19560 | loss 3.620633 (+0.94z)| norm 0.2131 (-0.86z)| lr 7.28e-04 | 4158.05 ms | 32.5% bf16 MFU | 125422 tok/s step 4615/19560 | loss 3.594042 (+0.35z)| norm 0.2419 
(-0.37z)| lr 7.27e-04 | 4192.28 ms | 32.2% bf16 MFU | 125404 tok/s step 4616/19560 | loss 3.567932 (-0.22z)| norm 0.2512 (-0.21z)| lr 7.27e-04 | 4165.85 ms | 32.4% bf16 MFU | 125426 tok/s step 4617/19560 | loss 3.500792 (-1.68z)| norm 0.2520 (-0.19z)| lr 7.27e-04 | 4208.98 ms | 32.1% bf16 MFU | 125383 tok/s step 4618/19560 | loss 3.617959 (+0.89z)| norm 0.2427 (-0.35z)| lr 7.27e-04 | 4183.66 ms | 32.3% bf16 MFU | 125380 tok/s step 4619/19560 | loss 3.606677 (+0.63z)| norm 0.2522 (-0.19z)| lr 7.27e-04 | 4162.99 ms | 32.4% bf16 MFU | 125408 tok/s step 4620/19560 | loss 3.568922 (-0.20z)| norm 0.2478 (-0.27z)| lr 7.27e-04 | 4305.12 ms | 31.4% bf16 MFU | 125227 tok/s step 4621/19560 | loss 3.554593 (-0.51z)| norm 0.2658 (+0.03z)| lr 7.27e-04 | 4157.50 ms | 32.5% bf16 MFU | 125271 tok/s step 4622/19560 | loss 3.564992 (-0.28z)| norm 0.2781 (+0.24z)| lr 7.27e-04 | 4223.00 ms | 32.0% bf16 MFU | 125215 tok/s step 4623/19560 | loss 3.597131 (+0.43z)| norm 0.2561 (-0.13z)| lr 7.27e-04 | 4400.71 ms | 30.7% bf16 MFU | 124911 tok/s step 4624/19560 | loss 3.533704 (-0.97z)| norm 0.2463 (-0.29z)| lr 7.27e-04 | 4622.84 ms | 29.2% bf16 MFU | 124336 tok/s step 4625/19560 | loss 3.601704 (+0.52z)| norm 0.2792 (+0.26z)| lr 7.27e-04 | 4242.53 ms | 31.8% bf16 MFU | 124298 tok/s step 4626/19560 | loss 3.502829 (-1.63z)| norm 0.2611 (-0.04z)| lr 7.27e-04 | 4159.31 ms | 32.5% bf16 MFU | 124386 tok/s step 4627/19560 | loss 3.607787 (+0.67z)| norm 0.2477 (-0.27z)| lr 7.27e-04 | 4177.93 ms | 32.3% bf16 MFU | 124441 tok/s step 4628/19560 | loss 3.594703 (+0.39z)| norm 0.2983 (+0.59z)| lr 7.27e-04 | 4151.02 ms | 32.5% bf16 MFU | 124534 tok/s step 4629/19560 | loss 3.574448 (-0.05z)| norm 0.2868 (+0.39z)| lr 7.27e-04 | 4162.12 ms | 32.4% bf16 MFU | 124606 tok/s step 4630/19560 | loss 3.525653 (-1.11z)| norm 0.2470 (-0.28z)| lr 7.27e-04 | 4155.51 ms | 32.5% bf16 MFU | 124684 tok/s step 4631/19560 | loss 3.639464 (+1.40z)| norm 0.2832 (+0.32z)| lr 7.27e-04 | 4420.59 ms | 30.5% bf16 MFU | 124380 
tok/s step 4632/19560 | loss 3.532702 (-0.97z)| norm 0.2657 (+0.02z)| lr 7.27e-04 | 4256.20 ms | 31.7% bf16 MFU | 124320 tok/s step 4633/19560 | loss 3.577954 (+0.03z)| norm 0.2845 (+0.34z)| lr 7.27e-04 | 4166.87 ms | 32.4% bf16 MFU | 124395 tok/s step 4634/19560 | loss 3.546710 (-0.66z)| norm 0.2486 (-0.27z)| lr 7.27e-04 | 4164.44 ms | 32.4% bf16 MFU | 124470 tok/s step 4635/19560 | loss 3.583418 (+0.19z)| norm 0.2643 (-0.01z)| lr 7.27e-04 | 4156.34 ms | 32.5% bf16 MFU | 124554 tok/s step 4636/19560 | loss 3.601017 (+0.60z)| norm 0.2394 (-0.42z)| lr 7.27e-04 | 4160.30 ms | 32.5% bf16 MFU | 124627 tok/s step 4637/19560 | loss 3.576844 (+0.04z)| norm 0.2689 (+0.08z)| lr 7.27e-04 | 4172.95 ms | 32.4% bf16 MFU | 124678 tok/s step 4638/19560 | loss 3.555176 (-0.46z)| norm 0.2607 (-0.05z)| lr 7.27e-04 | 4157.35 ms | 32.5% bf16 MFU | 124749 tok/s step 4639/19560 | loss 3.596939 (+0.50z)| norm 0.2629 (-0.01z)| lr 7.27e-04 | 4157.95 ms | 32.5% bf16 MFU | 124816 tok/s step 4640/19560 | loss 3.570061 (-0.13z)| norm 0.2501 (-0.23z)| lr 7.26e-04 | 4288.42 ms | 31.5% bf16 MFU | 124688 tok/s step 4641/19560 | loss 3.555359 (-0.47z)| norm 0.2608 (-0.05z)| lr 7.26e-04 | 4297.34 ms | 31.4% bf16 MFU | 124554 tok/s step 4642/19560 | loss 3.528741 (-1.08z)| norm 0.2543 (-0.16z)| lr 7.26e-04 | 4157.70 ms | 32.5% bf16 MFU | 124631 tok/s step 4643/19560 | loss 3.510873 (-1.47z)| norm 0.2908 (+0.46z)| lr 7.26e-04 | 4166.48 ms | 32.4% bf16 MFU | 124692 tok/s step 4644/19560 | loss 3.669391 (+2.14z)| norm 0.3204 (+0.95z)| lr 7.26e-04 | 4169.22 ms | 32.4% bf16 MFU | 124745 tok/s step 4645/19560 | loss 3.592670 (+0.40z)| norm 0.2482 (-0.26z)| lr 7.26e-04 | 4167.81 ms | 32.4% bf16 MFU | 124797 tok/s step 4646/19560 | loss 3.574958 (-0.01z)| norm 0.2631 (-0.01z)| lr 7.26e-04 | 4314.28 ms | 31.3% bf16 MFU | 124634 tok/s step 4647/19560 | loss 3.548584 (-0.60z)| norm 0.2593 (-0.07z)| lr 7.26e-04 | 4188.14 ms | 32.2% bf16 MFU | 124661 tok/s step 4648/19560 | loss 3.582621 (+0.16z)| norm 0.2868 
(+0.39z)| lr 7.26e-04 | 4247.62 ms | 31.8% bf16 MFU | 124600 tok/s step 4649/19560 | loss 3.575537 (-0.01z)| norm 0.2634 (-0.01z)| lr 7.26e-04 | 4167.06 ms | 32.4% bf16 MFU | 124660 tok/s step 4650/19560 | loss 3.555455 (-0.47z)| norm 0.2680 (+0.07z)| lr 7.26e-04 | 4160.17 ms | 32.5% bf16 MFU | 124729 tok/s step 4651/19560 | loss 3.521517 (-1.23z)| norm 0.2613 (-0.04z)| lr 7.26e-04 | 4161.41 ms | 32.4% bf16 MFU | 124792 tok/s step 4652/19560 | loss 3.616800 (+0.93z)| norm 0.2350 (-0.48z)| lr 7.26e-04 | 4167.82 ms | 32.4% bf16 MFU | 124842 tok/s step 4653/19560 | loss 3.541589 (-0.77z)| norm 0.2545 (-0.15z)| lr 7.26e-04 | 4162.54 ms | 32.4% bf16 MFU | 124897 tok/s step 4654/19560 | loss 3.536900 (-0.86z)| norm 0.2480 (-0.26z)| lr 7.26e-04 | 4167.22 ms | 32.4% bf16 MFU | 124943 tok/s step 4655/19560 | loss 3.526939 (-1.10z)| norm 0.2446 (-0.32z)| lr 7.26e-04 | 4155.23 ms | 32.5% bf16 MFU | 125005 tok/s step 4656/19560 | loss 3.600150 (+0.56z)| norm 0.2682 (+0.07z)| lr 7.26e-04 | 4163.55 ms | 32.4% bf16 MFU | 125051 tok/s step 4657/19560 | loss 3.538875 (-0.83z)| norm 0.2573 (-0.11z)| lr 7.26e-04 | 4183.20 ms | 32.3% bf16 MFU | 125065 tok/s step 4658/19560 | loss 3.583287 (+0.21z)| norm 0.2667 (+0.04z)| lr 7.26e-04 | 4234.84 ms | 31.9% bf16 MFU | 125002 tok/s step 4659/19560 | loss 3.579775 (+0.12z)| norm 0.2411 (-0.39z)| lr 7.26e-04 | 4179.46 ms | 32.3% bf16 MFU | 125024 tok/s step 4660/19560 | loss 3.540504 (-0.81z)| norm 0.2604 (-0.06z)| lr 7.26e-04 | 4161.71 ms | 32.4% bf16 MFU | 125072 tok/s step 4661/19560 | loss 3.561901 (-0.30z)| norm 0.2714 (+0.13z)| lr 7.26e-04 | 4162.12 ms | 32.4% bf16 MFU | 125116 tok/s step 4662/19560 | loss 3.578821 (+0.09z)| norm 0.2930 (+0.50z)| lr 7.26e-04 | 4180.49 ms | 32.3% bf16 MFU | 125131 tok/s step 4663/19560 | loss 3.533516 (-1.02z)| norm 0.2973 (+0.56z)| lr 7.26e-04 | 4173.46 ms | 32.4% bf16 MFU | 125156 tok/s step 4664/19560 | loss 3.548830 (-0.62z)| norm 0.3339 (+1.17z)| lr 7.26e-04 | 4266.84 ms | 31.6% bf16 MFU | 125042 
tok/s step 4665/19560 | loss 3.594123 (+0.54z)| norm 0.3121 (+0.79z)| lr 7.26e-04 | 4175.36 ms | 32.3% bf16 MFU | 125068 tok/s step 4666/19560 | loss 3.501145 (-1.81z)| norm 0.2568 (-0.13z)| lr 7.25e-04 | 4234.62 ms | 31.9% bf16 MFU | 125005 tok/s step 4667/19560 | loss 3.553520 (-0.48z)| norm 0.2899 (+0.42z)| lr 7.25e-04 | 4156.29 ms | 32.5% bf16 MFU | 125062 tok/s step 4668/19560 | loss 3.589581 (+0.43z)| norm 0.2542 (-0.18z)| lr 7.25e-04 | 4200.41 ms | 32.1% bf16 MFU | 125050 tok/s step 4669/19560 | loss 3.540811 (-0.80z)| norm 0.2712 (+0.10z)| lr 7.25e-04 | 4160.54 ms | 32.5% bf16 MFU | 125098 tok/s step 4670/19560 | loss 3.535569 (-0.92z)| norm 0.2307 (-0.58z)| lr 7.25e-04 | 4181.38 ms | 32.3% bf16 MFU | 125112 tok/s step 4671/19560 | loss 3.544614 (-0.70z)| norm 0.2295 (-0.60z)| lr 7.25e-04 | 4201.99 ms | 32.1% bf16 MFU | 125095 tok/s step 4672/19560 | loss 3.535681 (-0.92z)| norm 0.2393 (-0.43z)| lr 7.25e-04 | 4222.98 ms | 32.0% bf16 MFU | 125048 tok/s step 4673/19560 | loss 3.538638 (-0.84z)| norm 0.2412 (-0.40z)| lr 7.25e-04 | 4173.11 ms | 32.4% bf16 MFU | 125078 tok/s step 4674/19560 | loss 3.578528 (+0.18z)| norm 0.2375 (-0.46z)| lr 7.25e-04 | 4167.06 ms | 32.4% bf16 MFU | 125115 tok/s step 4675/19560 | loss 3.470490 (-2.50z)| norm 0.2402 (-0.41z)| lr 7.25e-04 | 4164.63 ms | 32.4% bf16 MFU | 125153 tok/s step 4676/19560 | loss 3.612736 (+1.05z)| norm 0.2310 (-0.56z)| lr 7.25e-04 | 4213.08 ms | 32.0% bf16 MFU | 125118 tok/s step 4677/19560 | loss 3.581990 (+0.28z)| norm 0.2521 (-0.21z)| lr 7.25e-04 | 4175.20 ms | 32.3% bf16 MFU | 125141 tok/s step 4678/19560 | loss 3.603180 (+0.80z)| norm 0.2536 (-0.18z)| lr 7.25e-04 | 4187.40 ms | 32.2% bf16 MFU | 125144 tok/s step 4679/19560 | loss 3.585315 (+0.35z)| norm 0.2405 (-0.40z)| lr 7.25e-04 | 4156.35 ms | 32.5% bf16 MFU | 125194 tok/s step 4680/19560 | loss 3.590398 (+0.47z)| norm 0.2330 (-0.52z)| lr 7.25e-04 | 4169.39 ms | 32.4% bf16 MFU | 125221 tok/s step 4681/19560 | loss 3.552577 (-0.47z)| norm 0.2717 
(+0.13z)| lr 7.25e-04 | 4172.94 ms | 32.4% bf16 MFU | 125242 tok/s step 4682/19560 | loss 3.597805 (+0.66z)| norm 0.2388 (-0.42z)| lr 7.25e-04 | 4152.75 ms | 32.5% bf16 MFU | 125293 tok/s step 4683/19560 | loss 3.583035 (+0.29z)| norm 0.2445 (-0.32z)| lr 7.25e-04 | 5571.87 ms | 24.2% bf16 MFU | 123733 tok/s step 4684/19560 | loss 3.548778 (-0.57z)| norm 0.2686 (+0.07z)| lr 7.25e-04 | 4443.79 ms | 30.4% bf16 MFU | 123445 tok/s step 4685/19560 | loss 3.580614 (+0.22z)| norm 0.2652 (+0.01z)| lr 7.25e-04 | 4612.52 ms | 29.3% bf16 MFU | 122956 tok/s step 4686/19560 | loss 3.531817 (-0.99z)| norm 0.2784 (+0.23z)| lr 7.25e-04 | 4282.85 ms | 31.5% bf16 MFU | 122929 tok/s step 4687/19560 | loss 3.638021 (+1.64z)| norm 0.3079 (+0.72z)| lr 7.25e-04 | 4262.68 ms | 31.7% bf16 MFU | 122933 tok/s step 4688/19560 | loss 3.537468 (-0.86z)| norm 0.2718 (+0.12z)| lr 7.25e-04 | 4243.37 ms | 31.8% bf16 MFU | 122964 tok/s step 4689/19560 | loss 3.596709 (+0.61z)| norm 0.2778 (+0.21z)| lr 7.25e-04 | 4194.10 ms | 32.2% bf16 MFU | 123066 tok/s step 4690/19560 | loss 3.567986 (-0.12z)| norm 0.2824 (+0.29z)| lr 7.25e-04 | 4246.04 ms | 31.8% bf16 MFU | 123086 tok/s step 4691/19560 | loss 3.634155 (+1.51z)| norm 0.2654 (-0.00z)| lr 7.24e-04 | 4161.33 ms | 32.4% bf16 MFU | 123232 tok/s step 4692/19560 | loss 3.541584 (-0.78z)| norm 0.2443 (-0.36z)| lr 7.24e-04 | 4204.78 ms | 32.1% bf16 MFU | 123304 tok/s step 4693/19560 | loss 3.572054 (-0.01z)| norm 0.2584 (-0.12z)| lr 7.24e-04 | 4245.35 ms | 31.8% bf16 MFU | 123314 tok/s step 4694/19560 | loss 3.594151 (+0.55z)| norm 0.2561 (-0.15z)| lr 7.24e-04 | 4172.57 ms | 32.4% bf16 MFU | 123431 tok/s step 4695/19560 | loss 3.540922 (-0.79z)| norm 0.2414 (-0.40z)| lr 7.24e-04 | 4172.93 ms | 32.4% bf16 MFU | 123541 tok/s step 4696/19560 | loss 3.598924 (+0.71z)| norm 0.2842 (+0.32z)| lr 7.24e-04 | 4173.27 ms | 32.4% bf16 MFU | 123646 tok/s step 4697/19560 | loss 3.548378 (-0.61z)| norm 0.2715 (+0.10z)| lr 7.24e-04 | 4171.56 ms | 32.4% bf16 MFU | 123748 
tok/s step 4698/19560 | loss 3.602141 (+0.79z)| norm 0.2463 (-0.32z)| lr 7.24e-04 | 4218.72 ms | 32.0% bf16 MFU | 123774 tok/s step 4699/19560 | loss 3.584323 (+0.33z)| norm 0.2678 (+0.04z)| lr 7.24e-04 | 4168.63 ms | 32.4% bf16 MFU | 123874 tok/s step 4700/19560 | loss 3.544573 (-0.70z)| norm 0.2604 (-0.09z)| lr 7.24e-04 | 4160.02 ms | 32.5% bf16 MFU | 123982 tok/s step 4701/19560 | loss 3.639022 (+1.72z)| norm 0.2319 (-0.56z)| lr 7.24e-04 | 4171.27 ms | 32.4% bf16 MFU | 124067 tok/s step 4702/19560 | loss 3.613748 (+1.07z)| norm 0.2340 (-0.52z)| lr 7.24e-04 | 4275.30 ms | 31.6% bf16 MFU | 123995 tok/s step 4703/19560 | loss 3.593474 (+0.54z)| norm 0.2770 (+0.20z)| lr 7.24e-04 | 4169.40 ms | 32.4% bf16 MFU | 124083 tok/s step 4704/19560 | loss 3.657527 (+2.14z)| norm 0.2375 (-0.46z)| lr 7.24e-04 | 4169.96 ms | 32.4% bf16 MFU | 124165 tok/s step 4705/19560 | loss 3.581168 (+0.22z)| norm 0.2404 (-0.41z)| lr 7.24e-04 | 4164.62 ms | 32.4% bf16 MFU | 124251 tok/s step 4706/19560 | loss 3.588391 (+0.40z)| norm 0.2548 (-0.17z)| lr 7.24e-04 | 4300.25 ms | 31.4% bf16 MFU | 124135 tok/s step 4707/19560 | loss 3.559588 (-0.34z)| norm 0.2533 (-0.19z)| lr 7.24e-04 | 4166.29 ms | 32.4% bf16 MFU | 124220 tok/s step 4708/19560 | loss 3.523703 (-1.26z)| norm 0.2847 (+0.33z)| lr 7.24e-04 | 4218.74 ms | 32.0% bf16 MFU | 124223 tok/s step 4709/19560 | loss 3.626018 (+1.40z)| norm 0.2927 (+0.46z)| lr 7.24e-04 | 4179.29 ms | 32.3% bf16 MFU | 124284 tok/s step 4710/19560 | loss 3.627512 (+1.44z)| norm 0.2820 (+0.28z)| lr 7.24e-04 | 4165.57 ms | 32.4% bf16 MFU | 124363 tok/s step 4711/19560 | loss 3.636703 (+1.65z)| norm 0.2737 (+0.14z)| lr 7.24e-04 | 4169.96 ms | 32.4% bf16 MFU | 124432 tok/s step 4712/19560 | loss 3.591526 (+0.48z)| norm 0.2809 (+0.26z)| lr 7.24e-04 | 4166.82 ms | 32.4% bf16 MFU | 124501 tok/s step 4713/19560 | loss 3.570059 (-0.09z)| norm 0.2655 (-0.00z)| lr 7.24e-04 | 4160.85 ms | 32.4% bf16 MFU | 124576 tok/s step 4714/19560 | loss 3.593355 (+0.51z)| norm 0.2887 
(+0.38z)| lr 7.24e-04 | 4176.68 ms | 32.3% bf16 MFU | 124624 tok/s step 4715/19560 | loss 3.621325 (+1.22z)| norm 0.2812 (+0.25z)| lr 7.24e-04 | 4231.56 ms | 31.9% bf16 MFU | 124588 tok/s step 4716/19560 | loss 3.576094 (+0.05z)| norm 0.2598 (-0.10z)| lr 7.23e-04 | 4159.02 ms | 32.5% bf16 MFU | 124661 tok/s step 4717/19560 | loss 3.667118 (+2.34z)| norm 0.2787 (+0.21z)| lr 7.23e-04 | 4234.41 ms | 31.9% bf16 MFU | 124619 tok/s step 4718/19560 | loss 3.514637 (-1.52z)| norm 0.2982 (+0.53z)| lr 7.23e-04 | 4179.86 ms | 32.3% bf16 MFU | 124660 tok/s step 4719/19560 | loss 3.585939 (+0.28z)| norm 0.3294 (+1.04z)| lr 7.23e-04 | 4167.05 ms | 32.4% bf16 MFU | 124718 tok/s step 4720/19560 | loss 3.660523 (+2.12z)| norm 0.2953 (+0.47z)| lr 7.23e-04 | 4193.13 ms | 32.2% bf16 MFU | 124733 tok/s step 4721/19560 | loss 3.587946 (+0.32z)| norm 0.2497 (-0.29z)| lr 7.23e-04 | 4166.38 ms | 32.4% bf16 MFU | 124789 tok/s step 4722/19560 | loss 3.505440 (-1.71z)| norm 0.2714 (+0.07z)| lr 7.23e-04 | 4159.18 ms | 32.5% bf16 MFU | 124852 tok/s step 4723/19560 | loss 3.518377 (-1.36z)| norm 0.2616 (-0.09z)| lr 7.23e-04 | 4252.02 ms | 31.8% bf16 MFU | 124775 tok/s step 4724/19560 | loss 3.572962 (-0.03z)| norm 0.2858 (+0.31z)| lr 7.23e-04 | 4168.53 ms | 32.4% bf16 MFU | 124824 tok/s step 4725/19560 | loss 3.578896 (+0.11z)| norm 0.2633 (-0.07z)| lr 7.23e-04 | 4169.37 ms | 32.4% bf16 MFU | 124871 tok/s step 4726/19560 | loss 3.561449 (-0.32z)| norm 0.2570 (-0.18z)| lr 7.23e-04 | 4215.78 ms | 32.0% bf16 MFU | 124845 tok/s step 4727/19560 | loss 3.539716 (-0.86z)| norm 0.2328 (-0.59z)| lr 7.23e-04 | 4160.04 ms | 32.5% bf16 MFU | 124904 tok/s step 4728/19560 | loss 3.565930 (-0.21z)| norm 0.2596 (-0.14z)| lr 7.23e-04 | 4166.98 ms | 32.4% bf16 MFU | 124950 tok/s step 4729/19560 | loss 3.607717 (+0.89z)| norm 0.3076 (+1.85z)| lr 7.23e-04 | 4162.52 ms | 32.4% bf16 MFU | 125000 tok/s step 4730/19560 | loss 3.583968 (+0.27z)| norm 0.3140 (+2.17z)| lr 7.23e-04 | 4168.54 ms | 32.4% bf16 MFU | 125039 
tok/s step 4731/19560 | loss 3.525416 (-1.26z)| norm 0.2411 (-0.97z)| lr 7.23e-04 | 4236.79 ms | 31.9% bf16 MFU | 124974 tok/s step 4732/19560 | loss 3.589833 (+0.47z)| norm 0.2878 (+1.11z)| lr 7.23e-04 | 4164.41 ms | 32.4% bf16 MFU | 125021 tok/s step 4733/19560 | loss 3.545556 (-0.73z)| norm 0.3007 (+1.65z)| lr 7.23e-04 | 4209.88 ms | 32.1% bf16 MFU | 124996 tok/s step 4734/19560 | loss 3.610220 (+1.00z)| norm 0.2244 (-1.67z)| lr 7.23e-04 | 4165.71 ms | 32.4% bf16 MFU | 125040 tok/s step 4735/19560 | loss 3.603014 (+0.80z)| norm 0.2605 (-0.10z)| lr 7.23e-04 | 4176.76 ms | 32.3% bf16 MFU | 125064 tok/s step 4736/19560 | loss 3.535765 (-1.01z)| norm 0.2508 (-0.54z)| lr 7.23e-04 | 4164.74 ms | 32.4% bf16 MFU | 125105 tok/s step 4737/19560 | loss 3.571911 (-0.03z)| norm 0.2369 (-1.14z)| lr 7.23e-04 | 4205.40 ms | 32.1% bf16 MFU | 125083 tok/s step 4738/19560 | loss 3.549263 (-0.63z)| norm 0.2701 (+0.31z)| lr 7.23e-04 | 4166.02 ms | 32.4% bf16 MFU | 125121 tok/s step 4739/19560 | loss 3.527376 (-1.21z)| norm 0.2214 (-1.83z)| lr 7.23e-04 | 4260.88 ms | 31.7% bf16 MFU | 125018 tok/s step 4740/19560 | loss 3.592098 (+0.51z)| norm 0.2406 (-0.98z)| lr 7.23e-04 | 4169.75 ms | 32.4% bf16 MFU | 125054 tok/s step 4741/19560 | loss 3.658386 (+2.23z)| norm 0.2501 (-0.58z)| lr 7.22e-04 | 4221.63 ms | 32.0% bf16 MFU | 125011 tok/s step 4742/19560 | loss 3.565028 (-0.22z)| norm 0.2524 (-0.50z)| lr 7.22e-04 | 4746.56 ms | 28.4% bf16 MFU | 124283 tok/s step 4743/19560 | loss 3.659235 (+2.22z)| norm 0.2537 (-0.44z)| lr 7.22e-04 | 4164.63 ms | 32.4% bf16 MFU | 124363 tok/s step 4744/19560 | loss 3.606256 (+0.83z)| norm 0.2480 (-0.70z)| lr 7.22e-04 | 4265.11 ms | 31.7% bf16 MFU | 124291 tok/s step 4745/19560 | loss 3.629915 (+1.43z)| norm 0.2332 (-1.35z)| lr 7.22e-04 | 4163.85 ms | 32.4% bf16 MFU | 124372 tok/s step 4746/19560 | loss 3.570402 (-0.11z)| norm 0.2752 (+0.52z)| lr 7.22e-04 | 4155.55 ms | 32.5% bf16 MFU | 124462 tok/s step 4747/19560 | loss 3.696425 (+3.06z)| norm 0.2523 
(-0.51z)| lr 7.22e-04 | 4161.94 ms | 32.4% bf16 MFU | 124538 tok/s step 4748/19560 | loss 3.576334 (+0.02z)| norm 0.2567 (-0.32z)| lr 7.22e-04 | 4238.59 ms | 31.9% bf16 MFU | 124495 tok/s step 4749/19560 | loss 3.566665 (-0.22z)| norm 0.2522 (-0.51z)| lr 7.22e-04 | 4161.53 ms | 32.4% bf16 MFU | 124570 tok/s step 4750/19560 | loss 3.513104 (-1.56z)| norm 0.2453 (-0.81z)| lr 7.22e-04 | 4182.01 ms | 32.3% bf16 MFU | 124610 tok/s val loss 3.557086
HellaSwag: 2766/10042 = 0.275443 step 4751/19560 | loss 3.556088 (-0.47z)| norm 0.3023 (+1.71z)| lr 7.22e-04 | 4329.22 ms | 31.2% bf16 MFU | 124434 tok/s step 4752/19560 | loss 3.547723 (-0.68z)| norm 0.2864 (+0.99z)| lr 7.22e-04 | 4160.08 ms | 32.5% bf16 MFU | 124514 tok/s step 4753/19560 | loss 3.543332 (-0.78z)| norm 0.2678 (+0.17z)| lr 7.22e-04 | 4723.09 ms | 28.6% bf16 MFU | 123839 tok/s step 4754/19560 | loss 3.562022 (-0.33z)| norm 0.2508 (-0.58z)| lr 7.22e-04 | 4253.33 ms | 31.7% bf16 MFU | 123810 tok/s step 4755/19560 | loss 3.599430 (+0.63z)| norm 0.2560 (-0.35z)| lr 7.22e-04 | 4200.34 ms | 32.1% bf16 MFU | 123861 tok/s step 4756/19560 | loss 3.567388 (-0.19z)| norm 0.2391 (-1.08z)| lr 7.22e-04 | 4156.94 ms | 32.5% bf16 MFU | 123974 tok/s step 4757/19560 | loss 3.568311 (-0.16z)| norm 0.2370 (-1.16z)| lr 7.22e-04 | 4227.49 ms | 31.9% bf16 MFU | 123976 tok/s step 4758/19560 | loss 3.717326 (+3.45z)| norm 0.7286 (+9.88z)| lr 7.22e-04 | 4232.92 ms | 31.9% bf16 MFU | 123970 tok/s step 4759/19560 | loss 3.583682 (+0.20z)| norm 0.3915 (+2.58z)| lr 7.22e-04 | 4239.43 ms | 31.8% bf16 MFU | 123955 tok/s step 4760/19560 | loss 3.556258 (-0.48z)| norm 0.2923 (+0.51z)| lr 7.22e-04 | 4170.05 ms | 32.4% bf16 MFU | 124044 tok/s step 4761/19560 | loss 3.583910 (+0.20z)| norm 0.3043 (+0.75z)| lr 7.22e-04 | 4288.61 ms | 31.5% bf16 MFU | 123954 tok/s step 4762/19560 | loss 3.594003 (+0.44z)| norm 0.3110 (+0.88z)| lr 7.22e-04 | 4153.39 ms | 32.5% bf16 MFU | 124068 tok/s step 4763/19560 | loss 3.581400 (+0.13z)| norm 0.2955 (+0.55z)| lr 7.22e-04 | 4208.40 ms | 32.1% bf16 MFU | 124094 tok/s step 4764/19560 | 
loss 3.595525 (+0.48z)| norm 0.2714 (+0.05z)| lr 7.22e-04 | 4751.53 ms | 28.4% bf16 MFU | 123406 tok/s step 4765/19560 | loss 3.575644 (-0.01z)| norm 0.2624 (-0.14z)| lr 7.22e-04 | 4162.36 ms | 32.4% bf16 MFU | 123534 tok/s step 4766/19560 | loss 3.632888 (+1.38z)| norm 0.2506 (-0.38z)| lr 7.21e-04 | 4176.42 ms | 32.3% bf16 MFU | 123634 tok/s step 4767/19560 | loss 3.589626 (+0.32z)| norm 0.3048 (+0.74z)| lr 7.21e-04 | 4164.34 ms | 32.4% bf16 MFU | 123747 tok/s step 4768/19560 | loss 3.634022 (+1.39z)| norm 0.2775 (+0.17z)| lr 7.21e-04 | 4167.21 ms | 32.4% bf16 MFU | 123850 tok/s step 4769/19560 | loss 3.592248 (+0.36z)| norm 0.2831 (+0.28z)| lr 7.21e-04 | 4267.57 ms | 31.6% bf16 MFU | 123800 tok/s step 4770/19560 | loss 3.561660 (-0.40z)| norm 0.2849 (+0.31z)| lr 7.21e-04 | 4510.87 ms | 29.9% bf16 MFU | 123422 tok/s step 4771/19560 | loss 3.619153 (+1.00z)| norm 0.2602 (-0.20z)| lr 7.21e-04 | 4165.83 ms | 32.4% bf16 MFU | 123543 tok/s step 4772/19560 | loss 3.598028 (+0.50z)| norm 0.2753 (+0.13z)| lr 7.21e-04 | 4608.04 ms | 29.3% bf16 MFU | 123055 tok/s step 4773/19560 | loss 3.645151 (+1.67z)| norm 0.2553 (-0.29z)| lr 7.21e-04 | 4278.25 ms | 31.6% bf16 MFU | 123030 tok/s step 4774/19560 | loss 3.539169 (-0.97z)| norm 0.2302 (-0.81z)| lr 7.21e-04 | 4165.81 ms | 32.4% bf16 MFU | 123171 tok/s step 4775/19560 | loss 3.637665 (+1.45z)| norm 0.2507 (-0.38z)| lr 7.21e-04 | 4156.49 ms | 32.5% bf16 MFU | 123319 tok/s step 4776/19560 | loss 3.564322 (-0.36z)| norm 0.2480 (-0.43z)| lr 7.21e-04 | 4157.15 ms | 32.5% bf16 MFU | 123459 tok/s step 4777/19560 | loss 3.552114 (-0.65z)| norm 0.2705 (+0.04z)| lr 7.21e-04 | 4155.81 ms | 32.5% bf16 MFU | 123594 tok/s step 4778/19560 | loss 3.589877 (+0.27z)| norm 0.2533 (-0.32z)| lr 7.21e-04 | 4159.01 ms | 32.5% bf16 MFU | 123717 tok/s step 4779/19560 | loss 3.577262 (-0.05z)| norm 0.2564 (-0.25z)| lr 7.21e-04 | 4162.58 ms | 32.4% bf16 MFU | 123829 tok/s step 4780/19560 | loss 3.582274 (+0.08z)| norm 0.2775 (+0.18z)| lr 7.21e-04 | 
4293.83 ms | 31.4% bf16 MFU | 123743 tok/s step 4781/19560 | loss 3.563203 (-0.40z)| norm 0.2853 (+0.34z)| lr 7.21e-04 | 4165.95 ms | 32.4% bf16 MFU | 123848 tok/s step 4782/19560 | loss 3.574579 (-0.12z)| norm 0.2461 (-0.48z)| lr 7.21e-04 | 4152.77 ms | 32.5% bf16 MFU | 123968 tok/s step 4783/19560 | loss 3.566593 (-0.33z)| norm 0.2455 (-0.49z)| lr 7.21e-04 | 4160.43 ms | 32.5% bf16 MFU | 124071 tok/s step 4784/19560 | loss 3.567543 (-0.30z)| norm 0.2336 (-0.73z)| lr 7.21e-04 | 4167.09 ms | 32.4% bf16 MFU | 124158 tok/s step 4785/19560 | loss 3.532351 (-1.19z)| norm 0.2357 (-0.68z)| lr 7.21e-04 | 4168.28 ms | 32.4% bf16 MFU | 124239 tok/s step 4786/19560 | loss 3.577755 (-0.04z)| norm 0.2469 (-0.45z)| lr 7.21e-04 | 4164.69 ms | 32.4% bf16 MFU | 124322 tok/s step 4787/19560 | loss 3.577250 (-0.05z)| norm 0.2315 (-0.76z)| lr 7.21e-04 | 4162.18 ms | 32.4% bf16 MFU | 124404 tok/s step 4788/19560 | loss 3.545362 (-0.86z)| norm 0.2374 (-0.64z)| lr 7.21e-04 | 4163.55 ms | 32.4% bf16 MFU | 124480 tok/s step 4789/19560 | loss 3.571619 (-0.20z)| norm 0.2971 (+0.59z)| lr 7.21e-04 | 4192.64 ms | 32.2% bf16 MFU | 124508 tok/s step 4790/19560 | loss 3.580670 (+0.03z)| norm 0.2321 (-0.74z)| lr 7.21e-04 | 4156.94 ms | 32.5% bf16 MFU | 124589 tok/s step 4791/19560 | loss 3.510816 (-1.73z)| norm 0.2329 (-0.71z)| lr 7.20e-04 | 4252.76 ms | 31.7% bf16 MFU | 124524 tok/s step 4792/19560 | loss 3.606964 (+0.68z)| norm 0.2360 (-0.64z)| lr 7.20e-04 | 4163.83 ms | 32.4% bf16 MFU | 124593 tok/s step 4793/19560 | loss 3.571362 (-0.21z)| norm 0.3008 (+0.71z)| lr 7.20e-04 | 4162.09 ms | 32.4% bf16 MFU | 124662 tok/s step 4794/19560 | loss 3.541897 (-0.97z)| norm 0.2835 (+0.34z)| lr 7.20e-04 | 4168.94 ms | 32.4% bf16 MFU | 124717 tok/s step 4795/19560 | loss 3.546372 (-0.85z)| norm 0.2733 (+0.13z)| lr 7.20e-04 | 4168.99 ms | 32.4% bf16 MFU | 124769 tok/s step 4796/19560 | loss 3.589951 (+0.26z)| norm 0.2535 (-0.27z)| lr 7.20e-04 | 4188.38 ms | 32.2% bf16 MFU | 124789 tok/s step 4797/19560 | 
loss 3.547003 (-0.84z)| norm 0.2519 (-0.30z)| lr 7.20e-04 | 4167.49 ms | 32.4% bf16 MFU | 124840 tok/s step 4798/19560 | loss 3.600773 (+0.52z)| norm 0.2594 (-0.15z)| lr 7.20e-04 | 4172.91 ms | 32.4% bf16 MFU | 124880 tok/s step 4799/19560 | loss 3.562661 (-0.46z)| norm 0.3006 (+0.69z)| lr 7.20e-04 | 4166.99 ms | 32.4% bf16 MFU | 124927 tok/s step 4800/19560 | loss 3.593685 (+0.33z)| norm 0.2873 (+0.41z)| lr 7.20e-04 | 4162.68 ms | 32.4% bf16 MFU | 124978 tok/s step 4801/19560 | loss 3.538731 (-1.09z)| norm 0.2478 (-0.42z)| lr 7.20e-04 | 4169.98 ms | 32.4% bf16 MFU | 125016 tok/s step 4802/19560 | loss 3.585916 (+0.12z)| norm 0.2467 (-0.44z)| lr 7.20e-04 | 4155.98 ms | 32.5% bf16 MFU | 125073 tok/s step 4803/19560 | loss 3.555650 (-0.70z)| norm 0.2641 (-0.08z)| lr 7.20e-04 | 4160.53 ms | 32.5% bf16 MFU | 125120 tok/s step 4804/19560 | loss 3.589102 (+0.20z)| norm 0.2671 (-0.03z)| lr 7.20e-04 | 4177.79 ms | 32.3% bf16 MFU | 125138 tok/s step 4805/19560 | loss 3.589951 (+0.22z)| norm 0.2380 (-0.63z)| lr 7.20e-04 | 4241.26 ms | 31.8% bf16 MFU | 125062 tok/s step 4806/19560 | loss 3.522007 (-1.56z)| norm 0.2356 (-0.68z)| lr 7.20e-04 | 4254.52 ms | 31.7% bf16 MFU | 124971 tok/s step 4807/19560 | loss 3.602656 (+0.57z)| norm 0.2622 (-0.13z)| lr 7.20e-04 | 4165.84 ms | 32.4% bf16 MFU | 125015 tok/s step 4808/19560 | loss 3.578150 (-0.08z)| norm 0.2475 (-0.44z)| lr 7.20e-04 | 4162.95 ms | 32.4% bf16 MFU | 125061 tok/s step 4809/19560 | loss 3.631305 (+1.31z)| norm 0.2512 (-0.35z)| lr 7.20e-04 | 4169.98 ms | 32.4% bf16 MFU | 125095 tok/s step 4810/19560 | loss 3.533771 (-1.24z)| norm 0.2413 (-0.56z)| lr 7.20e-04 | 4370.15 ms | 30.9% bf16 MFU | 124838 tok/s step 4811/19560 | loss 3.537478 (-1.13z)| norm 0.2405 (-0.58z)| lr 7.20e-04 | 4178.10 ms | 32.3% bf16 MFU | 124871 tok/s step 4812/19560 | loss 3.559881 (-0.55z)| norm 0.2425 (-0.53z)| lr 7.20e-04 | 4164.20 ms | 32.4% bf16 MFU | 124922 tok/s step 4813/19560 | loss 3.548016 (-0.85z)| norm 0.2723 (+0.09z)| lr 7.20e-04 | 
4171.10 ms | 32.4% bf16 MFU | 124961 tok/s step 4814/19560 | loss 3.550563 (-0.79z)| norm 0.2339 (-0.70z)| lr 7.20e-04 | 4186.74 ms | 32.2% bf16 MFU | 124974 tok/s step 4815/19560 | loss 3.613719 (+0.87z)| norm 0.2523 (-0.31z)| lr 7.19e-04 | 4175.87 ms | 32.3% bf16 MFU | 125003 tok/s step 4816/19560 | loss 3.519228 (-1.60z)| norm 0.3042 (+0.76z)| lr 7.19e-04 | 4172.62 ms | 32.4% bf16 MFU | 125035 tok/s step 4817/19560 | loss 3.533891 (-1.20z)| norm 0.2913 (+0.49z)| lr 7.19e-04 | 4175.59 ms | 32.3% bf16 MFU | 125062 tok/s step 4818/19560 | loss 3.605511 (+0.66z)| norm 0.2901 (+0.46z)| lr 7.19e-04 | 4167.68 ms | 32.4% bf16 MFU | 125099 tok/s step 4819/19560 | loss 3.560083 (-0.51z)| norm 0.3165 (+1.00z)| lr 7.19e-04 | 4171.44 ms | 32.4% bf16 MFU | 125128 tok/s step 4820/19560 | loss 3.639301 (+1.53z)| norm 0.2537 (-0.30z)| lr 7.19e-04 | 4267.08 ms | 31.6% bf16 MFU | 125015 tok/s step 4821/19560 | loss 3.562550 (-0.46z)| norm 0.2674 (-0.02z)| lr 7.19e-04 | 4159.90 ms | 32.5% bf16 MFU | 125066 tok/s step 4822/19560 | loss 3.605363 (+0.65z)| norm 0.2647 (-0.08z)| lr 7.19e-04 | 4167.94 ms | 32.4% bf16 MFU | 125102 tok/s step 4823/19560 | loss 3.534122 (-1.20z)| norm 0.2705 (+0.04z)| lr 7.19e-04 | 4313.56 ms | 31.3% bf16 MFU | 124924 tok/s step 4824/19560 | loss 3.552947 (-0.70z)| norm 0.2663 (-0.04z)| lr 7.19e-04 | 4151.25 ms | 32.5% bf16 MFU | 124993 tok/s step 4825/19560 | loss 3.504741 (-1.92z)| norm 0.2365 (-0.66z)| lr 7.19e-04 | 4156.22 ms | 32.5% bf16 MFU | 125050 tok/s step 4826/19560 | loss 3.498795 (-2.02z)| norm 0.2553 (-0.27z)| lr 7.19e-04 | 4180.16 ms | 32.3% bf16 MFU | 125069 tok/s step 4827/19560 | loss 3.584691 (+0.15z)| norm 0.2769 (+0.18z)| lr 7.19e-04 | 4162.11 ms | 32.4% bf16 MFU | 125114 tok/s step 4828/19560 | loss 3.569301 (-0.25z)| norm 0.2810 (+0.26z)| lr 7.19e-04 | 4174.97 ms | 32.3% bf16 MFU | 125137 tok/s step 4829/19560 | loss 3.566293 (-0.31z)| norm 0.2739 (+0.11z)| lr 7.19e-04 | 4171.13 ms | 32.4% bf16 MFU | 125165 tok/s step 4830/19560 | 
loss 3.541695 (-0.92z)| norm 0.2582 (-0.22z)| lr 7.19e-04 | 4203.12 ms | 32.1% bf16 MFU | 125144 tok/s step 4831/19560 | loss 3.573535 (-0.11z)| norm 0.2764 (+0.16z)| lr 7.19e-04 | 4165.94 ms | 32.4% bf16 MFU | 125179 tok/s step 4832/19560 | loss 3.561343 (-0.41z)| norm 0.2568 (-0.26z)| lr 7.19e-04 | 4169.83 ms | 32.4% bf16 MFU | 125207 tok/s step 4833/19560 | loss 3.556641 (-0.52z)| norm 0.2718 (+0.05z)| lr 7.19e-04 | 4165.70 ms | 32.4% bf16 MFU | 125239 tok/s step 4834/19560 | loss 3.559861 (-0.43z)| norm 0.2754 (+0.12z)| lr 7.19e-04 | 4173.64 ms | 32.4% bf16 MFU | 125258 tok/s step 4835/19560 | loss 3.620035 (+1.11z)| norm 0.2519 (-0.37z)| lr 7.19e-04 | 4164.51 ms | 32.4% bf16 MFU | 125290 tok/s step 4836/19560 | loss 3.554804 (-0.58z)| norm 0.2248 (-0.92z)| lr 7.19e-04 | 4157.68 ms | 32.5% bf16 MFU | 125331 tok/s step 4837/19560 | loss 3.576594 (-0.01z)| norm 0.2509 (-0.37z)| lr 7.19e-04 | 4191.80 ms | 32.2% bf16 MFU | 125318 tok/s step 4838/19560 | loss 3.564808 (-0.31z)| norm 0.2545 (-0.29z)| lr 7.19e-04 | 4174.08 ms | 32.3% bf16 MFU | 125332 tok/s step 4839/19560 | loss 3.538646 (-0.98z)| norm 0.2327 (-0.74z)| lr 7.19e-04 | 4158.00 ms | 32.5% bf16 MFU | 125370 tok/s step 4840/19560 | loss 3.540326 (-0.92z)| norm 0.2338 (-0.71z)| lr 7.18e-04 | 4164.21 ms | 32.4% bf16 MFU | 125397 tok/s step 4841/19560 | loss 3.456177 (-3.00z)| norm 0.2739 (+0.13z)| lr 7.18e-04 | 4255.36 ms | 31.7% bf16 MFU | 125287 tok/s step 4842/19560 | loss 3.617676 (+1.09z)| norm 0.2340 (-0.69z)| lr 7.18e-04 | 4162.00 ms | 32.4% bf16 MFU | 125322 tok/s step 4843/19560 | loss 3.528309 (-1.15z)| norm 0.2805 (+0.27z)| lr 7.18e-04 | 4160.73 ms | 32.5% bf16 MFU | 125356 tok/s step 4844/19560 | loss 3.605786 (+0.80z)| norm 0.2870 (+0.40z)| lr 7.18e-04 | 4607.85 ms | 29.3% bf16 MFU | 124777 tok/s step 4845/19560 | loss 3.576288 (+0.07z)| norm 0.2480 (-0.40z)| lr 7.18e-04 | 4160.40 ms | 32.5% bf16 MFU | 124839 tok/s step 4846/19560 | loss 3.592970 (+0.49z)| norm 0.2830 (+0.33z)| lr 7.18e-04 | 
4170.38 ms | 32.4% bf16 MFU | 124883 tok/s step 4847/19560 | loss 3.558039 (-0.41z)| norm 0.2910 (+0.50z)| lr 7.18e-04 | 4174.70 ms | 32.3% bf16 MFU | 124918 tok/s step 4848/19560 | loss 3.516502 (-1.48z)| norm 0.2645 (-0.05z)| lr 7.18e-04 | 4169.06 ms | 32.4% bf16 MFU | 124960 tok/s step 4849/19560 | loss 3.534032 (-1.00z)| norm 0.2632 (-0.07z)| lr 7.18e-04 | 4197.59 ms | 32.2% bf16 MFU | 124957 tok/s step 4850/19560 | loss 3.529061 (-1.15z)| norm 0.3122 (+0.94z)| lr 7.18e-04 | 4155.63 ms | 32.5% bf16 MFU | 125018 tok/s step 4851/19560 | loss 3.547982 (-0.66z)| norm 0.2648 (-0.05z)| lr 7.18e-04 | 4180.43 ms | 32.3% bf16 MFU | 125038 tok/s step 4852/19560 | loss 3.565715 (-0.18z)| norm 0.2765 (+0.20z)| lr 7.18e-04 | 4162.65 ms | 32.4% bf16 MFU | 125083 tok/s step 4853/19560 | loss 3.548017 (-0.65z)| norm 0.2598 (-0.15z)| lr 7.18e-04 | 4166.27 ms | 32.4% bf16 MFU | 125121 tok/s step 4854/19560 | loss 3.648848 (+1.98z)| norm 0.2707 (+0.07z)| lr 7.18e-04 | 4251.82 ms | 31.8% bf16 MFU | 125030 tok/s step 4855/19560 | loss 3.488116 (-2.18z)| norm 0.2952 (+0.57z)| lr 7.18e-04 | 4162.43 ms | 32.4% bf16 MFU | 125077 tok/s step 4856/19560 | loss 3.573248 (+0.01z)| norm 0.2651 (-0.05z)| lr 7.18e-04 | 4160.84 ms | 32.4% bf16 MFU | 125123 tok/s step 4857/19560 | loss 3.552063 (-0.52z)| norm 0.2659 (-0.03z)| lr 7.18e-04 | 4157.99 ms | 32.5% bf16 MFU | 125172 tok/s step 4858/19560 | loss 3.575117 (+0.07z)| norm 0.2580 (-0.19z)| lr 7.18e-04 | 4162.01 ms | 32.4% bf16 MFU | 125212 tok/s step 4859/19560 | loss 3.596860 (+0.62z)| norm 0.2829 (+0.33z)| lr 7.18e-04 | 4329.59 ms | 31.2% bf16 MFU | 125006 tok/s step 4860/19560 | loss 3.565845 (-0.17z)| norm 0.2335 (-0.70z)| lr 7.18e-04 | 4169.40 ms | 32.4% bf16 MFU | 125043 tok/s step 4861/19560 | loss 3.542095 (-0.79z)| norm 0.2494 (-0.36z)| lr 7.18e-04 | 4155.59 ms | 32.5% bf16 MFU | 125099 tok/s step 4862/19560 | loss 3.572984 (+0.02z)| norm 0.2351 (-0.66z)| lr 7.18e-04 | 4180.35 ms | 32.3% bf16 MFU | 125115 tok/s step 4863/19560 | 
loss 3.546019 (-0.67z)| norm 0.2460 (-0.43z)| lr 7.18e-04 | 4204.37 ms | 32.1% bf16 MFU | 125094 tok/s step 4864/19560 | loss 3.589180 (+0.44z)| norm 0.2428 (-0.49z)| lr 7.17e-04 | 4163.08 ms | 32.4% bf16 MFU | 125136 tok/s step 4865/19560 | loss 3.563190 (-0.23z)| norm 0.2183 (-1.00z)| lr 7.17e-04 | 4171.64 ms | 32.4% bf16 MFU | 125163 tok/s step 4866/19560 | loss 3.584932 (+0.33z)| norm 0.2598 (-0.13z)| lr 7.17e-04 | 4163.06 ms | 32.4% bf16 MFU | 125202 tok/s step 4867/19560 | loss 3.583619 (+0.28z)| norm 0.2462 (-0.42z)| lr 7.17e-04 | 4163.00 ms | 32.4% bf16 MFU | 125239 tok/s step 4868/19560 | loss 3.527548 (-1.17z)| norm 0.2378 (-0.60z)| lr 7.17e-04 | 4167.79 ms | 32.4% bf16 MFU | 125267 tok/s step 4869/19560 | loss 3.555131 (-0.44z)| norm 0.2202 (-0.96z)| lr 7.17e-04 | 4161.16 ms | 32.4% bf16 MFU | 125303 tok/s step 4870/19560 | loss 3.678413 (+2.74z)| norm 0.2416 (-0.51z)| lr 7.17e-04 | 4163.16 ms | 32.4% bf16 MFU | 125335 tok/s step 4871/19560 | loss 3.596181 (+0.64z)| norm 0.2644 (-0.04z)| lr 7.17e-04 | 4181.97 ms | 32.3% bf16 MFU | 125336 tok/s step 4872/19560 | loss 3.523609 (-1.25z)| norm 0.2630 (-0.07z)| lr 7.17e-04 | 4188.26 ms | 32.2% bf16 MFU | 125329 tok/s step 4873/19560 | loss 3.579238 (+0.22z)| norm 0.2568 (-0.20z)| lr 7.17e-04 | 5101.24 ms | 26.5% bf16 MFU | 124201 tok/s step 4874/19560 | loss 3.545156 (-0.68z)| norm 0.2442 (-0.46z)| lr 7.17e-04 | 5427.29 ms | 24.9% bf16 MFU | 122821 tok/s step 4875/19560 | loss 3.630638 (+1.65z)| norm 0.2456 (-0.43z)| lr 7.17e-04 | 4730.19 ms | 28.5% bf16 MFU | 122222 tok/s step 4876/19560 | loss 3.542881 (-0.74z)| norm 0.2497 (-0.34z)| lr 7.17e-04 | 4662.70 ms | 29.0% bf16 MFU | 121733 tok/s step 4877/19560 | loss 3.593683 (+0.64z)| norm 0.2304 (-0.74z)| lr 7.17e-04 | 4591.28 ms | 29.4% bf16 MFU | 121356 tok/s step 4878/19560 | loss 3.597500 (+0.73z)| norm 0.2519 (-0.29z)| lr 7.17e-04 | 4177.14 ms | 32.3% bf16 MFU | 121564 tok/s step 4879/19560 | loss 3.541778 (-0.79z)| norm 0.2583 (-0.15z)| lr 7.17e-04 | 
4337.35 ms | 31.1% bf16 MFU | 121530 tok/s step 4880/19560 | loss 3.603143 (+0.88z)| norm 0.2594 (-0.13z)| lr 7.17e-04 | 4416.64 ms | 30.6% bf16 MFU | 121388 tok/s step 4881/19560 | loss 3.555662 (-0.43z)| norm 0.2469 (-0.38z)| lr 7.17e-04 | 4443.99 ms | 30.4% bf16 MFU | 121218 tok/s step 4882/19560 | loss 3.538587 (-0.89z)| norm 0.2738 (+0.18z)| lr 7.17e-04 | 4292.74 ms | 31.5% bf16 MFU | 121264 tok/s step 4883/19560 | loss 3.619734 (+1.32z)| norm 0.2761 (+0.22z)| lr 7.17e-04 | 4207.59 ms | 32.1% bf16 MFU | 121431 tok/s step 4884/19560 | loss 3.556900 (-0.39z)| norm 0.2526 (-0.27z)| lr 7.17e-04 | 4408.60 ms | 30.6% bf16 MFU | 121305 tok/s step 4885/19560 | loss 3.573098 (+0.05z)| norm 0.2318 (-0.71z)| lr 7.17e-04 | 4173.73 ms | 32.3% bf16 MFU | 121521 tok/s step 4886/19560 | loss 3.535069 (-1.01z)| norm 0.2564 (-0.23z)| lr 7.17e-04 | 4608.01 ms | 29.3% bf16 MFU | 121134 tok/s step 4887/19560 | loss 3.567534 (-0.06z)| norm 0.2297 (-1.44z)| lr 7.17e-04 | 4349.14 ms | 31.0% bf16 MFU | 121105 tok/s step 4888/19560 | loss 3.621016 (+1.46z)| norm 0.2439 (-0.76z)| lr 7.17e-04 | 4407.96 ms | 30.6% bf16 MFU | 120996 tok/s step 4889/19560 | loss 3.570441 (+0.01z)| norm 0.2314 (-1.34z)| lr 7.16e-04 | 4330.81 ms | 31.2% bf16 MFU | 121000 tok/s step 4890/19560 | loss 3.571505 (+0.05z)| norm 0.2256 (-1.60z)| lr 7.16e-04 | 4155.94 ms | 32.5% bf16 MFU | 121257 tok/s step 4891/19560 | loss 3.543614 (-0.75z)| norm 0.2464 (-0.59z)| lr 7.16e-04 | 4197.03 ms | 32.2% bf16 MFU | 121440 tok/s step 4892/19560 | loss 3.616640 (+1.34z)| norm 0.2573 (-0.06z)| lr 7.16e-04 | 4248.29 ms | 31.8% bf16 MFU | 121539 tok/s step 4893/19560 | loss 3.587204 (+0.50z)| norm 0.2694 (+0.52z)| lr 7.16e-04 | 4356.95 ms | 31.0% bf16 MFU | 121479 tok/s step 4894/19560 | loss 3.582035 (+0.37z)| norm 0.2904 (+1.51z)| lr 7.16e-04 | 4163.46 ms | 32.4% bf16 MFU | 121701 tok/s step 4895/19560 | loss 3.580147 (+0.31z)| norm 0.3068 (+2.29z)| lr 7.16e-04 | 4305.21 ms | 31.4% bf16 MFU | 121705 tok/s step 4896/19560 | 
loss 3.554358 (-0.42z)| norm 0.2922 (+1.58z)| lr 7.16e-04 | 4221.77 ms | 32.0% bf16 MFU | 121829 tok/s step 4897/19560 | loss 3.512788 (-1.61z)| norm 0.2619 (+0.14z)| lr 7.16e-04 | 4234.88 ms | 31.9% bf16 MFU | 121928 tok/s step 4898/19560 | loss 3.627130 (+1.69z)| norm 0.2816 (+1.09z)| lr 7.16e-04 | 4166.50 ms | 32.4% bf16 MFU | 122123 tok/s step 4899/19560 | loss 3.591680 (+0.68z)| norm 0.2891 (+1.43z)| lr 7.16e-04 | 4399.17 ms | 30.7% bf16 MFU | 121976 tok/s step 4900/19560 | loss 3.598066 (+0.86z)| norm 0.3228 (+2.91z)| lr 7.16e-04 | 4170.29 ms | 32.4% bf16 MFU | 122163 tok/s step 4901/19560 | loss 3.634989 (+1.94z)| norm 0.2662 (+0.30z)| lr 7.16e-04 | 4166.68 ms | 32.4% bf16 MFU | 122346 tok/s step 4902/19560 | loss 3.534169 (-0.99z)| norm 0.2831 (+1.07z)| lr 7.16e-04 | 4295.39 ms | 31.4% bf16 MFU | 122332 tok/s step 4903/19560 | loss 3.571848 (+0.12z)| norm 0.2659 (+0.27z)| lr 7.16e-04 | 4507.52 ms | 30.0% bf16 MFU | 122031 tok/s step 4904/19560 | loss 3.591635 (+0.70z)| norm 0.2862 (+1.19z)| lr 7.16e-04 | 4425.83 ms | 30.5% bf16 MFU | 121852 tok/s step 4905/19560 | loss 3.568114 (+0.00z)| norm 0.2672 (+0.32z)| lr 7.16e-04 | 4240.19 ms | 31.8% bf16 MFU | 121942 tok/s step 4906/19560 | loss 3.584672 (+0.49z)| norm 0.2368 (-1.07z)| lr 7.16e-04 | 4267.92 ms | 31.6% bf16 MFU | 121987 tok/s step 4907/19560 | loss 3.549552 (-0.54z)| norm 0.2542 (-0.28z)| lr 7.16e-04 | 4179.94 ms | 32.3% bf16 MFU | 122159 tok/s step 4908/19560 | loss 3.559568 (-0.24z)| norm 0.2481 (-0.54z)| lr 7.16e-04 | 4177.60 ms | 32.3% bf16 MFU | 122326 tok/s step 4909/19560 | loss 3.571197 (+0.10z)| norm 0.2754 (+0.72z)| lr 7.16e-04 | 4181.28 ms | 32.3% bf16 MFU | 122480 tok/s step 4910/19560 | loss 3.720691 (+4.17z)| norm 0.2486 (-0.52z)| lr 7.16e-04 | 4171.97 ms | 32.4% bf16 MFU | 122639 tok/s step 4911/19560 | loss 3.584607 (+0.43z)| norm 0.2553 (-0.21z)| lr 7.16e-04 | 4171.23 ms | 32.4% bf16 MFU | 122792 tok/s step 4912/19560 | loss 3.532444 (-0.99z)| norm 0.2720 (+0.54z)| lr 7.16e-04 | 
4178.07 ms | 32.3% bf16 MFU | 122926 tok/s step 4913/19560 | loss 3.622413 (+1.44z)| norm 0.2783 (+0.82z)| lr 7.15e-04 | 4179.70 ms | 32.3% bf16 MFU | 123052 tok/s step 4914/19560 | loss 3.554045 (-0.41z)| norm 0.2705 (+0.45z)| lr 7.15e-04 | 4178.07 ms | 32.3% bf16 MFU | 123174 tok/s step 4915/19560 | loss 3.583095 (+0.38z)| norm 0.2358 (-1.16z)| lr 7.15e-04 | 4173.97 ms | 32.3% bf16 MFU | 123295 tok/s step 4916/19560 | loss 3.604274 (+0.94z)| norm 0.2617 (+0.03z)| lr 7.15e-04 | 4167.06 ms | 32.4% bf16 MFU | 123421 tok/s step 4917/19560 | loss 3.591609 (+0.59z)| norm 0.2490 (-0.55z)| lr 7.15e-04 | 4172.94 ms | 32.4% bf16 MFU | 123532 tok/s step 4918/19560 | loss 3.673612 (+2.71z)| norm 0.2423 (-0.88z)| lr 7.15e-04 | 4200.69 ms | 32.1% bf16 MFU | 123596 tok/s step 4919/19560 | loss 3.565957 (-0.14z)| norm 0.2282 (-1.54z)| lr 7.15e-04 | 4175.78 ms | 32.3% bf16 MFU | 123694 tok/s step 4920/19560 | loss 3.607100 (+0.96z)| norm 0.2611 (+0.01z)| lr 7.15e-04 | 4193.82 ms | 32.2% bf16 MFU | 123760 tok/s step 4921/19560 | loss 3.575615 (+0.12z)| norm 0.2573 (-0.16z)| lr 7.15e-04 | 4180.30 ms | 32.3% bf16 MFU | 123843 tok/s step 4922/19560 | loss 3.555023 (-0.43z)| norm 0.2881 (+1.33z)| lr 7.15e-04 | 4214.26 ms | 32.0% bf16 MFU | 123871 tok/s step 4923/19560 | loss 3.571339 (-0.00z)| norm 0.2514 (-0.43z)| lr 7.15e-04 | 4180.57 ms | 32.3% bf16 MFU | 123948 tok/s step 4924/19560 | loss 3.536560 (-0.92z)| norm 0.2514 (-0.43z)| lr 7.15e-04 | 4255.58 ms | 31.7% bf16 MFU | 123911 tok/s step 4925/19560 | loss 3.570760 (-0.01z)| norm 0.2450 (-0.74z)| lr 7.15e-04 | 4169.57 ms | 32.4% bf16 MFU | 124002 tok/s step 4926/19560 | loss 3.529910 (-1.08z)| norm 0.2608 (+0.02z)| lr 7.15e-04 | 4367.60 ms | 30.9% bf16 MFU | 123804 tok/s step 4927/19560 | loss 3.556228 (-0.38z)| norm 0.2344 (-1.23z)| lr 7.15e-04 | 4296.35 ms | 31.4% bf16 MFU | 123716 tok/s step 4928/19560 | loss 3.570173 (-0.01z)| norm 0.2347 (-1.20z)| lr 7.15e-04 | 4293.02 ms | 31.5% bf16 MFU | 123636 tok/s step 4929/19560 | 
loss 3.598829 (+0.75z)| norm 0.2401 (-0.93z)| lr 7.15e-04 | 4184.63 ms | 32.3% bf16 MFU | 123719 tok/s step 4930/19560 | loss 3.536983 (-0.89z)| norm 0.2588 (-0.03z)| lr 7.15e-04 | 4173.00 ms | 32.4% bf16 MFU | 123815 tok/s step 4931/19560 | loss 3.633410 (+1.64z)| norm 0.2802 (+1.00z)| lr 7.15e-04 | 4163.05 ms | 32.4% bf16 MFU | 123921 tok/s step 4932/19560 | loss 3.673063 (+2.60z)| norm 0.2579 (-0.08z)| lr 7.15e-04 | 5654.25 ms | 23.9% bf16 MFU | 122361 tok/s step 4933/19560 | loss 3.582054 (+0.27z)| norm 0.2613 (+0.08z)| lr 7.15e-04 | 4176.49 ms | 32.3% bf16 MFU | 122520 tok/s step 4934/19560 | loss 3.565261 (-0.17z)| norm 0.2666 (+0.33z)| lr 7.15e-04 | 4178.43 ms | 32.3% bf16 MFU | 122667 tok/s step 4935/19560 | loss 3.534530 (-0.95z)| norm 0.2570 (-0.14z)| lr 7.15e-04 | 4170.39 ms | 32.4% bf16 MFU | 122820 tok/s step 4936/19560 | loss 3.541007 (-0.78z)| norm 0.2700 (+0.49z)| lr 7.15e-04 | 4183.32 ms | 32.3% bf16 MFU | 122945 tok/s step 4937/19560 | loss 3.704658 (+3.30z)| norm 0.2745 (+0.70z)| lr 7.14e-04 | 4181.57 ms | 32.3% bf16 MFU | 123067 tok/s step 4938/19560 | loss 3.568487 (-0.09z)| norm 0.2776 (+0.83z)| lr 7.14e-04 | 4166.72 ms | 32.4% bf16 MFU | 123205 tok/s step 4939/19560 | loss 3.609978 (+0.93z)| norm 0.2720 (+0.55z)| lr 7.14e-04 | 4292.48 ms | 31.5% bf16 MFU | 123152 tok/s step 4940/19560 | loss 3.627891 (+1.36z)| norm 0.2733 (+0.60z)| lr 7.14e-04 | 4178.75 ms | 32.3% bf16 MFU | 123268 tok/s step 4941/19560 | loss 3.587401 (+0.35z)| norm 0.2681 (+0.35z)| lr 7.14e-04 | 4347.47 ms | 31.1% bf16 MFU | 123134 tok/s step 4942/19560 | loss 3.613802 (+0.99z)| norm 0.2321 (-1.41z)| lr 7.14e-04 | 4180.47 ms | 32.3% bf16 MFU | 123248 tok/s step 4943/19560 | loss 3.599442 (+0.63z)| norm 0.2773 (+0.79z)| lr 7.14e-04 | 4221.05 ms | 32.0% bf16 MFU | 123296 tok/s step 4944/19560 | loss 3.528572 (-1.13z)| norm 0.2519 (-0.44z)| lr 7.14e-04 | 4171.49 ms | 32.4% bf16 MFU | 123415 tok/s step 4945/19560 | loss 3.590005 (+0.39z)| norm 0.2662 (+0.28z)| lr 7.14e-04 | 
4174.73 ms | 32.3% bf16 MFU | 123524 tok/s step 4946/19560 | loss 3.578793 (+0.12z)| norm 0.2564 (-0.19z)| lr 7.14e-04 | 4172.59 ms | 32.4% bf16 MFU | 123630 tok/s step 4947/19560 | loss 3.545508 (-0.71z)| norm 0.2268 (-1.70z)| lr 7.14e-04 | 4163.69 ms | 32.4% bf16 MFU | 123745 tok/s step 4948/19560 | loss 3.550886 (-0.56z)| norm 0.2528 (-0.35z)| lr 7.14e-04 | 4187.56 ms | 32.2% bf16 MFU | 123818 tok/s step 4949/19560 | loss 3.508637 (-1.60z)| norm 0.2317 (-1.42z)| lr 7.14e-04 | 4179.26 ms | 32.3% bf16 MFU | 123899 tok/s step 4950/19560 | loss 3.571277 (-0.03z)| norm 0.2508 (-0.43z)| lr 7.14e-04 | 4179.97 ms | 32.3% bf16 MFU | 123976 tok/s step 4951/19560 | loss 3.549994 (-0.57z)| norm 0.2449 (-0.73z)| lr 7.14e-04 | 4183.50 ms | 32.3% bf16 MFU | 124043 tok/s step 4952/19560 | loss 3.548806 (-0.60z)| norm 0.2306 (-1.43z)| lr 7.14e-04 | 4373.84 ms | 30.9% bf16 MFU | 123834 tok/s step 4953/19560 | loss 3.625864 (+1.31z)| norm 0.2474 (-0.58z)| lr 7.14e-04 | 4208.82 ms | 32.1% bf16 MFU | 123871 tok/s step 4954/19560 | loss 3.614773 (+1.02z)| norm 0.2909 (+1.61z)| lr 7.14e-04 | 4326.84 ms | 31.2% bf16 MFU | 123736 tok/s step 4955/19560 | loss 3.544918 (-0.74z)| norm 0.2878 (+1.44z)| lr 7.14e-04 | 4211.19 ms | 32.1% bf16 MFU | 123774 tok/s step 4956/19560 | loss 3.531107 (-1.08z)| norm 0.2905 (+1.56z)| lr 7.14e-04 | 4182.51 ms | 32.3% bf16 MFU | 123853 tok/s step 4957/19560 | loss 3.544528 (-0.74z)| norm 0.2741 (+0.74z)| lr 7.14e-04 | 4216.58 ms | 32.0% bf16 MFU | 123877 tok/s step 4958/19560 | loss 3.533800 (-1.00z)| norm 0.2768 (+0.87z)| lr 7.14e-04 | 4170.60 ms | 32.4% bf16 MFU | 123969 tok/s step 4959/19560 | loss 3.636446 (+1.55z)| norm 0.2889 (+1.46z)| lr 7.14e-04 | 4202.87 ms | 32.1% bf16 MFU | 124008 tok/s step 4960/19560 | loss 3.599486 (+0.62z)| norm 0.2845 (+1.22z)| lr 7.13e-04 | 4270.83 ms | 31.6% bf16 MFU | 123945 tok/s step 4961/19560 | loss 3.562871 (-0.29z)| norm 0.2588 (-0.04z)| lr 7.13e-04 | 4770.19 ms | 28.3% bf16 MFU | 123244 tok/s step 4962/19560 | 
loss 3.634104 (+1.46z)| norm 0.2575 (-0.10z)| lr 7.13e-04 | 4191.20 ms | 32.2% bf16 MFU | 123336 tok/s step 4963/19560 | loss 3.563743 (-0.27z)| norm 0.2238 (-1.74z)| lr 7.13e-04 | 4176.97 ms | 32.3% bf16 MFU | 123445 tok/s step 4964/19560 | loss 3.613134 (+0.94z)| norm 0.2586 (-0.05z)| lr 7.13e-04 | 4180.64 ms | 32.3% bf16 MFU | 123543 tok/s step 4965/19560 | loss 3.540716 (-0.84z)| norm 0.2544 (-0.25z)| lr 7.13e-04 | 4288.46 ms | 31.5% bf16 MFU | 123479 tok/s step 4966/19560 | loss 3.576547 (+0.04z)| norm 0.2414 (-0.89z)| lr 7.13e-04 | 4170.57 ms | 32.4% bf16 MFU | 123591 tok/s step 4967/19560 | loss 3.617490 (+1.03z)| norm 0.2423 (-0.85z)| lr 7.13e-04 | 4165.41 ms | 32.4% bf16 MFU | 123704 tok/s step 4968/19560 | loss 3.603335 (+0.67z)| norm 0.2591 (-0.03z)| lr 7.13e-04 | 4360.04 ms | 31.0% bf16 MFU | 123532 tok/s step 4969/19560 | loss 3.748757 (+4.07z)| norm 0.2334 (-1.29z)| lr 7.13e-04 | 4178.30 ms | 32.3% bf16 MFU | 123629 tok/s step 4970/19560 | loss 3.566511 (-0.27z)| norm 0.2739 (+0.71z)| lr 7.13e-04 | 4233.58 ms | 31.9% bf16 MFU | 123640 tok/s step 4971/19560 | loss 3.567184 (-0.27z)| norm 0.2701 (+0.53z)| lr 7.13e-04 | 4162.76 ms | 32.4% bf16 MFU | 123755 tok/s step 4972/19560 | loss 3.627633 (+1.18z)| norm 0.2704 (+0.55z)| lr 7.13e-04 | 4219.51 ms | 32.0% bf16 MFU | 123780 tok/s step 4973/19560 | loss 3.509468 (-1.62z)| norm 0.2831 (+1.17z)| lr 7.13e-04 | 4171.98 ms | 32.4% bf16 MFU | 123874 tok/s step 4974/19560 | loss 3.537711 (-0.94z)| norm 0.2615 (+0.10z)| lr 7.13e-04 | 4182.31 ms | 32.3% bf16 MFU | 123948 tok/s step 4975/19560 | loss 3.562987 (-0.34z)| norm 0.2741 (+0.74z)| lr 7.13e-04 | 4191.60 ms | 32.2% bf16 MFU | 124005 tok/s step 4976/19560 | loss 3.542819 (-0.83z)| norm 0.3239 (+3.13z)| lr 7.13e-04 | 4337.93 ms | 31.1% bf16 MFU | 123848 tok/s step 4977/19560 | loss 3.580398 (+0.05z)| norm 0.2743 (+0.70z)| lr 7.13e-04 | 4193.21 ms | 32.2% bf16 MFU | 123907 tok/s step 4978/19560 | loss 3.565275 (-0.31z)| norm 0.2675 (+0.39z)| lr 7.13e-04 | 
4209.33 ms | 32.1% bf16 MFU | 123939 tok/s step 4979/19560 | loss 3.545002 (-0.80z)| norm 0.2443 (-0.76z)| lr 7.13e-04 | 4183.36 ms | 32.3% bf16 MFU | 124009 tok/s step 4980/19560 | loss 3.478552 (-2.33z)| norm 0.2536 (-0.29z)| lr 7.13e-04 | 4180.89 ms | 32.3% bf16 MFU | 124078 tok/s step 4981/19560 | loss 3.521432 (-1.31z)| norm 0.2483 (-0.55z)| lr 7.13e-04 | 4185.13 ms | 32.3% bf16 MFU | 124138 tok/s step 4982/19560 | loss 3.525585 (-1.20z)| norm 0.2274 (-1.56z)| lr 7.13e-04 | 4191.61 ms | 32.2% bf16 MFU | 124185 tok/s step 4983/19560 | loss 3.533185 (-1.04z)| norm 0.2601 (+0.07z)| lr 7.13e-04 | 4198.47 ms | 32.2% bf16 MFU | 124220 tok/s step 4984/19560 | loss 3.595736 (+0.44z)| norm 0.2531 (-0.27z)| lr 7.12e-04 | 4168.45 ms | 32.4% bf16 MFU | 124298 tok/s step 4985/19560 | loss 3.632051 (+1.29z)| norm 0.2909 (+1.60z)| lr 7.12e-04 | 4227.27 ms | 31.9% bf16 MFU | 124284 tok/s step 4986/19560 | loss 3.505048 (-1.69z)| norm 0.3003 (+2.02z)| lr 7.12e-04 | 4174.85 ms | 32.3% bf16 MFU | 124349 tok/s step 4987/19560 | loss 3.560781 (-0.38z)| norm 0.2555 (-0.16z)| lr 7.12e-04 | 4387.12 ms | 30.8% bf16 MFU | 124107 tok/s step 4988/19560 | loss 3.555696 (-0.49z)| norm 0.2688 (+0.48z)| lr 7.12e-04 | 4220.64 ms | 32.0% bf16 MFU | 124112 tok/s step 4989/19560 | loss 3.580769 (+0.09z)| norm 0.2677 (+0.42z)| lr 7.12e-04 | 4318.81 ms | 31.3% bf16 MFU | 123977 tok/s step 4990/19560 | loss 3.558522 (-0.43z)| norm 0.2707 (+0.55z)| lr 7.12e-04 | 4164.46 ms | 32.4% bf16 MFU | 124073 tok/s step 4991/19560 | loss 3.549009 (-0.66z)| norm 0.2510 (-0.42z)| lr 7.12e-04 | 4243.05 ms | 31.8% bf16 MFU | 124047 tok/s step 4992/19560 | loss 3.603563 (+0.62z)| norm 0.2515 (-0.41z)| lr 7.12e-04 | 4177.40 ms | 32.3% bf16 MFU | 124120 tok/s step 4993/19560 | loss 3.574097 (-0.07z)| norm 0.2536 (-0.32z)| lr 7.12e-04 | 4197.90 ms | 32.2% bf16 MFU | 124159 tok/s step 4994/19560 | loss 3.578175 (+0.02z)| norm 0.2460 (-0.70z)| lr 7.12e-04 | 4178.22 ms | 32.3% bf16 MFU | 124225 tok/s step 4995/19560 | 
loss 3.540737 (-0.84z)| norm 0.2670 (+0.35z)| lr 7.12e-04 | 4190.96 ms | 32.2% bf16 MFU | 124269 tok/s step 4996/19560 | loss 3.551431 (-0.60z)| norm 0.2751 (+0.75z)| lr 7.12e-04 | 4180.13 ms | 32.3% bf16 MFU | 124326 tok/s step 4997/19560 | loss 3.586931 (+0.23z)| norm 0.2522 (-0.43z)| lr 7.12e-04 | 4161.64 ms | 32.4% bf16 MFU | 124409 tok/s step 4998/19560 | loss 3.555845 (-0.49z)| norm 0.2736 (+0.67z)| lr 7.12e-04 | 4192.83 ms | 32.2% bf16 MFU | 124441 tok/s step 4999/19560 | loss 3.680467 (+2.43z)| norm 0.2661 (+0.28z)| lr 7.12e-04 | 4298.50 ms | 31.4% bf16 MFU | 124317 tok/s step 5000/19560 | loss 3.528572 (-1.14z)| norm 0.2807 (+1.02z)| lr 7.12e-04 | 4174.80 ms | 32.3% bf16 MFU | 124381 tok/s val loss 3.539801 evaluating HellaSwag: 0/1256 evaluating 
HellaSwag: 
1250/1256 HellaSwag: 2757/10042 = 0.274547 Writing checkpoint at step 5000 Writing model to log124M/model_00005000.bin Writing state to log124M/state_00005000_00000.bin step 5001/19560 | loss 3.579327 (+0.06z)| norm 0.2735 (+0.64z)| lr 7.12e-04 | 4654.69 ms | 29.0% bf16 MFU | 123793 tok/s step 5002/19560 | loss 3.576760 (-0.01z)| norm 0.2612 (+0.00z)| lr 7.12e-04 | 4225.15 ms | 32.0% bf16 MFU | 123808 tok/s step 5003/19560 | loss 3.608395 (+0.74z)| norm 0.2382 (-1.18z)| lr 7.12e-04 | 4602.89 ms | 29.3% bf16 MFU | 123313 tok/s step 5004/19560 | loss 3.543381 (-0.80z)| norm 0.2511 (-0.52z)| lr 7.12e-04 | 4213.27 ms | 32.0% bf16 MFU | 123369 tok/s step 5005/19560 | loss 3.586375 (+0.22z)| norm 0.2715 (+0.52z)| lr 7.12e-04 | 4405.05 ms | 30.7% bf16 MFU | 123152 tok/s step 5006/19560 | loss 3.525840 (-1.19z)| norm 0.2605 (-0.05z)| lr 7.12e-04 | 4221.38 ms | 32.0% bf16 MFU | 123204 tok/s step 5007/19560 | loss 3.559880 (-0.39z)| norm 0.2654 (+0.20z)| lr 7.12e-04 | 4165.89 ms | 32.4% bf16 MFU | 123336 tok/s step 5008/19560 | loss 3.599133 (+0.53z)| norm 0.2607 (-0.05z)| lr 7.11e-04 | 4629.47 ms | 29.2% bf16 MFU | 122832 tok/s step 5009/19560 | loss 3.565617 (-0.26z)| norm 0.2627 (+0.05z)| lr 7.11e-04 | 4167.04 ms | 32.4% bf16 MFU | 122981 tok/s step 5010/19560 | loss 3.501582 (-1.75z)| norm 0.2720 (+0.54z)| lr 7.11e-04 | 4396.28 ms | 30.7% bf16 
MFU | 122795 tok/s step 5011/19560 | loss 3.579702 (+0.09z)| norm 0.2481 (-0.70z)| lr 7.11e-04 | 4380.33 ms | 30.8% bf16 MFU | 122640 tok/s step 5012/19560 | loss 3.570136 (-0.14z)| norm 0.2407 (-1.07z)| lr 7.11e-04 | 4157.12 ms | 32.5% bf16 MFU | 122814 tok/s step 5013/19560 | loss 3.564162 (-0.28z)| norm 0.2421 (-1.01z)| lr 7.11e-04 | 4287.80 ms | 31.5% bf16 MFU | 122787 tok/s step 5014/19560 | loss 3.620401 (+1.03z)| norm 0.2608 (-0.03z)| lr 7.11e-04 | 4362.23 ms | 31.0% bf16 MFU | 122657 tok/s step 5015/19560 | loss 3.548593 (-0.66z)| norm 0.2361 (-1.34z)| lr 7.11e-04 | 4172.31 ms | 32.4% bf16 MFU | 122807 tok/s step 5016/19560 | loss 3.552080 (-0.56z)| norm 0.2621 (+0.02z)| lr 7.11e-04 | 4177.45 ms | 32.3% bf16 MFU | 122942 tok/s step 5017/19560 | loss 3.563809 (-0.29z)| norm 0.2783 (+0.86z)| lr 7.11e-04 | 4158.76 ms | 32.5% bf16 MFU | 123098 tok/s step 5018/19560 | loss 3.592167 (+0.38z)| norm 0.2691 (+0.36z)| lr 7.11e-04 | 4312.81 ms | 31.3% bf16 MFU | 123022 tok/s step 5019/19560 | loss 3.585313 (+0.21z)| norm 0.2978 (+1.86z)| lr 7.11e-04 | 4231.55 ms | 31.9% bf16 MFU | 123066 tok/s step 5020/19560 | loss 3.568797 (-0.17z)| norm 0.2565 (-0.33z)| lr 7.11e-04 | 4175.13 ms | 32.3% bf16 MFU | 123191 tok/s step 5021/19560 | loss 3.476868 (-2.28z)| norm 0.2532 (-0.50z)| lr 7.11e-04 | 4342.28 ms | 31.1% bf16 MFU | 123068 tok/s step 5022/19560 | loss 3.541941 (-0.76z)| norm 0.2548 (-0.40z)| lr 7.11e-04 | 4174.83 ms | 32.3% bf16 MFU | 123194 tok/s step 5023/19560 | loss 3.564184 (-0.24z)| norm 0.2790 (+0.92z)| lr 7.11e-04 | 4656.96 ms | 29.0% bf16 MFU | 122664 tok/s step 5024/19560 | loss 3.536073 (-0.89z)| norm 0.3031 (+2.21z)| lr 7.11e-04 | 4178.06 ms | 32.3% bf16 MFU | 122805 tok/s step 5025/19560 | loss 3.612261 (+0.85z)| norm 0.2875 (+1.35z)| lr 7.11e-04 | 4204.74 ms | 32.1% bf16 MFU | 122899 tok/s step 5026/19560 | loss 3.555917 (-0.44z)| norm 0.2492 (-0.70z)| lr 7.11e-04 | 4311.79 ms | 31.3% bf16 MFU | 122834 tok/s step 5027/19560 | loss 3.544953 (-0.69z)| 
norm 0.2351 (-1.44z)| lr 7.11e-04 | 4175.41 ms | 32.3% bf16 MFU | 122970 tok/s step 5028/19560 | loss 3.549670 (-0.57z)| norm 0.2576 (-0.21z)| lr 7.11e-04 | 4177.04 ms | 32.3% bf16 MFU | 123098 tok/s step 5029/19560 | loss 3.598310 (+0.57z)| norm 0.2579 (-0.19z)| lr 7.11e-04 | 4209.23 ms | 32.1% bf16 MFU | 123171 tok/s step 5030/19560 | loss 3.627916 (+1.25z)| norm 0.2581 (-0.16z)| lr 7.11e-04 | 4472.01 ms | 30.2% bf16 MFU | 122874 tok/s step 5031/19560 | loss 3.558422 (-0.37z)| norm 0.2527 (-0.47z)| lr 7.10e-04 | 4273.05 ms | 31.6% bf16 MFU | 122865 tok/s step 5032/19560 | loss 3.527321 (-1.08z)| norm 0.2659 (+0.29z)| lr 7.10e-04 | 4183.10 ms | 32.3% bf16 MFU | 122988 tok/s step 5033/19560 | loss 3.509570 (-1.47z)| norm 0.2186 (-2.33z)| lr 7.10e-04 | 4179.58 ms | 32.3% bf16 MFU | 123111 tok/s step 5034/19560 | loss 3.561541 (-0.27z)| norm 0.2851 (+1.36z)| lr 7.10e-04 | 4174.35 ms | 32.3% bf16 MFU | 123235 tok/s step 5035/19560 | loss 3.564216 (-0.21z)| norm 0.2546 (-0.34z)| lr 7.10e-04 | 4180.23 ms | 32.3% bf16 MFU | 123345 tok/s step 5036/19560 | loss 3.607960 (+0.79z)| norm 0.2472 (-0.76z)| lr 7.10e-04 | 4183.01 ms | 32.3% bf16 MFU | 123444 tok/s step 5037/19560 | loss 3.600864 (+0.62z)| norm 0.2627 (+0.11z)| lr 7.10e-04 | 4188.63 ms | 32.2% bf16 MFU | 123531 tok/s step 5038/19560 | loss 3.583909 (+0.26z)| norm 0.2710 (+0.57z)| lr 7.10e-04 | 4194.61 ms | 32.2% bf16 MFU | 123604 tok/s step 5039/19560 | loss 3.590487 (+0.42z)| norm 0.2737 (+0.71z)| lr 7.10e-04 | 4215.40 ms | 32.0% bf16 MFU | 123642 tok/s step 5040/19560 | loss 3.521535 (-1.23z)| norm 0.2755 (+0.81z)| lr 7.10e-04 | 4192.10 ms | 32.2% bf16 MFU | 123713 tok/s step 5041/19560 | loss 3.560291 (-0.29z)| norm 0.2650 (+0.23z)| lr 7.10e-04 | 4173.86 ms | 32.3% bf16 MFU | 123808 tok/s step 5042/19560 | loss 3.592974 (+0.49z)| norm 0.2501 (-0.59z)| lr 7.10e-04 | 4183.91 ms | 32.3% bf16 MFU | 123883 tok/s step 5043/19560 | loss 3.571457 (-0.03z)| norm 0.2598 (-0.06z)| lr 7.10e-04 | 4192.97 ms | 32.2% bf16 MFU 
| 123941 tok/s step 5044/19560 | loss 3.512445 (-1.43z)| norm 0.2505 (-0.58z)| lr 7.10e-04 | 4196.20 ms | 32.2% bf16 MFU | 123991 tok/s step 5045/19560 | loss 3.595677 (+0.57z)| norm 0.2472 (-0.77z)| lr 7.10e-04 | 4175.98 ms | 32.3% bf16 MFU | 124069 tok/s step 5046/19560 | loss 3.571800 (+0.02z)| norm 0.2418 (-1.07z)| lr 7.10e-04 | 4187.68 ms | 32.2% bf16 MFU | 124126 tok/s step 5047/19560 | loss 3.548515 (-0.55z)| norm 0.2542 (-0.39z)| lr 7.10e-04 | 4201.60 ms | 32.1% bf16 MFU | 124158 tok/s step 5048/19560 | loss 3.548420 (-0.54z)| norm 0.2388 (-1.25z)| lr 7.10e-04 | 4173.35 ms | 32.4% bf16 MFU | 124232 tok/s step 5049/19560 | loss 3.539259 (-0.76z)| norm 0.2456 (-0.85z)| lr 7.10e-04 | 4253.55 ms | 31.7% bf16 MFU | 124183 tok/s step 5050/19560 | loss 3.523732 (-1.13z)| norm 0.2670 (+0.36z)| lr 7.10e-04 | 4224.32 ms | 32.0% bf16 MFU | 124180 tok/s step 5051/19560 | loss 3.621341 (+1.24z)| norm 0.2718 (+0.63z)| lr 7.10e-04 | 4174.17 ms | 32.3% bf16 MFU | 124251 tok/s step 5052/19560 | loss 3.551771 (-0.46z)| norm 0.2392 (-1.22z)| lr 7.10e-04 | 4261.57 ms | 31.7% bf16 MFU | 124190 tok/s step 5053/19560 | loss 3.537141 (-0.80z)| norm 0.2373 (-1.32z)| lr 7.10e-04 | 4182.90 ms | 32.3% bf16 MFU | 124247 tok/s step 5054/19560 | loss 3.484773 (-2.04z)| norm 0.2552 (-0.30z)| lr 7.10e-04 | 4176.42 ms | 32.3% bf16 MFU | 124312 tok/s step 5055/19560 | loss 3.590167 (+0.48z)| norm 0.2323 (-1.59z)| lr 7.09e-04 | 4170.03 ms | 32.4% bf16 MFU | 124382 tok/s step 5056/19560 | loss 3.532568 (-0.89z)| norm 0.2377 (-1.29z)| lr 7.09e-04 | 4174.48 ms | 32.3% bf16 MFU | 124443 tok/s step 5057/19560 | loss 3.554219 (-0.37z)| norm 0.2284 (-1.80z)| lr 7.09e-04 | 4187.07 ms | 32.2% bf16 MFU | 124482 tok/s step 5058/19560 | loss 3.598666 (+0.68z)| norm 0.2491 (-0.63z)| lr 7.09e-04 | 4201.98 ms | 32.1% bf16 MFU | 124496 tok/s step 5059/19560 | loss 3.528743 (-0.97z)| norm 0.2614 (+0.06z)| lr 7.09e-04 | 4762.90 ms | 28.3% bf16 MFU | 123775 tok/s step 5060/19560 | loss 3.592657 (+0.59z)| norm 
0.2385 (-1.21z)| lr 7.09e-04 | 4266.96 ms | 31.6% bf16 MFU | 123730 tok/s step 5061/19560 | loss 3.558647 (-0.24z)| norm 0.2685 (+0.47z)| lr 7.09e-04 | 4173.15 ms | 32.4% bf16 MFU | 123825 tok/s step 5062/19560 | loss 3.592550 (+0.59z)| norm 0.2842 (+1.33z)| lr 7.09e-04 | 4174.77 ms | 32.3% bf16 MFU | 123913 tok/s step 5063/19560 | loss 3.516746 (-1.27z)| norm 0.2487 (-0.64z)| lr 7.09e-04 | 4331.67 ms | 31.2% bf16 MFU | 123769 tok/s step 5064/19560 | loss 3.602389 (+0.82z)| norm 0.2893 (+1.60z)| lr 7.09e-04 | 4654.70 ms | 29.0% bf16 MFU | 123213 tok/s step 5065/19560 | loss 3.579857 (+0.30z)| norm 0.2653 (+0.28z)| lr 7.09e-04 | 4889.19 ms | 27.6% bf16 MFU | 122414 tok/s step 5066/19560 | loss 3.518688 (-1.25z)| norm 0.2581 (-0.11z)| lr 7.09e-04 | 4862.57 ms | 27.8% bf16 MFU | 121684 tok/s step 5067/19560 | loss 3.589894 (+0.57z)| norm 0.2596 (-0.03z)| lr 7.09e-04 | 4536.06 ms | 29.8% bf16 MFU | 121379 tok/s step 5068/19560 | loss 3.532154 (-0.89z)| norm 0.2566 (-0.18z)| lr 7.09e-04 | 4428.68 ms | 30.5% bf16 MFU | 121229 tok/s step 5069/19560 | loss 3.521862 (-1.13z)| norm 0.2417 (-1.00z)| lr 7.09e-04 | 4432.83 ms | 30.5% bf16 MFU | 121082 tok/s step 5070/19560 | loss 3.512202 (-1.36z)| norm 0.2483 (-0.64z)| lr 7.09e-04 | 4175.84 ms | 32.3% bf16 MFU | 121305 tok/s step 5071/19560 | loss 3.482908 (-2.06z)| norm 0.2610 (+0.07z)| lr 7.09e-04 | 4664.21 ms | 28.9% bf16 MFU | 120860 tok/s step 5072/19560 | loss 3.511198 (-1.33z)| norm 0.2943 (+1.90z)| lr 7.09e-04 | 4206.99 ms | 32.1% bf16 MFU | 121048 tok/s step 5073/19560 | loss 3.559352 (-0.12z)| norm 0.2707 (+0.59z)| lr 7.09e-04 | 4194.65 ms | 32.2% bf16 MFU | 121245 tok/s step 5074/19560 | loss 3.480003 (-2.06z)| norm 0.2827 (+1.23z)| lr 7.09e-04 | 4354.23 ms | 31.0% bf16 MFU | 121204 tok/s step 5075/19560 | loss 3.524020 (-0.97z)| norm 0.2688 (+0.46z)| lr 7.09e-04 | 4191.85 ms | 32.2% bf16 MFU | 121397 tok/s step 5076/19560 | loss 3.462973 (-2.40z)| norm 0.2817 (+1.15z)| lr 7.09e-04 | 4245.73 ms | 31.8% bf16 MFU | 
121501 tok/s step 5077/19560 | loss 3.569058 (+0.15z)| norm 0.2785 (+0.96z)| lr 7.09e-04 | 4268.36 ms | 31.6% bf16 MFU | 121568 tok/s step 5078/19560 | loss 3.517951 (-1.08z)| norm 0.2635 (+0.12z)| lr 7.08e-04 | 4181.52 ms | 32.3% bf16 MFU | 121759 tok/s step 5079/19560 | loss 3.507076 (-1.32z)| norm 0.2676 (+0.34z)| lr 7.08e-04 | 4189.24 ms | 32.2% bf16 MFU | 121928 tok/s step 5080/19560 | loss 3.541228 (-0.50z)| norm 0.2603 (-0.08z)| lr 7.08e-04 | 4313.56 ms | 31.3% bf16 MFU | 121909 tok/s step 5081/19560 | loss 3.521036 (-0.97z)| norm 0.2532 (-0.49z)| lr 7.08e-04 | 4239.73 ms | 31.8% bf16 MFU | 121997 tok/s step 5082/19560 | loss 3.601748 (+0.98z)| norm 0.2431 (-1.04z)| lr 7.08e-04 | 4179.71 ms | 32.3% bf16 MFU | 122169 tok/s step 5083/19560 | loss 3.584798 (+0.56z)| norm 0.2317 (-1.66z)| lr 7.08e-04 | 4238.66 ms | 31.9% bf16 MFU | 122245 tok/s step 5084/19560 | loss 3.516819 (-1.08z)| norm 0.2600 (-0.04z)| lr 7.08e-04 | 4339.25 ms | 31.1% bf16 MFU | 122174 tok/s step 5085/19560 | loss 3.681990 (+2.79z)| norm 0.9652 (+10.85z)| lr 7.08e-04 | 4329.49 ms | 31.2% bf16 MFU | 122120 tok/s step 5086/19560 | loss 3.567994 (+0.12z)| norm 0.3464 (+1.23z)| lr 7.08e-04 | 4214.38 ms | 32.0% bf16 MFU | 122234 tok/s step 5087/19560 | loss 3.501936 (-1.41z)| norm 0.2937 (+0.42z)| lr 7.08e-04 | 4354.87 ms | 31.0% bf16 MFU | 122142 tok/s step 5088/19560 | loss 3.531621 (-0.70z)| norm 0.2919 (+0.39z)| lr 7.08e-04 | 4209.70 ms | 32.1% bf16 MFU | 122262 tok/s step 5089/19560 | loss 3.534663 (-0.62z)| norm 0.2507 (-0.25z)| lr 7.08e-04 | 4196.87 ms | 32.2% bf16 MFU | 122395 tok/s step 5090/19560 | loss 3.580835 (+0.48z)| norm 0.2809 (+0.22z)| lr 7.08e-04 | 4193.95 ms | 32.2% bf16 MFU | 122526 tok/s step 5091/19560 | loss 3.553251 (-0.17z)| norm 0.2782 (+0.17z)| lr 7.08e-04 | 4208.29 ms | 32.1% bf16 MFU | 122629 tok/s step 5092/19560 | loss 3.530198 (-0.71z)| norm 0.2631 (-0.06z)| lr 7.08e-04 | 4201.19 ms | 32.1% bf16 MFU | 122737 tok/s step 5093/19560 | loss 3.526860 (-0.78z)| norm 
0.2641 (-0.05z)| lr 7.08e-04 | 4269.14 ms | 31.6% bf16 MFU | 122741 tok/s step 5094/19560 | loss 3.494134 (-1.53z)| norm 0.2351 (-0.50z)| lr 7.08e-04 | 4177.84 ms | 32.3% bf16 MFU | 122878 tok/s step 5095/19560 | loss 3.522259 (-0.86z)| norm 0.2272 (-0.62z)| lr 7.08e-04 | 4590.77 ms | 29.4% bf16 MFU | 122445 tok/s step 5096/19560 | loss 3.581986 (+0.57z)| norm 0.2328 (-0.53z)| lr 7.08e-04 | 4199.53 ms | 32.2% bf16 MFU | 122565 tok/s step 5097/19560 | loss 3.512787 (-1.13z)| norm 0.2346 (-0.50z)| lr 7.08e-04 | 4202.08 ms | 32.1% bf16 MFU | 122675 tok/s step 5098/19560 | loss 3.605626 (+1.26z)| norm 0.3060 (+0.60z)| lr 7.08e-04 | 4194.62 ms | 32.2% bf16 MFU | 122791 tok/s step 5099/19560 | loss 3.467207 (-2.23z)| norm 0.2834 (+0.25z)| lr 7.08e-04 | 4194.30 ms | 32.2% bf16 MFU | 122901 tok/s step 5100/19560 | loss 3.592475 (+0.94z)| norm 0.2653 (-0.03z)| lr 7.08e-04 | 4204.97 ms | 32.1% bf16 MFU | 122990 tok/s step 5101/19560 | loss 3.508573 (-1.20z)| norm 0.2557 (-0.18z)| lr 7.07e-04 | 4177.78 ms | 32.3% bf16 MFU | 123115 tok/s step 5102/19560 | loss 3.567123 (+0.29z)| norm 0.2728 (+0.09z)| lr 7.07e-04 | 4203.10 ms | 32.1% bf16 MFU | 123197 tok/s step 5103/19560 | loss 3.503190 (-1.32z)| norm 0.2551 (-0.18z)| lr 7.07e-04 | 4179.46 ms | 32.3% bf16 MFU | 123309 tok/s step 5104/19560 | loss 3.532958 (-0.56z)| norm 0.2772 (+0.16z)| lr 7.07e-04 | 4204.87 ms | 32.1% bf16 MFU | 123378 tok/s step 5105/19560 | loss 3.520237 (-0.87z)| norm 0.2628 (-0.06z)| lr 7.07e-04 | 4199.56 ms | 32.2% bf16 MFU | 123451 tok/s step 5106/19560 | loss 3.480809 (-1.83z)| norm 0.2380 (-0.44z)| lr 7.07e-04 | 4200.27 ms | 32.1% bf16 MFU | 123520 tok/s step 5107/19560 | loss 3.533767 (-0.50z)| norm 0.2709 (+0.07z)| lr 7.07e-04 | 4250.96 ms | 31.8% bf16 MFU | 123510 tok/s step 5108/19560 | loss 3.495259 (-1.47z)| norm 0.2470 (-0.30z)| lr 7.07e-04 | 4190.16 ms | 32.2% bf16 MFU | 123591 tok/s step 5109/19560 | loss 3.528095 (-0.65z)| norm 0.2451 (-0.33z)| lr 7.07e-04 | 4192.08 ms | 32.2% bf16 MFU | 
123665 tok/s step 5110/19560 | loss 3.551428 (-0.07z)| norm 0.2460 (-0.32z)| lr 7.07e-04 | 4223.85 ms | 32.0% bf16 MFU | 123688 tok/s step 5111/19560 | loss 3.506110 (-1.20z)| norm 0.2582 (-0.13z)| lr 7.07e-04 | 4184.29 ms | 32.3% bf16 MFU | 123768 tok/s step 5112/19560 | loss 3.564721 (+0.27z)| norm 0.2569 (-0.15z)| lr 7.07e-04 | 4274.62 ms | 31.6% bf16 MFU | 123712 tok/s step 5113/19560 | loss 3.552943 (-0.01z)| norm 0.2694 (+0.04z)| lr 7.07e-04 | 4191.95 ms | 32.2% bf16 MFU | 123780 tok/s step 5114/19560 | loss 3.570839 (+0.43z)| norm 0.2349 (-0.48z)| lr 7.07e-04 | 4181.81 ms | 32.3% bf16 MFU | 123860 tok/s step 5115/19560 | loss 3.519505 (-0.87z)| norm 0.2703 (+0.07z)| lr 7.07e-04 | 4189.68 ms | 32.2% bf16 MFU | 123924 tok/s step 5116/19560 | loss 3.590737 (+0.94z)| norm 0.3529 (+1.32z)| lr 7.07e-04 | 4186.06 ms | 32.3% bf16 MFU | 123990 tok/s step 5117/19560 | loss 3.527121 (-0.67z)| norm 0.2543 (-0.19z)| lr 7.07e-04 | 4203.28 ms | 32.1% bf16 MFU | 124027 tok/s step 5118/19560 | loss 3.560671 (+0.18z)| norm 0.2631 (-0.05z)| lr 7.07e-04 | 4173.27 ms | 32.4% bf16 MFU | 124107 tok/s step 5119/19560 | loss 3.509672 (-1.10z)| norm 0.2236 (-0.66z)| lr 7.07e-04 | 4204.54 ms | 32.1% bf16 MFU | 124137 tok/s step 5120/19560 | loss 3.545302 (-0.19z)| norm 0.2622 (-0.07z)| lr 7.07e-04 | 4184.71 ms | 32.3% bf16 MFU | 124194 tok/s step 5121/19560 | loss 3.588034 (+0.90z)| norm 0.2468 (-0.30z)| lr 7.07e-04 | 4243.35 ms | 31.8% bf16 MFU | 124162 tok/s step 5122/19560 | loss 3.556621 (+0.10z)| norm 0.3093 (+0.65z)| lr 7.07e-04 | 4224.11 ms | 32.0% bf16 MFU | 124160 tok/s step 5123/19560 | loss 3.555835 (+0.08z)| norm 0.2478 (-0.29z)| lr 7.07e-04 | 4180.43 ms | 32.3% bf16 MFU | 124223 tok/s step 5124/19560 | loss 3.501267 (-1.29z)| norm 0.2671 (+0.01z)| lr 7.06e-04 | 4194.97 ms | 32.2% bf16 MFU | 124261 tok/s step 5125/19560 | loss 3.538364 (-0.34z)| norm 0.2848 (+0.27z)| lr 7.06e-04 | 4494.20 ms | 30.0% bf16 MFU | 123881 tok/s step 5126/19560 | loss 3.520533 (-0.79z)| norm 
0.2439 (-0.35z)| lr 7.06e-04 | 4182.44 ms | 32.3% bf16 MFU | 123954 tok/s step 5127/19560 | loss 3.599974 (+1.29z)| norm 0.2775 (+0.16z)| lr 7.06e-04 | 4335.50 ms | 31.1% bf16 MFU | 123803 tok/s step 5128/19560 | loss 3.556722 (+0.14z)| norm 0.2475 (-0.29z)| lr 7.06e-04 | 4203.02 ms | 32.1% bf16 MFU | 123850 tok/s step 5129/19560 | loss 3.555741 (+0.12z)| norm 0.2807 (+0.22z)| lr 7.06e-04 | 4195.83 ms | 32.2% bf16 MFU | 123905 tok/s step 5130/19560 | loss 3.523494 (-0.72z)| norm 0.2962 (+0.45z)| lr 7.06e-04 | 4191.14 ms | 32.2% bf16 MFU | 123965 tok/s step 5131/19560 | loss 3.504658 (-1.20z)| norm 0.2539 (-0.20z)| lr 7.06e-04 | 4191.14 ms | 32.2% bf16 MFU | 124021 tok/s step 5132/19560 | loss 3.557933 (+0.21z)| norm 0.2907 (+0.36z)| lr 7.06e-04 | 4189.14 ms | 32.2% bf16 MFU | 124078 tok/s step 5133/19560 | loss 3.506627 (-1.13z)| norm 0.2726 (+0.08z)| lr 7.06e-04 | 4207.79 ms | 32.1% bf16 MFU | 124104 tok/s step 5134/19560 | loss 3.582299 (+0.86z)| norm 0.2693 (+0.03z)| lr 7.06e-04 | 4183.49 ms | 32.3% bf16 MFU | 124165 tok/s step 5135/19560 | loss 3.555562 (+0.15z)| norm 0.2835 (+0.24z)| lr 7.06e-04 | 4188.81 ms | 32.2% bf16 MFU | 124215 tok/s step 5136/19560 | loss 3.554839 (+0.14z)| norm 0.3606 (+1.40z)| lr 7.06e-04 | 4199.50 ms | 32.2% bf16 MFU | 124246 tok/s step 5137/19560 | loss 3.603471 (+1.42z)| norm 0.2760 (+0.11z)| lr 7.06e-04 | 4189.95 ms | 32.2% bf16 MFU | 124290 tok/s step 5138/19560 | loss 3.547676 (-0.06z)| norm 0.2964 (+0.42z)| lr 7.06e-04 | 4199.22 ms | 32.2% bf16 MFU | 124319 tok/s step 5139/19560 | loss 3.531202 (-0.49z)| norm 0.2735 (+0.07z)| lr 7.06e-04 | 4205.99 ms | 32.1% bf16 MFU | 124335 tok/s step 5140/19560 | loss 3.477309 (-1.88z)| norm 0.2398 (-0.44z)| lr 7.06e-04 | 4198.07 ms | 32.2% bf16 MFU | 124363 tok/s step 5141/19560 | loss 3.546746 (-0.05z)| norm 0.2777 (+0.13z)| lr 7.06e-04 | 4178.78 ms | 32.3% bf16 MFU | 124418 tok/s step 5142/19560 | loss 3.550062 (+0.05z)| norm 0.2490 (-0.30z)| lr 7.06e-04 | 4194.35 ms | 32.2% bf16 MFU | 
124447 tok/s step 5143/19560 | loss 3.557216 (+0.24z)| norm 0.2749 (+0.08z)| lr 7.06e-04 | 4182.78 ms | 32.3% bf16 MFU | 124492 tok/s step 5144/19560 | loss 3.544874 (-0.09z)| norm 0.2579 (-0.17z)| lr 7.06e-04 | 4195.24 ms | 32.2% bf16 MFU | 124516 tok/s step 5145/19560 | loss 3.506653 (-1.09z)| norm 0.2435 (-0.39z)| lr 7.06e-04 | 4290.46 ms | 31.5% bf16 MFU | 124400 tok/s step 5146/19560 | loss 3.530667 (-0.44z)| norm 0.2547 (-0.21z)| lr 7.06e-04 | 4193.77 ms | 32.2% bf16 MFU | 124431 tok/s step 5147/19560 | loss 3.589019 (+1.11z)| norm 0.2474 (-0.32z)| lr 7.05e-04 | 4194.70 ms | 32.2% bf16 MFU | 124459 tok/s step 5148/19560 | loss 3.478744 (-1.79z)| norm 0.2625 (-0.09z)| lr 7.05e-04 | 4199.34 ms | 32.2% bf16 MFU | 124478 tok/s step 5149/19560 | loss 3.568307 (+0.56z)| norm 0.2516 (-0.26z)| lr 7.05e-04 | 4194.73 ms | 32.2% bf16 MFU | 124504 tok/s step 5150/19560 | loss 3.520580 (-0.71z)| norm 0.2612 (-0.11z)| lr 7.05e-04 | 4193.64 ms | 32.2% bf16 MFU | 124529 tok/s step 5151/19560 | loss 3.472374 (-1.94z)| norm 0.2488 (-0.30z)| lr 7.05e-04 | 4178.62 ms | 32.3% bf16 MFU | 124576 tok/s step 5152/19560 | loss 3.601615 (+1.42z)| norm 0.2383 (-0.45z)| lr 7.05e-04 | 4186.64 ms | 32.2% bf16 MFU | 124609 tok/s step 5153/19560 | loss 3.532782 (-0.36z)| norm 0.2544 (-0.20z)| lr 7.05e-04 | 4193.40 ms | 32.2% bf16 MFU | 124630 tok/s step 5154/19560 | loss 3.488284 (-1.50z)| norm 0.2498 (-0.27z)| lr 7.05e-04 | 4185.95 ms | 32.3% bf16 MFU | 124661 tok/s step 5155/19560 | loss 3.500703 (-1.16z)| norm 0.2289 (-0.59z)| lr 7.05e-04 | 4220.27 ms | 32.0% bf16 MFU | 124639 tok/s step 5156/19560 | loss 3.549605 (+0.11z)| norm 0.2314 (-0.54z)| lr 7.05e-04 | 4187.39 ms | 32.2% bf16 MFU | 124668 tok/s step 5157/19560 | loss 3.544981 (-0.00z)| norm 0.2407 (-0.40z)| lr 7.05e-04 | 4177.77 ms | 32.3% bf16 MFU | 124709 tok/s step 5158/19560 | loss 3.595257 (+1.33z)| norm 0.2697 (+0.04z)| lr 7.05e-04 | 4184.27 ms | 32.3% bf16 MFU | 124739 tok/s step 5159/19560 | loss 3.615733 (+1.84z)| norm 
0.2785 (+0.17z)| lr 7.05e-04 | 4181.61 ms | 32.3% bf16 MFU | 124771 tok/s step 5160/19560 | loss 3.484066 (-1.57z)| norm 0.2876 (+0.30z)| lr 7.05e-04 | 4270.87 ms | 31.6% bf16 MFU | 124670 tok/s step 5161/19560 | loss 3.588006 (+1.10z)| norm 0.3058 (+0.57z)| lr 7.05e-04 | 4575.37 ms | 29.5% bf16 MFU | 124166 tok/s step 5162/19560 | loss 3.470236 (-1.90z)| norm 0.3022 (+0.51z)| lr 7.05e-04 | 4190.40 ms | 32.2% bf16 MFU | 124214 tok/s step 5163/19560 | loss 3.550786 (+0.15z)| norm 0.2532 (-0.23z)| lr 7.05e-04 | 4168.87 ms | 32.4% bf16 MFU | 124291 tok/s step 5164/19560 | loss 3.472672 (-1.81z)| norm 0.2636 (-0.08z)| lr 7.05e-04 | 4182.36 ms | 32.3% bf16 MFU | 124344 tok/s step 5165/19560 | loss 3.540037 (-0.08z)| norm 0.2702 (+0.02z)| lr 7.05e-04 | 4230.09 ms | 31.9% bf16 MFU | 124324 tok/s step 5166/19560 | loss 3.614010 (+1.79z)| norm 0.2738 (+0.08z)| lr 7.05e-04 | 4192.42 ms | 32.2% bf16 MFU | 124361 tok/s step 5167/19560 | loss 3.518361 (-0.63z)| norm 0.2793 (+0.16z)| lr 7.05e-04 | 4169.07 ms | 32.4% bf16 MFU | 124431 tok/s step 5168/19560 | loss 3.502428 (-1.03z)| norm 0.2474 (-0.32z)| lr 7.05e-04 | 4200.80 ms | 32.1% bf16 MFU | 124449 tok/s step 5169/19560 | loss 3.552345 (+0.25z)| norm 0.2578 (-0.16z)| lr 7.05e-04 | 4270.90 ms | 31.6% bf16 MFU | 124365 tok/s step 5170/19560 | loss 3.534780 (-0.19z)| norm 0.2535 (-0.23z)| lr 7.04e-04 | 4177.93 ms | 32.3% bf16 MFU | 124421 tok/s step 5171/19560 | loss 3.599645 (+1.46z)| norm 0.2709 (+0.04z)| lr 7.04e-04 | 4182.79 ms | 32.3% bf16 MFU | 124467 tok/s step 5172/19560 | loss 3.535465 (-0.18z)| norm 0.2818 (+0.20z)| lr 7.04e-04 | 4180.90 ms | 32.3% bf16 MFU | 124514 tok/s step 5173/19560 | loss 3.553676 (+0.29z)| norm 0.2459 (-0.35z)| lr 7.04e-04 | 4256.83 ms | 31.7% bf16 MFU | 124446 tok/s step 5174/19560 | loss 3.504803 (-0.95z)| norm 0.2308 (-0.57z)| lr 7.04e-04 | 4195.87 ms | 32.2% bf16 MFU | 124472 tok/s step 5175/19560 | loss 3.491227 (-1.28z)| norm 0.2671 (-0.03z)| lr 7.04e-04 | 4186.10 ms | 32.3% bf16 MFU | 
124510 tok/s step 5176/19560 | loss 3.501197 (-1.01z)| norm 0.2698 (+0.01z)| lr 7.04e-04 | 4180.05 ms | 32.3% bf16 MFU | 124556 tok/s step 5177/19560 | loss 3.522861 (-0.46z)| norm 0.2789 (+0.15z)| lr 7.04e-04 | 4189.44 ms | 32.2% bf16 MFU | 124586 tok/s step 5178/19560 | loss 3.517329 (-0.59z)| norm 0.2461 (-0.35z)| lr 7.04e-04 | 4223.53 ms | 32.0% bf16 MFU | 124563 tok/s step 5179/19560 | loss 3.636101 (+2.40z)| norm 0.2637 (-0.08z)| lr 7.04e-04 | 4174.93 ms | 32.3% bf16 MFU | 124614 tok/s step 5180/19560 | loss 3.519554 (-0.53z)| norm 0.2660 (-0.05z)| lr 7.04e-04 | 4179.75 ms | 32.3% bf16 MFU | 124655 tok/s step 5181/19560 | loss 3.506800 (-0.84z)| norm 0.2547 (-0.22z)| lr 7.04e-04 | 4186.82 ms | 32.2% bf16 MFU | 124683 tok/s step 5182/19560 | loss 3.589630 (+1.22z)| norm 0.2795 (+0.15z)| lr 7.04e-04 | 4196.19 ms | 32.2% bf16 MFU | 124696 tok/s step 5183/19560 | loss 3.525981 (-0.37z)| norm 0.2771 (+0.11z)| lr 7.04e-04 | 4281.12 ms | 31.5% bf16 MFU | 124585 tok/s step 5184/19560 | loss 3.566023 (+0.63z)| norm 0.2641 (-0.09z)| lr 7.04e-04 | 4180.58 ms | 32.3% bf16 MFU | 124626 tok/s step 5185/19560 | loss 3.551554 (+0.27z)| norm 0.2704 (-0.00z)| lr 7.04e-04 | 4268.01 ms | 31.6% bf16 MFU | 124537 tok/s step 5186/19560 | loss 3.550944 (+0.26z)| norm 0.2503 (-0.31z)| lr 7.04e-04 | 4177.22 ms | 32.3% bf16 MFU | 124586 tok/s step 5187/19560 | loss 3.704070 (+3.88z)| norm 0.2988 (+0.43z)| lr 7.04e-04 | 4184.23 ms | 32.3% bf16 MFU | 124621 tok/s step 5188/19560 | loss 3.526551 (-0.36z)| norm 0.2654 (-0.09z)| lr 7.04e-04 | 4171.25 ms | 32.4% bf16 MFU | 124675 tok/s step 5189/19560 | loss 3.540325 (-0.02z)| norm 0.2454 (-0.39z)| lr 7.04e-04 | 4191.98 ms | 32.2% bf16 MFU | 124695 tok/s step 5190/19560 | loss 3.523180 (-0.43z)| norm 0.2383 (-0.49z)| lr 7.04e-04 | 4189.68 ms | 32.2% bf16 MFU | 124717 tok/s step 5191/19560 | loss 3.461251 (-1.89z)| norm 0.2723 (+0.03z)| lr 7.04e-04 | 4302.65 ms | 31.4% bf16 MFU | 124574 tok/s step 5192/19560 | loss 3.509156 (-0.73z)| norm 
0.2556 (-0.22z)| lr 7.04e-04 | 4186.15 ms | 32.3% bf16 MFU | 124607 tok/s step 5193/19560 | loss 3.550132 (+0.26z)| norm 0.2621 (-0.13z)| lr 7.03e-04 | 4182.59 ms | 32.3% bf16 MFU | 124644 tok/s step 5194/19560 | loss 3.554031 (+0.35z)| norm 0.2457 (-0.37z)| lr 7.03e-04 | 4396.77 ms | 30.7% bf16 MFU | 124374 tok/s step 5195/19560 | loss 3.545343 (+0.15z)| norm 0.2343 (-0.54z)| lr 7.03e-04 | 4183.53 ms | 32.3% bf16 MFU | 124422 tok/s step 5196/19560 | loss 3.548657 (+0.22z)| norm 0.2527 (-0.26z)| lr 7.03e-04 | 4277.72 ms | 31.6% bf16 MFU | 124329 tok/s step 5197/19560 | loss 3.574764 (+0.85z)| norm 0.2706 (+0.00z)| lr 7.03e-04 | 4208.78 ms | 32.1% bf16 MFU | 124341 tok/s step 5198/19560 | loss 3.486380 (-1.28z)| norm 0.2695 (-0.01z)| lr 7.03e-04 | 4180.52 ms | 32.3% bf16 MFU | 124394 tok/s step 5199/19560 | loss 3.488379 (-1.24z)| norm 0.2622 (-0.13z)| lr 7.03e-04 | 4179.77 ms | 32.3% bf16 MFU | 124446 tok/s step 5200/19560 | loss 3.556748 (+0.40z)| norm 0.2580 (-0.18z)| lr 7.03e-04 | 4249.56 ms | 31.8% bf16 MFU | 124393 tok/s step 5201/19560 | loss 3.665463 (+2.91z)| norm 0.2631 (-0.11z)| lr 7.03e-04 | 4199.23 ms | 32.2% bf16 MFU | 124416 tok/s step 5202/19560 | loss 3.497726 (-1.02z)| norm 0.2754 (+0.08z)| lr 7.03e-04 | 4187.88 ms | 32.2% bf16 MFU | 124455 tok/s step 5203/19560 | loss 3.504268 (-0.86z)| norm 0.2719 (+0.03z)| lr 7.03e-04 | 4200.18 ms | 32.1% bf16 MFU | 124473 tok/s step 5204/19560 | loss 3.546666 (+0.12z)| norm 0.2435 (-0.40z)| lr 7.03e-04 | 4182.12 ms | 32.3% bf16 MFU | 124518 tok/s step 5205/19560 | loss 3.552253 (+0.26z)| norm 0.2416 (-0.42z)| lr 7.03e-04 | 4195.46 ms | 32.2% bf16 MFU | 124540 tok/s step 5206/19560 | loss 3.570106 (+0.67z)| norm 0.2419 (-0.42z)| lr 7.03e-04 | 4191.87 ms | 32.2% bf16 MFU | 124567 tok/s step 5207/19560 | loss 3.571157 (+0.69z)| norm 0.2289 (-0.61z)| lr 7.03e-04 | 4173.75 ms | 32.3% bf16 MFU | 124619 tok/s step 5208/19560 | loss 3.491145 (-1.20z)| norm 0.2576 (-0.17z)| lr 7.03e-04 | 4204.56 ms | 32.1% bf16 MFU | 
124623 tok/s step 5209/19560 | loss 3.575892 (+0.79z)| norm 0.2984 (+0.44z)| lr 7.03e-04 | 4195.20 ms | 32.2% bf16 MFU | 124640 tok/s step 5210/19560 | loss 3.520279 (-0.51z)| norm 0.2744 (+0.07z)| lr 7.03e-04 | 4185.25 ms | 32.3% bf16 MFU | 124672 tok/s step 5211/19560 | loss 3.482843 (-1.37z)| norm 0.2430 (-0.40z)| lr 7.03e-04 | 4263.15 ms | 31.7% bf16 MFU | 124587 tok/s step 5212/19560 | loss 3.679821 (+3.13z)| norm 0.2945 (+0.37z)| lr 7.03e-04 | 4296.31 ms | 31.4% bf16 MFU | 124460 tok/s step 5213/19560 | loss 3.529937 (-0.26z)| norm 0.2737 (+0.40z)| lr 7.03e-04 | 4176.10 ms | 32.3% bf16 MFU | 124514 tok/s step 5214/19560 | loss 3.555285 (+0.34z)| norm 0.3416 (+3.35z)| lr 7.03e-04 | 4176.76 ms | 32.3% bf16 MFU | 124564 tok/s step 5215/19560 | loss 3.544360 (+0.08z)| norm 0.2605 (-0.16z)| lr 7.02e-04 | 4189.64 ms | 32.2% bf16 MFU | 124593 tok/s step 5216/19560 | loss 3.517640 (-0.56z)| norm 0.2574 (-0.29z)| lr 7.02e-04 | 4180.76 ms | 32.3% bf16 MFU | 124634 tok/s step 5217/19560 | loss 3.586859 (+1.08z)| norm 0.2398 (-1.06z)| lr 7.02e-04 | 4199.07 ms | 32.2% bf16 MFU | 124645 tok/s step 5218/19560 | loss 3.495221 (-1.08z)| norm 0.3014 (+1.62z)| lr 7.02e-04 | 4181.37 ms | 32.3% bf16 MFU | 124682 tok/s step 5219/19560 | loss 3.602663 (+1.44z)| norm 0.2663 (+0.10z)| lr 7.02e-04 | 4194.58 ms | 32.2% bf16 MFU | 124698 tok/s step 5220/19560 | loss 3.547005 (+0.13z)| norm 0.2759 (+0.52z)| lr 7.02e-04 | 4180.55 ms | 32.3% bf16 MFU | 124733 tok/s step 5221/19560 | loss 3.471038 (-1.62z)| norm 0.2632 (-0.04z)| lr 7.02e-04 | 4195.55 ms | 32.2% bf16 MFU | 124745 tok/s step 5222/19560 | loss 3.557036 (+0.37z)| norm 0.2597 (-0.20z)| lr 7.02e-04 | 4180.33 ms | 32.3% bf16 MFU | 124778 tok/s step 5223/19560 | loss 3.550431 (+0.21z)| norm 0.2557 (-0.39z)| lr 7.02e-04 | 4183.34 ms | 32.3% bf16 MFU | 124806 tok/s step 5224/19560 | loss 3.534295 (-0.16z)| norm 0.2612 (-0.15z)| lr 7.02e-04 | 4232.96 ms | 31.9% bf16 MFU | 124758 tok/s step 5225/19560 | loss 3.514409 (-0.63z)| norm 
0.2521 (-0.57z)| lr 7.02e-04 | 4181.81 ms | 32.3% bf16 MFU | 124789 tok/s step 5226/19560 | loss 3.534298 (-0.15z)| norm 0.2624 (-0.09z)| lr 7.02e-04 | 4180.33 ms | 32.3% bf16 MFU | 124821 tok/s step 5227/19560 | loss 3.484602 (-1.34z)| norm 0.2454 (-0.85z)| lr 7.02e-04 | 4186.70 ms | 32.2% bf16 MFU | 124841 tok/s step 5228/19560 | loss 3.533113 (-0.17z)| norm 0.3184 (+2.38z)| lr 7.02e-04 | 4206.73 ms | 32.1% bf16 MFU | 124830 tok/s step 5229/19560 | loss 3.536532 (-0.10z)| norm 0.2550 (-0.42z)| lr 7.02e-04 | 4181.81 ms | 32.3% bf16 MFU | 124858 tok/s step 5230/19560 | loss 3.596157 (+1.32z)| norm 0.2763 (+0.52z)| lr 7.02e-04 | 4199.67 ms | 32.1% bf16 MFU | 124857 tok/s step 5231/19560 | loss 3.558837 (+0.42z)| norm 0.2782 (+0.59z)| lr 7.02e-04 | 4259.32 ms | 31.7% bf16 MFU | 124768 tok/s step 5232/19560 | loss 3.481936 (-1.40z)| norm 0.3227 (+2.49z)| lr 7.02e-04 | 4174.49 ms | 32.3% bf16 MFU | 124810 tok/s step 5233/19560 | loss 3.510501 (-0.72z)| norm 0.3254 (+2.52z)| lr 7.02e-04 | 4484.78 ms | 30.1% bf16 MFU | 124414 tok/s step 5234/19560 | loss 3.563790 (+0.53z)| norm 0.2970 (+1.30z)| lr 7.02e-04 | 4185.72 ms | 32.3% bf16 MFU | 124457 tok/s step 5235/19560 | loss 3.503371 (-0.90z)| norm 0.2513 (-0.61z)| lr 7.02e-04 | 4168.56 ms | 32.4% bf16 MFU | 124522 tok/s step 5236/19560 | loss 3.554349 (+0.30z)| norm 0.2732 (+0.30z)| lr 7.02e-04 | 4191.27 ms | 32.2% bf16 MFU | 124551 tok/s step 5237/19560 | loss 3.456747 (-1.98z)| norm 0.2608 (-0.23z)| lr 7.02e-04 | 4199.87 ms | 32.1% bf16 MFU | 124565 tok/s step 5238/19560 | loss 3.530731 (-0.24z)| norm 0.2381 (-1.18z)| lr 7.01e-04 | 4176.48 ms | 32.3% bf16 MFU | 124613 tok/s step 5239/19560 | loss 3.506846 (-0.80z)| norm 0.2537 (-0.52z)| lr 7.01e-04 | 4190.11 ms | 32.2% bf16 MFU | 124639 tok/s step 5240/19560 | loss 3.506785 (-0.79z)| norm 0.2786 (+0.52z)| lr 7.01e-04 | 4175.26 ms | 32.3% bf16 MFU | 124685 tok/s step 5241/19560 | loss 3.552791 (+0.29z)| norm 0.2596 (-0.28z)| lr 7.01e-04 | 4234.41 ms | 31.9% bf16 MFU | 
124642 tok/s step 5242/19560 | loss 3.495846 (-1.03z)| norm 0.2791 (+0.53z)| lr 7.01e-04 | 4170.11 ms | 32.4% bf16 MFU | 124696 tok/s step 5243/19560 | loss 3.520910 (-0.45z)| norm 0.2772 (+0.44z)| lr 7.01e-04 | 4213.67 ms | 32.0% bf16 MFU | 124683 tok/s step 5244/19560 | loss 3.517258 (-0.52z)| norm 0.2495 (-0.73z)| lr 7.01e-04 | 4204.43 ms | 32.1% bf16 MFU | 124683 tok/s step 5245/19560 | loss 3.511322 (-0.66z)| norm 0.3022 (+1.59z)| lr 7.01e-04 | 4179.94 ms | 32.3% bf16 MFU | 124721 tok/s step 5246/19560 | loss 3.529147 (-0.23z)| norm 0.3085 (+1.83z)| lr 7.01e-04 | 4185.21 ms | 32.3% bf16 MFU | 124748 tok/s step 5247/19560 | loss 3.498320 (-0.95z)| norm 0.2492 (-0.78z)| lr 7.01e-04 | 4175.11 ms | 32.3% bf16 MFU | 124790 tok/s step 5248/19560 | loss 3.497958 (-0.95z)| norm 0.2824 (+0.68z)| lr 7.01e-04 | 4188.73 ms | 32.2% bf16 MFU | 124808 tok/s step 5249/19560 | loss 3.524516 (-0.32z)| norm 0.2715 (+0.19z)| lr 7.01e-04 | 4195.11 ms | 32.2% bf16 MFU | 124817 tok/s step 5250/19560 | loss 3.541123 (+0.08z)| norm 0.2435 (-1.03z)| lr 7.01e-04 | 4183.41 ms | 32.3% bf16 MFU | 124842 tok/s val loss 3.535002 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 
280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 
evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2785/10042 = 0.277335 step 5251/19560 | loss 3.451924 (-1.98z)| norm 0.2597 (-0.31z)| lr 7.01e-04 | 4246.63 ms | 31.8% bf16 MFU | 124773 tok/s step 5252/19560 | loss 3.561245 (+0.55z)| norm 0.2623 (-0.19z)| lr 7.01e-04 | 4183.68 ms | 32.3% bf16 MFU | 124800 tok/s step 5253/19560 | loss 3.524494 (-0.30z)| norm 0.2695 (+0.13z)| lr 7.01e-04 | 4236.25 ms | 31.9% bf16 MFU | 124748 tok/s step 5254/19560 | loss 3.544321 (+0.16z)| norm 0.2555 (-0.50z)| lr 7.01e-04 | 4191.92 ms | 32.2% bf16 MFU | 124765 tok/s step 5255/19560 | loss 3.519950 (-0.40z)| norm 0.2212 (-1.99z)| lr 7.01e-04 | 4980.22 ms | 27.1% bf16 MFU | 123790 tok/s step 5256/19560 | loss 3.607064 (+1.62z)| norm 0.2491 (-0.76z)| lr 7.01e-04 | 4431.00 ms | 30.5% bf16 MFU | 123517 tok/s step 5257/19560 | loss 3.566784 (+0.68z)| norm 0.2943 (+1.23z)| lr 7.01e-04 | 4340.77 ms | 31.1% bf16 MFU | 123380 tok/s step 5258/19560 | loss 3.531630 (-0.14z)| norm 0.2946 (+1.25z)| lr 
7.01e-04 | 4267.26 ms | 31.6% bf16 MFU | 123354 tok/s step 5259/19560 | loss 3.508586 (-0.67z)| norm 0.2564 (-0.44z)| lr 7.01e-04 | 4240.43 ms | 31.8% bf16 MFU | 123368 tok/s step 5260/19560 | loss 3.568215 (+0.71z)| norm 0.2663 (+0.01z)| lr 7.00e-04 | 4218.60 ms | 32.0% bf16 MFU | 123414 tok/s step 5261/19560 | loss 3.568771 (+0.71z)| norm 0.2688 (+0.12z)| lr 7.00e-04 | 4375.32 ms | 30.9% bf16 MFU | 123235 tok/s step 5262/19560 | loss 3.511370 (-0.61z)| norm 0.2396 (-1.16z)| lr 7.00e-04 | 4240.94 ms | 31.8% bf16 MFU | 123254 tok/s step 5263/19560 | loss 3.533985 (-0.08z)| norm 0.2750 (+0.41z)| lr 7.00e-04 | 4250.49 ms | 31.8% bf16 MFU | 123259 tok/s step 5264/19560 | loss 3.592470 (+1.27z)| norm 0.2604 (-0.22z)| lr 7.00e-04 | 4223.12 ms | 32.0% bf16 MFU | 123303 tok/s step 5265/19560 | loss 3.460492 (-1.75z)| norm 0.2875 (+1.06z)| lr 7.00e-04 | 4222.37 ms | 32.0% bf16 MFU | 123347 tok/s step 5266/19560 | loss 3.512805 (-0.54z)| norm 0.2674 (+0.12z)| lr 7.00e-04 | 4230.95 ms | 31.9% bf16 MFU | 123375 tok/s step 5267/19560 | loss 3.583721 (+1.08z)| norm 0.2830 (+0.86z)| lr 7.00e-04 | 4206.64 ms | 32.1% bf16 MFU | 123438 tok/s step 5268/19560 | loss 3.529719 (-0.17z)| norm 0.2664 (+0.06z)| lr 7.00e-04 | 4264.80 ms | 31.7% bf16 MFU | 123413 tok/s step 5269/19560 | loss 3.565638 (+0.65z)| norm 0.2627 (-0.11z)| lr 7.00e-04 | 4159.33 ms | 32.5% bf16 MFU | 123545 tok/s step 5270/19560 | loss 3.523654 (-0.31z)| norm 0.2404 (-1.18z)| lr 7.00e-04 | 4309.92 ms | 31.3% bf16 MFU | 123450 tok/s step 5271/19560 | loss 3.575198 (+0.87z)| norm 0.2489 (-0.76z)| lr 7.00e-04 | 4203.66 ms | 32.1% bf16 MFU | 123513 tok/s step 5272/19560 | loss 3.535787 (-0.03z)| norm 0.2832 (+0.87z)| lr 7.00e-04 | 4179.12 ms | 32.3% bf16 MFU | 123610 tok/s step 5273/19560 | loss 3.550714 (+0.30z)| norm 0.2407 (-1.16z)| lr 7.00e-04 | 4227.53 ms | 31.9% bf16 MFU | 123631 tok/s step 5274/19560 | loss 3.521185 (-0.37z)| norm 0.2302 (-1.63z)| lr 7.00e-04 | 4171.03 ms | 32.4% bf16 MFU | 123734 tok/s step 
5275/19560 | loss 3.525573 (-0.26z)| norm 0.2480 (-0.79z)| lr 7.00e-04 | 4176.46 ms | 32.3% bf16 MFU | 123824 tok/s step 5276/19560 | loss 3.518181 (-0.44z)| norm 0.2653 (+0.03z)| lr 7.00e-04 | 4173.51 ms | 32.4% bf16 MFU | 123914 tok/s step 5277/19560 | loss 3.540255 (+0.08z)| norm 0.3141 (+2.26z)| lr 7.00e-04 | 4173.01 ms | 32.4% bf16 MFU | 124000 tok/s step 5278/19560 | loss 3.578244 (+0.95z)| norm 0.3042 (+1.77z)| lr 7.00e-04 | 4214.22 ms | 32.0% bf16 MFU | 124021 tok/s step 5279/19560 | loss 3.525182 (-0.30z)| norm 0.2736 (+0.36z)| lr 7.00e-04 | 4166.48 ms | 32.4% bf16 MFU | 124111 tok/s step 5280/19560 | loss 3.479156 (-1.36z)| norm 0.2655 (-0.03z)| lr 7.00e-04 | 4398.48 ms | 30.7% bf16 MFU | 123866 tok/s step 5281/19560 | loss 3.537103 (+0.00z)| norm 0.2899 (+1.08z)| lr 7.00e-04 | 4168.73 ms | 32.4% bf16 MFU | 123961 tok/s step 5282/19560 | loss 3.503822 (-0.78z)| norm 0.2649 (-0.07z)| lr 6.99e-04 | 4166.34 ms | 32.4% bf16 MFU | 124055 tok/s step 5283/19560 | loss 3.556685 (+0.45z)| norm 0.2809 (+0.66z)| lr 6.99e-04 | 4214.25 ms | 32.0% bf16 MFU | 124072 tok/s step 5284/19560 | loss 3.734946 (+4.29z)| norm 0.2633 (-0.18z)| lr 6.99e-04 | 4160.75 ms | 32.5% bf16 MFU | 124169 tok/s step 5285/19560 | loss 3.585285 (+1.00z)| norm 0.3438 (+3.43z)| lr 6.99e-04 | 4223.12 ms | 32.0% bf16 MFU | 124168 tok/s step 5286/19560 | loss 3.481747 (-1.23z)| norm 0.3168 (+2.15z)| lr 6.99e-04 | 4166.70 ms | 32.4% bf16 MFU | 124251 tok/s step 5287/19560 | loss 3.576896 (+0.85z)| norm 0.3087 (+1.76z)| lr 6.99e-04 | 4169.33 ms | 32.4% bf16 MFU | 124326 tok/s step 5288/19560 | loss 3.530232 (-0.18z)| norm 0.2614 (-0.30z)| lr 6.99e-04 | 4193.30 ms | 32.2% bf16 MFU | 124361 tok/s step 5289/19560 | loss 3.593691 (+1.22z)| norm 0.2902 (+0.97z)| lr 6.99e-04 | 4293.20 ms | 31.4% bf16 MFU | 124249 tok/s step 5290/19560 | loss 3.506871 (-0.71z)| norm 0.2801 (+0.54z)| lr 6.99e-04 | 4173.30 ms | 32.4% bf16 MFU | 124318 tok/s step 5291/19560 | loss 3.580234 (+0.91z)| norm 0.2524 (-0.69z)| lr 
6.99e-04 | 4170.01 ms | 32.4% bf16 MFU | 124389 tok/s step 5292/19560 | loss 3.642256 (+2.23z)| norm 0.2880 (+0.88z)| lr 6.99e-04 | 4202.05 ms | 32.1% bf16 MFU | 124408 tok/s step 5293/19560 | loss 3.526695 (-0.30z)| norm 0.2910 (+1.00z)| lr 6.99e-04 | 4185.81 ms | 32.3% bf16 MFU | 124450 tok/s step 5294/19560 | loss 3.505867 (-0.74z)| norm 0.3073 (+1.69z)| lr 6.99e-04 | 4172.56 ms | 32.4% bf16 MFU | 124510 tok/s step 5295/19560 | loss 3.578725 (+0.86z)| norm 0.2745 (+0.26z)| lr 6.99e-04 | 4176.52 ms | 32.3% bf16 MFU | 124561 tok/s step 5296/19560 | loss 3.540157 (+0.00z)| norm 0.2635 (-0.23z)| lr 6.99e-04 | 4167.70 ms | 32.4% bf16 MFU | 124623 tok/s step 5297/19560 | loss 3.516121 (-0.52z)| norm 0.2541 (-0.64z)| lr 6.99e-04 | 4165.82 ms | 32.4% bf16 MFU | 124685 tok/s step 5298/19560 | loss 3.492394 (-1.04z)| norm 0.2426 (-1.14z)| lr 6.99e-04 | 4247.87 ms | 31.8% bf16 MFU | 124622 tok/s step 5299/19560 | loss 3.496374 (-0.93z)| norm 0.2465 (-0.95z)| lr 6.99e-04 | 4169.14 ms | 32.4% bf16 MFU | 124678 tok/s step 5300/19560 | loss 3.534528 (-0.09z)| norm 0.2403 (-1.21z)| lr 6.99e-04 | 4752.18 ms | 28.4% bf16 MFU | 123961 tok/s step 5301/19560 | loss 3.512235 (-0.58z)| norm 0.2391 (-1.25z)| lr 6.99e-04 | 4162.31 ms | 32.4% bf16 MFU | 124061 tok/s step 5302/19560 | loss 3.555005 (+0.36z)| norm 0.2526 (-0.68z)| lr 6.99e-04 | 4165.60 ms | 32.4% bf16 MFU | 124151 tok/s step 5303/19560 | loss 3.540328 (+0.03z)| norm 0.2432 (-1.08z)| lr 6.99e-04 | 4166.37 ms | 32.4% bf16 MFU | 124235 tok/s step 5304/19560 | loss 3.504585 (-0.77z)| norm 0.2312 (-1.57z)| lr 6.99e-04 | 4168.82 ms | 32.4% bf16 MFU | 124311 tok/s step 5305/19560 | loss 3.531945 (-0.16z)| norm 0.2705 (+0.12z)| lr 6.98e-04 | 4169.84 ms | 32.4% bf16 MFU | 124383 tok/s step 5306/19560 | loss 3.529895 (-0.21z)| norm 0.2556 (-0.52z)| lr 6.98e-04 | 4255.56 ms | 31.7% bf16 MFU | 124323 tok/s step 5307/19560 | loss 3.540851 (+0.05z)| norm 0.2393 (-1.21z)| lr 6.98e-04 | 4170.21 ms | 32.4% bf16 MFU | 124393 tok/s step 
5308/19560 | loss 3.494848 (-0.98z)| norm 0.2277 (-1.68z)| lr 6.98e-04 | 4163.33 ms | 32.4% bf16 MFU | 124470 tok/s step 5309/19560 | loss 3.541839 (+0.07z)| norm 0.2393 (-1.17z)| lr 6.98e-04 | 4190.71 ms | 32.2% bf16 MFU | 124502 tok/s step 5310/19560 | loss 3.514661 (-0.53z)| norm 0.2384 (-1.19z)| lr 6.98e-04 | 4168.10 ms | 32.4% bf16 MFU | 124566 tok/s step 5311/19560 | loss 3.595737 (+1.29z)| norm 0.2609 (-0.24z)| lr 6.98e-04 | 4176.02 ms | 32.3% bf16 MFU | 124615 tok/s step 5312/19560 | loss 3.513038 (-0.57z)| norm 0.2390 (-1.15z)| lr 6.98e-04 | 4167.99 ms | 32.4% bf16 MFU | 124674 tok/s step 5313/19560 | loss 3.548674 (+0.24z)| norm 0.2258 (-1.67z)| lr 6.98e-04 | 4172.35 ms | 32.4% bf16 MFU | 124723 tok/s step 5314/19560 | loss 3.501823 (-0.81z)| norm 0.2457 (-0.84z)| lr 6.98e-04 | 4166.12 ms | 32.4% bf16 MFU | 124779 tok/s step 5315/19560 | loss 3.498389 (-0.90z)| norm 0.2442 (-0.89z)| lr 6.98e-04 | 4179.08 ms | 32.3% bf16 MFU | 124813 tok/s step 5316/19560 | loss 3.568135 (+0.75z)| norm 0.2415 (-0.99z)| lr 6.98e-04 | 4162.99 ms | 32.4% bf16 MFU | 124869 tok/s step 5317/19560 | loss 3.470084 (-1.55z)| norm 0.2426 (-0.94z)| lr 6.98e-04 | 4175.77 ms | 32.3% bf16 MFU | 124904 tok/s step 5318/19560 | loss 3.527625 (-0.20z)| norm 0.2339 (-1.30z)| lr 6.98e-04 | 4195.10 ms | 32.2% bf16 MFU | 124907 tok/s step 5319/19560 | loss 3.529567 (-0.17z)| norm 0.2679 (+0.10z)| lr 6.98e-04 | 4168.60 ms | 32.4% bf16 MFU | 124951 tok/s step 5320/19560 | loss 3.521679 (-0.36z)| norm 0.2463 (-0.78z)| lr 6.98e-04 | 4179.88 ms | 32.3% bf16 MFU | 124975 tok/s step 5321/19560 | loss 3.591803 (+1.30z)| norm 0.2557 (-0.39z)| lr 6.98e-04 | 4178.96 ms | 32.3% bf16 MFU | 124999 tok/s step 5322/19560 | loss 3.543369 (+0.15z)| norm 0.2694 (+0.17z)| lr 6.98e-04 | 4174.10 ms | 32.3% bf16 MFU | 125029 tok/s step 5323/19560 | loss 3.504033 (-0.77z)| norm 0.2596 (-0.25z)| lr 6.98e-04 | 4167.90 ms | 32.4% bf16 MFU | 125067 tok/s step 5324/19560 | loss 3.569647 (+0.78z)| norm 0.2972 (+1.29z)| lr 
6.98e-04 | 4178.21 ms | 32.3% bf16 MFU | 125088 tok/s step 5325/19560 | loss 3.511765 (-0.58z)| norm 0.2808 (+0.61z)| lr 6.98e-04 | 4187.75 ms | 32.2% bf16 MFU | 125093 tok/s step 5326/19560 | loss 3.539830 (+0.07z)| norm 0.2482 (-0.73z)| lr 6.98e-04 | 4176.65 ms | 32.3% bf16 MFU | 125115 tok/s step 5327/19560 | loss 3.512917 (-0.57z)| norm 0.2716 (+0.23z)| lr 6.97e-04 | 4163.27 ms | 32.4% bf16 MFU | 125156 tok/s step 5328/19560 | loss 3.560056 (+0.55z)| norm 0.2615 (-0.18z)| lr 6.97e-04 | 4175.47 ms | 32.3% bf16 MFU | 125176 tok/s step 5329/19560 | loss 3.601131 (+1.59z)| norm 0.2550 (-0.45z)| lr 6.97e-04 | 4173.25 ms | 32.4% bf16 MFU | 125199 tok/s step 5330/19560 | loss 3.553545 (+0.41z)| norm 0.2816 (+0.64z)| lr 6.97e-04 | 4275.04 ms | 31.6% bf16 MFU | 125071 tok/s step 5331/19560 | loss 3.616700 (+1.92z)| norm 0.2762 (+0.42z)| lr 6.97e-04 | 4165.46 ms | 32.4% bf16 MFU | 125111 tok/s step 5332/19560 | loss 3.511139 (-0.64z)| norm 0.2738 (+0.31z)| lr 6.97e-04 | 4276.27 ms | 31.6% bf16 MFU | 124985 tok/s step 5333/19560 | loss 3.622972 (+2.03z)| norm 0.2778 (+0.47z)| lr 6.97e-04 | 4170.31 ms | 32.4% bf16 MFU | 125022 tok/s step 5334/19560 | loss 3.678893 (+3.22z)| norm 0.2630 (-0.15z)| lr 6.97e-04 | 4177.13 ms | 32.3% bf16 MFU | 125047 tok/s step 5335/19560 | loss 3.568190 (+0.67z)| norm 0.2488 (-0.75z)| lr 6.97e-04 | 4172.51 ms | 32.4% bf16 MFU | 125077 tok/s step 5336/19560 | loss 3.540586 (+0.03z)| norm 0.2390 (-1.15z)| lr 6.97e-04 | 4166.62 ms | 32.4% bf16 MFU | 125115 tok/s step 5337/19560 | loss 3.569114 (+0.69z)| norm 0.2475 (-0.79z)| lr 6.97e-04 | 4167.40 ms | 32.4% bf16 MFU | 125149 tok/s step 5338/19560 | loss 3.546875 (+0.17z)| norm 0.2327 (-1.38z)| lr 6.97e-04 | 4166.85 ms | 32.4% bf16 MFU | 125183 tok/s step 5339/19560 | loss 3.565690 (+0.60z)| norm 0.2384 (-1.14z)| lr 6.97e-04 | 4163.71 ms | 32.4% bf16 MFU | 125220 tok/s step 5340/19560 | loss 3.489046 (-1.19z)| norm 0.2537 (-0.50z)| lr 6.97e-04 | 4177.54 ms | 32.3% bf16 MFU | 125234 tok/s step 
5341/19560 | loss 3.555392 (+0.40z)| norm 0.2569 (-0.36z)| lr 6.97e-04 | 4180.87 ms | 32.3% bf16 MFU | 125242 tok/s step 5342/19560 | loss 3.479929 (-1.39z)| norm 0.2401 (-1.06z)| lr 6.97e-04 | 4164.47 ms | 32.4% bf16 MFU | 125275 tok/s step 5343/19560 | loss 3.537466 (-0.01z)| norm 0.2599 (-0.21z)| lr 6.97e-04 | 4156.91 ms | 32.5% bf16 MFU | 125317 tok/s step 5344/19560 | loss 3.532167 (-0.14z)| norm 0.2466 (-0.78z)| lr 6.97e-04 | 4182.12 ms | 32.3% bf16 MFU | 125320 tok/s step 5345/19560 | loss 3.496706 (-0.98z)| norm 0.2545 (-0.44z)| lr 6.97e-04 | 4197.70 ms | 32.2% bf16 MFU | 125299 tok/s step 5346/19560 | loss 3.568421 (+0.73z)| norm 0.2658 (+0.06z)| lr 6.97e-04 | 4176.41 ms | 32.3% bf16 MFU | 125311 tok/s step 5347/19560 | loss 3.628762 (+2.16z)| norm 0.2566 (-0.34z)| lr 6.97e-04 | 4161.63 ms | 32.4% bf16 MFU | 125344 tok/s step 5348/19560 | loss 3.596574 (+1.37z)| norm 0.2350 (-1.26z)| lr 6.97e-04 | 4180.48 ms | 32.3% bf16 MFU | 125348 tok/s step 5349/19560 | loss 3.575496 (+0.86z)| norm 0.3008 (+1.57z)| lr 6.96e-04 | 4175.18 ms | 32.3% bf16 MFU | 125359 tok/s step 5350/19560 | loss 3.549912 (+0.25z)| norm 0.2671 (+0.12z)| lr 6.96e-04 | 4169.82 ms | 32.4% bf16 MFU | 125378 tok/s step 5351/19560 | loss 3.534197 (-0.12z)| norm 0.2605 (-0.17z)| lr 6.96e-04 | 4175.82 ms | 32.3% bf16 MFU | 125386 tok/s step 5352/19560 | loss 3.589122 (+1.17z)| norm 0.2501 (-0.61z)| lr 6.96e-04 | 4173.40 ms | 32.4% bf16 MFU | 125398 tok/s step 5353/19560 | loss 3.545306 (+0.13z)| norm 0.2979 (+1.42z)| lr 6.96e-04 | 4172.80 ms | 32.4% bf16 MFU | 125411 tok/s step 5354/19560 | loss 3.534804 (-0.12z)| norm 0.3110 (+1.93z)| lr 6.96e-04 | 4173.77 ms | 32.3% bf16 MFU | 125421 tok/s step 5355/19560 | loss 3.577452 (+0.88z)| norm 0.2677 (+0.10z)| lr 6.96e-04 | 4166.49 ms | 32.4% bf16 MFU | 125442 tok/s step 5356/19560 | loss 3.467440 (-1.71z)| norm 0.2798 (+0.64z)| lr 6.96e-04 | 4172.61 ms | 32.4% bf16 MFU | 125452 tok/s step 5357/19560 | loss 3.579247 (+0.91z)| norm 0.2865 (+0.91z)| lr 
6.96e-04 | 4183.49 ms | 32.3% bf16 MFU | 125446 tok/s step 5358/19560 | loss 3.535086 (-0.12z)| norm 0.2957 (+1.29z)| lr 6.96e-04 | 4166.23 ms | 32.4% bf16 MFU | 125465 tok/s step 5359/19560 | loss 3.559136 (+0.45z)| norm 0.2575 (-0.33z)| lr 6.96e-04 | 4166.91 ms | 32.4% bf16 MFU | 125483 tok/s step 5360/19560 | loss 3.515646 (-0.59z)| norm 0.2608 (-0.17z)| lr 6.96e-04 | 4174.75 ms | 32.3% bf16 MFU | 125488 tok/s step 5361/19560 | loss 3.530831 (-0.23z)| norm 0.2468 (-0.78z)| lr 6.96e-04 | 4201.79 ms | 32.1% bf16 MFU | 125453 tok/s step 5362/19560 | loss 3.567829 (+0.65z)| norm 0.2616 (-0.10z)| lr 6.96e-04 | 4166.36 ms | 32.4% bf16 MFU | 125472 tok/s step 5363/19560 | loss 3.578191 (+0.89z)| norm 0.2669 (+0.14z)| lr 6.96e-04 | 4170.55 ms | 32.4% bf16 MFU | 125484 tok/s step 5364/19560 | loss 3.473295 (-1.58z)| norm 0.2605 (-0.15z)| lr 6.96e-04 | 4162.94 ms | 32.4% bf16 MFU | 125507 tok/s step 5365/19560 | loss 3.716895 (+3.92z)| norm 0.2782 (+0.65z)| lr 6.96e-04 | 4195.55 ms | 32.2% bf16 MFU | 125480 tok/s step 5366/19560 | loss 3.531278 (-0.25z)| norm 0.2696 (+0.24z)| lr 6.96e-04 | 4165.57 ms | 32.4% bf16 MFU | 125499 tok/s step 5367/19560 | loss 3.504416 (-0.85z)| norm 0.2602 (-0.19z)| lr 6.96e-04 | 4164.09 ms | 32.4% bf16 MFU | 125519 tok/s step 5368/19560 | loss 3.577072 (+0.77z)| norm 0.3114 (+2.10z)| lr 6.96e-04 | 4173.53 ms | 32.4% bf16 MFU | 125524 tok/s step 5369/19560 | loss 3.494879 (-1.07z)| norm 0.2991 (+1.52z)| lr 6.96e-04 | 4163.69 ms | 32.4% bf16 MFU | 125544 tok/s step 5370/19560 | loss 3.561086 (+0.41z)| norm 0.2478 (-0.74z)| lr 6.96e-04 | 4171.05 ms | 32.4% bf16 MFU | 125552 tok/s step 5371/19560 | loss 3.618894 (+1.67z)| norm 0.3049 (+1.76z)| lr 6.95e-04 | 4167.30 ms | 32.4% bf16 MFU | 125565 tok/s step 5372/19560 | loss 3.544177 (+0.00z)| norm 0.2890 (+1.05z)| lr 6.95e-04 | 4169.17 ms | 32.4% bf16 MFU | 125574 tok/s step 5373/19560 | loss 3.520733 (-0.52z)| norm 0.2398 (-1.10z)| lr 6.95e-04 | 4169.99 ms | 32.4% bf16 MFU | 125582 tok/s step 
5374/19560 | loss 3.542421 (-0.04z)| norm 0.2626 (-0.08z)| lr 6.95e-04 | 4171.89 ms | 32.4% bf16 MFU | 125586 tok/s step 5375/19560 | loss 3.574206 (+0.66z)| norm 0.2357 (-1.27z)| lr 6.95e-04 | 4172.56 ms | 32.4% bf16 MFU | 125590 tok/s step 5376/19560 | loss 3.546087 (+0.02z)| norm 0.2593 (-0.21z)| lr 6.95e-04 | 4165.25 ms | 32.4% bf16 MFU | 125604 tok/s step 5377/19560 | loss 3.563746 (+0.41z)| norm 0.2654 (+0.07z)| lr 6.95e-04 | 4173.38 ms | 32.4% bf16 MFU | 125605 tok/s step 5378/19560 | loss 3.491122 (-1.20z)| norm 0.2481 (-0.71z)| lr 6.95e-04 | 4167.70 ms | 32.4% bf16 MFU | 125615 tok/s step 5379/19560 | loss 3.453118 (-2.05z)| norm 0.2477 (-0.72z)| lr 6.95e-04 | 4165.12 ms | 32.4% bf16 MFU | 125628 tok/s step 5380/19560 | loss 3.497043 (-1.06z)| norm 0.2475 (-0.72z)| lr 6.95e-04 | 4167.48 ms | 32.4% bf16 MFU | 125636 tok/s step 5381/19560 | loss 3.506272 (-0.85z)| norm 0.2295 (-1.50z)| lr 6.95e-04 | 4165.19 ms | 32.4% bf16 MFU | 125648 tok/s step 5382/19560 | loss 3.586234 (+0.92z)| norm 0.2838 (+0.89z)| lr 6.95e-04 | 4170.97 ms | 32.4% bf16 MFU | 125651 tok/s step 5383/19560 | loss 3.522834 (-0.49z)| norm 0.2957 (+1.39z)| lr 6.95e-04 | 4166.52 ms | 32.4% bf16 MFU | 125660 tok/s step 5384/19560 | loss 3.542134 (-0.05z)| norm 0.2766 (+0.54z)| lr 6.95e-04 | 4176.51 ms | 32.3% bf16 MFU | 125654 tok/s step 5385/19560 | loss 3.504331 (-0.88z)| norm 0.3094 (+1.97z)| lr 6.95e-04 | 4171.32 ms | 32.4% bf16 MFU | 125655 tok/s step 5386/19560 | loss 3.566149 (+0.49z)| norm 0.2940 (+1.30z)| lr 6.95e-04 | 4173.85 ms | 32.3% bf16 MFU | 125653 tok/s step 5387/19560 | loss 3.511626 (-0.72z)| norm 0.2704 (+0.25z)| lr 6.95e-04 | 4171.03 ms | 32.4% bf16 MFU | 125655 tok/s step 5388/19560 | loss 3.508883 (-0.77z)| norm 0.2696 (+0.22z)| lr 6.95e-04 | 4182.69 ms | 32.3% bf16 MFU | 125640 tok/s step 5389/19560 | loss 3.553175 (+0.21z)| norm 0.2553 (-0.41z)| lr 6.95e-04 | 4178.62 ms | 32.3% bf16 MFU | 125631 tok/s step 5390/19560 | loss 3.478412 (-1.43z)| norm 0.3085 (+1.89z)| lr 
6.95e-04 | 4163.75 ms | 32.4% bf16 MFU | 125646 tok/s step 5391/19560 | loss 3.547930 (+0.10z)| norm 0.2849 (+0.85z)| lr 6.95e-04 | 4186.29 ms | 32.3% bf16 MFU | 125625 tok/s step 5392/19560 | loss 3.507964 (-0.77z)| norm 0.2899 (+1.06z)| lr 6.95e-04 | 4175.90 ms | 32.3% bf16 MFU | 125622 tok/s step 5393/19560 | loss 3.529540 (-0.31z)| norm 0.2559 (-0.41z)| lr 6.94e-04 | 4171.84 ms | 32.4% bf16 MFU | 125624 tok/s step 5394/19560 | loss 3.550858 (+0.16z)| norm 0.2655 (+0.01z)| lr 6.94e-04 | 4166.42 ms | 32.4% bf16 MFU | 125635 tok/s step 5395/19560 | loss 3.499396 (-0.98z)| norm 0.2986 (+1.43z)| lr 6.94e-04 | 4168.43 ms | 32.4% bf16 MFU | 125642 tok/s step 5396/19560 | loss 3.586320 (+0.96z)| norm 0.2772 (+0.51z)| lr 6.94e-04 | 4168.14 ms | 32.4% bf16 MFU | 125649 tok/s step 5397/19560 | loss 3.543117 (-0.00z)| norm 0.2481 (-0.74z)| lr 6.94e-04 | 4168.91 ms | 32.4% bf16 MFU | 125655 tok/s step 5398/19560 | loss 3.557750 (+0.32z)| norm 0.2598 (-0.25z)| lr 6.94e-04 | 4172.51 ms | 32.4% bf16 MFU | 125655 tok/s step 5399/19560 | loss 3.489588 (-1.19z)| norm 0.2523 (-0.57z)| lr 6.94e-04 | 4163.02 ms | 32.4% bf16 MFU | 125669 tok/s step 5400/19560 | loss 3.557071 (+0.32z)| norm 0.2373 (-1.20z)| lr 6.94e-04 | 4162.11 ms | 32.4% bf16 MFU | 125684 tok/s step 5401/19560 | loss 3.478432 (-1.42z)| norm 0.2429 (-0.96z)| lr 6.94e-04 | 4176.97 ms | 32.3% bf16 MFU | 125676 tok/s step 5402/19560 | loss 3.557915 (+0.34z)| norm 0.2382 (-1.17z)| lr 6.94e-04 | 4171.47 ms | 32.4% bf16 MFU | 125676 tok/s step 5403/19560 | loss 3.565903 (+0.51z)| norm 0.2351 (-1.29z)| lr 6.94e-04 | 4165.87 ms | 32.4% bf16 MFU | 125685 tok/s step 5404/19560 | loss 3.542154 (-0.02z)| norm 0.2263 (-1.64z)| lr 6.94e-04 | 4171.47 ms | 32.4% bf16 MFU | 125685 tok/s step 5405/19560 | loss 3.602517 (+1.30z)| norm 0.2306 (-1.44z)| lr 6.94e-04 | 4170.58 ms | 32.4% bf16 MFU | 125686 tok/s step 5406/19560 | loss 3.502244 (-0.90z)| norm 0.2650 (+0.05z)| lr 6.94e-04 | 4168.08 ms | 32.4% bf16 MFU | 125691 tok/s step 
5407/19560 | loss 3.557618 (+0.32z)| norm 0.2705 (+0.29z)| lr 6.94e-04 | 4235.69 ms | 31.9% bf16 MFU | 125595 tok/s step 5408/19560 | loss 3.507027 (-0.81z)| norm 0.2700 (+0.27z)| lr 6.94e-04 | 4173.14 ms | 32.4% bf16 MFU | 125597 tok/s step 5409/19560 | loss 3.543743 (+0.00z)| norm 0.2645 (+0.04z)| lr 6.94e-04 | 4170.35 ms | 32.4% bf16 MFU | 125603 tok/s step 5410/19560 | loss 3.515299 (-0.63z)| norm 0.2770 (+0.58z)| lr 6.94e-04 | 4173.19 ms | 32.4% bf16 MFU | 125605 tok/s step 5411/19560 | loss 3.518317 (-0.55z)| norm 0.3221 (+2.48z)| lr 6.94e-04 | 4165.58 ms | 32.4% bf16 MFU | 125618 tok/s step 5412/19560 | loss 3.550452 (+0.20z)| norm 0.2965 (+1.37z)| lr 6.94e-04 | 4179.25 ms | 32.3% bf16 MFU | 125609 tok/s step 5413/19560 | loss 3.537871 (-0.09z)| norm 0.2740 (+0.46z)| lr 6.94e-04 | 4167.97 ms | 32.4% bf16 MFU | 125618 tok/s step 5414/19560 | loss 3.546664 (+0.11z)| norm 0.2597 (-0.16z)| lr 6.93e-04 | 4170.44 ms | 32.4% bf16 MFU | 125623 tok/s step 5415/19560 | loss 3.518689 (-0.56z)| norm 0.2491 (-0.64z)| lr 6.93e-04 | 4168.44 ms | 32.4% bf16 MFU | 125631 tok/s step 5416/19560 | loss 3.540501 (-0.03z)| norm 0.2716 (+0.40z)| lr 6.93e-04 | 4168.00 ms | 32.4% bf16 MFU | 125639 tok/s step 5417/19560 | loss 3.619303 (+1.87z)| norm 0.2478 (-0.68z)| lr 6.93e-04 | 4169.58 ms | 32.4% bf16 MFU | 125644 tok/s step 5418/19560 | loss 3.583982 (+1.00z)| norm 0.2740 (+0.53z)| lr 6.93e-04 | 4165.72 ms | 32.4% bf16 MFU | 125655 tok/s step 5419/19560 | loss 3.517772 (-0.58z)| norm 0.2656 (+0.14z)| lr 6.93e-04 | 4164.92 ms | 32.4% bf16 MFU | 125666 tok/s step 5420/19560 | loss 3.527811 (-0.33z)| norm 0.2265 (-1.65z)| lr 6.93e-04 | 4180.64 ms | 32.3% bf16 MFU | 125653 tok/s step 5421/19560 | loss 3.525290 (-0.39z)| norm 0.2554 (-0.30z)| lr 6.93e-04 | 4164.89 ms | 32.4% bf16 MFU | 125665 tok/s step 5422/19560 | loss 3.481074 (-1.47z)| norm 0.2519 (-0.45z)| lr 6.93e-04 | 4174.54 ms | 32.3% bf16 MFU | 125661 tok/s step 5423/19560 | loss 3.563511 (+0.56z)| norm 0.2871 (+1.20z)| lr 
6.93e-04 | 4161.22 ms | 32.4% bf16 MFU | 125678 tok/s step 5424/19560 | loss 3.604739 (+1.55z)| norm 0.2832 (+1.01z)| lr 6.93e-04 | 4213.99 ms | 32.0% bf16 MFU | 125614 tok/s step 5425/19560 | loss 3.569296 (+0.67z)| norm 0.2633 (+0.07z)| lr 6.93e-04 | 4161.39 ms | 32.4% bf16 MFU | 125633 tok/s step 5426/19560 | loss 3.504173 (-0.92z)| norm 0.2467 (-0.71z)| lr 6.93e-04 | 4249.13 ms | 31.8% bf16 MFU | 125521 tok/s step 5427/19560 | loss 3.499440 (-1.03z)| norm 0.2497 (-0.57z)| lr 6.93e-04 | 4169.63 ms | 32.4% bf16 MFU | 125532 tok/s step 5428/19560 | loss 3.523549 (-0.44z)| norm 0.2610 (-0.05z)| lr 6.93e-04 | 4168.50 ms | 32.4% bf16 MFU | 125544 tok/s step 5429/19560 | loss 3.586369 (+1.07z)| norm 0.2570 (-0.24z)| lr 6.93e-04 | 4163.78 ms | 32.4% bf16 MFU | 125563 tok/s step 5430/19560 | loss 3.567129 (+0.60z)| norm 0.2791 (+0.79z)| lr 6.93e-04 | 4163.25 ms | 32.4% bf16 MFU | 125581 tok/s step 5431/19560 | loss 3.531997 (-0.25z)| norm 0.2901 (+1.29z)| lr 6.93e-04 | 4161.89 ms | 32.4% bf16 MFU | 125601 tok/s step 5432/19560 | loss 3.574481 (+0.77z)| norm 0.2362 (-1.26z)| lr 6.93e-04 | 4174.02 ms | 32.3% bf16 MFU | 125601 tok/s step 5433/19560 | loss 3.537114 (-0.14z)| norm 0.2421 (-0.96z)| lr 6.93e-04 | 4168.15 ms | 32.4% bf16 MFU | 125610 tok/s step 5434/19560 | loss 3.556499 (+0.33z)| norm 0.2604 (-0.10z)| lr 6.93e-04 | 4161.72 ms | 32.4% bf16 MFU | 125629 tok/s step 5435/19560 | loss 3.546004 (+0.07z)| norm 0.2398 (-1.07z)| lr 6.93e-04 | 4163.76 ms | 32.4% bf16 MFU | 125643 tok/s step 5436/19560 | loss 3.556077 (+0.30z)| norm 0.2607 (-0.10z)| lr 6.92e-04 | 4162.35 ms | 32.4% bf16 MFU | 125659 tok/s step 5437/19560 | loss 3.523277 (-0.49z)| norm 0.2458 (-0.81z)| lr 6.92e-04 | 4183.31 ms | 32.3% bf16 MFU | 125642 tok/s step 5438/19560 | loss 3.570491 (+0.65z)| norm 0.2459 (-0.82z)| lr 6.92e-04 | 4162.82 ms | 32.4% bf16 MFU | 125657 tok/s step 5439/19560 | loss 3.554869 (+0.28z)| norm 0.2427 (-0.96z)| lr 6.92e-04 | 4167.99 ms | 32.4% bf16 MFU | 125664 tok/s step 
5440/19560 | loss 3.500860 (-1.05z)| norm 0.2368 (-1.24z)| lr 6.92e-04 | 4169.88 ms | 32.4% bf16 MFU | 125667 tok/s step 5441/19560 | loss 3.424297 (-2.81z)| norm 0.2631 (+0.00z)| lr 6.92e-04 | 4195.17 ms | 32.2% bf16 MFU | 125633 tok/s step 5442/19560 | loss 3.558311 (+0.37z)| norm 0.2332 (-1.43z)| lr 6.92e-04 | 4170.37 ms | 32.4% bf16 MFU | 125637 tok/s step 5443/19560 | loss 3.525422 (-0.43z)| norm 0.2455 (-0.84z)| lr 6.92e-04 | 4260.05 ms | 31.7% bf16 MFU | 125509 tok/s step 5444/19560 | loss 3.519765 (-0.55z)| norm 0.2487 (-0.69z)| lr 6.92e-04 | 4180.92 ms | 32.3% bf16 MFU | 125503 tok/s step 5445/19560 | loss 3.539310 (-0.10z)| norm 0.2997 (+1.73z)| lr 6.92e-04 | 4566.29 ms | 29.6% bf16 MFU | 124969 tok/s step 5446/19560 | loss 3.595689 (+1.25z)| norm 0.2627 (-0.05z)| lr 6.92e-04 | 5288.01 ms | 25.5% bf16 MFU | 123678 tok/s step 5447/19560 | loss 3.487847 (-1.33z)| norm 0.2519 (-0.56z)| lr 6.92e-04 | 4758.97 ms | 28.4% bf16 MFU | 123002 tok/s step 5448/19560 | loss 3.632565 (+2.08z)| norm 0.2421 (-1.03z)| lr 6.92e-04 | 4248.28 ms | 31.8% bf16 MFU | 123023 tok/s step 5449/19560 | loss 3.574501 (+0.71z)| norm 0.2651 (+0.07z)| lr 6.92e-04 | 4157.65 ms | 32.5% bf16 MFU | 123177 tok/s step 5450/19560 | loss 3.533052 (-0.26z)| norm 0.2808 (+0.82z)| lr 6.92e-04 | 4424.09 ms | 30.5% bf16 MFU | 122943 tok/s step 5451/19560 | loss 3.539752 (-0.11z)| norm 0.2509 (-0.61z)| lr 6.92e-04 | 4157.79 ms | 32.5% bf16 MFU | 123101 tok/s step 5452/19560 | loss 3.519584 (-0.58z)| norm 0.2590 (-0.21z)| lr 6.92e-04 | 4240.71 ms | 31.8% bf16 MFU | 123128 tok/s step 5453/19560 | loss 3.533935 (-0.25z)| norm 0.2790 (+0.76z)| lr 6.92e-04 | 4388.87 ms | 30.8% bf16 MFU | 122944 tok/s step 5454/19560 | loss 3.566185 (+0.52z)| norm 0.2600 (-0.17z)| lr 6.92e-04 | 4161.14 ms | 32.4% bf16 MFU | 123097 tok/s step 5455/19560 | loss 3.535447 (-0.22z)| norm 0.2746 (+0.54z)| lr 6.92e-04 | 4164.86 ms | 32.4% bf16 MFU | 123236 tok/s step 5456/19560 | loss 3.490278 (-1.27z)| norm 0.2571 (-0.30z)| lr 
6.92e-04 | 4167.75 ms | 32.4% bf16 MFU | 123364 tok/s step 5457/19560 | loss 3.562926 (+0.45z)| norm 0.2525 (-0.53z)| lr 6.92e-04 | 4165.49 ms | 32.4% bf16 MFU | 123489 tok/s step 5458/19560 | loss 3.572649 (+0.68z)| norm 0.2901 (+1.29z)| lr 6.91e-04 | 4210.23 ms | 32.1% bf16 MFU | 123541 tok/s step 5459/19560 | loss 3.588777 (+1.08z)| norm 0.2605 (-0.14z)| lr 6.91e-04 | 4199.78 ms | 32.1% bf16 MFU | 123606 tok/s step 5460/19560 | loss 3.561526 (+0.42z)| norm 0.2687 (+0.26z)| lr 6.91e-04 | 4171.18 ms | 32.4% bf16 MFU | 123710 tok/s step 5461/19560 | loss 3.560344 (+0.41z)| norm 0.2697 (+0.31z)| lr 6.91e-04 | 4220.09 ms | 32.0% bf16 MFU | 123737 tok/s step 5462/19560 | loss 3.517349 (-0.63z)| norm 0.2344 (-1.37z)| lr 6.91e-04 | 4257.56 ms | 31.7% bf16 MFU | 123707 tok/s step 5463/19560 | loss 3.534560 (-0.19z)| norm 0.2422 (-0.99z)| lr 6.91e-04 | 4166.94 ms | 32.4% bf16 MFU | 123813 tok/s step 5464/19560 | loss 3.480873 (-1.53z)| norm 0.2563 (-0.33z)| lr 6.91e-04 | 4204.94 ms | 32.1% bf16 MFU | 123856 tok/s step 5465/19560 | loss 3.507529 (-0.85z)| norm 0.2693 (+0.29z)| lr 6.91e-04 | 4168.15 ms | 32.4% bf16 MFU | 123952 tok/s step 5466/19560 | loss 3.514165 (-0.67z)| norm 0.2724 (+0.43z)| lr 6.91e-04 | 4171.09 ms | 32.4% bf16 MFU | 124040 tok/s step 5467/19560 | loss 3.526933 (-0.34z)| norm 0.2635 (-0.01z)| lr 6.91e-04 | 4169.67 ms | 32.4% bf16 MFU | 124125 tok/s step 5468/19560 | loss 3.540198 (-0.02z)| norm 0.2568 (-0.34z)| lr 6.91e-04 | 4175.41 ms | 32.3% bf16 MFU | 124197 tok/s step 5469/19560 | loss 3.533414 (-0.19z)| norm 0.2633 (-0.03z)| lr 6.91e-04 | 4215.21 ms | 32.0% bf16 MFU | 124206 tok/s step 5470/19560 | loss 3.515692 (-0.65z)| norm 0.2425 (-1.05z)| lr 6.91e-04 | 4181.11 ms | 32.3% bf16 MFU | 124265 tok/s step 5471/19560 | loss 3.555636 (+0.37z)| norm 0.2712 (+0.35z)| lr 6.91e-04 | 4169.04 ms | 32.4% bf16 MFU | 124340 tok/s step 5472/19560 | loss 3.519257 (-0.56z)| norm 0.2672 (+0.15z)| lr 6.91e-04 | 4167.34 ms | 32.4% bf16 MFU | 124413 tok/s step 
5473/19560 | loss 3.516338 (-0.64z)| norm 0.2823 (+0.88z)| lr 6.91e-04 | 4205.51 ms | 32.1% bf16 MFU | 124426 tok/s step 5474/19560 | loss 3.468801 (-1.81z)| norm 0.2391 (-1.22z)| lr 6.91e-04 | 4235.91 ms | 31.9% bf16 MFU | 124393 tok/s step 5475/19560 | loss 3.589034 (+1.25z)| norm 0.2439 (-0.98z)| lr 6.91e-04 | 4173.21 ms | 32.4% bf16 MFU | 124455 tok/s step 5476/19560 | loss 3.502860 (-0.94z)| norm 0.2457 (-0.90z)| lr 6.91e-04 | 4164.92 ms | 32.4% bf16 MFU | 124527 tok/s step 5477/19560 | loss 3.547140 (+0.20z)| norm 0.2310 (-1.60z)| lr 6.91e-04 | 4177.08 ms | 32.3% bf16 MFU | 124576 tok/s step 5478/19560 | loss 3.522301 (-0.43z)| norm 0.2636 (+0.00z)| lr 6.91e-04 | 4170.79 ms | 32.4% bf16 MFU | 124632 tok/s step 5479/19560 | loss 3.560160 (+0.54z)| norm 0.2899 (+1.27z)| lr 6.90e-04 | 4173.53 ms | 32.4% bf16 MFU | 124682 tok/s step 5480/19560 | loss 3.539072 (+0.01z)| norm 0.2930 (+1.40z)| lr 6.90e-04 | 4168.00 ms | 32.4% bf16 MFU | 124737 tok/s step 5481/19560 | loss 3.562807 (+0.62z)| norm 0.2740 (+0.49z)| lr 6.90e-04 | 4180.57 ms | 32.3% bf16 MFU | 124771 tok/s step 5482/19560 | loss 3.619034 (+2.02z)| norm 0.3256 (+2.96z)| lr 6.90e-04 | 4174.04 ms | 32.3% bf16 MFU | 124813 tok/s step 5483/19560 | loss 3.572887 (+0.85z)| norm 0.2794 (+0.74z)| lr 6.90e-04 | 4165.89 ms | 32.4% bf16 MFU | 124865 tok/s step 5484/19560 | loss 3.576114 (+0.92z)| norm 0.2488 (-0.72z)| lr 6.90e-04 | 4361.08 ms | 31.0% bf16 MFU | 124632 tok/s step 5485/19560 | loss 3.523185 (-0.44z)| norm 0.2957 (+1.52z)| lr 6.90e-04 | 4211.31 ms | 32.1% bf16 MFU | 124626 tok/s step 5486/19560 | loss 3.581094 (+1.05z)| norm 0.2518 (-0.57z)| lr 6.90e-04 | 4222.57 ms | 32.0% bf16 MFU | 124602 tok/s step 5487/19560 | loss 3.531218 (-0.23z)| norm 0.2587 (-0.24z)| lr 6.90e-04 | 4170.41 ms | 32.4% bf16 MFU | 124658 tok/s step 5488/19560 | loss 3.549603 (+0.24z)| norm 0.2629 (-0.04z)| lr 6.90e-04 | 4197.30 ms | 32.2% bf16 MFU | 124671 tok/s step 5489/19560 | loss 3.540023 (-0.01z)| norm 0.2885 (+1.18z)| lr 
6.90e-04 | 4168.16 ms | 32.4% bf16 MFU | 124726 tok/s step 5490/19560 | loss 3.506944 (-0.85z)| norm 0.2677 (+0.18z)| lr 6.90e-04 | 4172.63 ms | 32.4% bf16 MFU | 124773 tok/s step 5491/19560 | loss 3.536479 (-0.08z)| norm 0.2650 (+0.05z)| lr 6.90e-04 | 4171.57 ms | 32.4% bf16 MFU | 124818 tok/s step 5492/19560 | loss 3.529501 (-0.28z)| norm 0.2937 (+1.41z)| lr 6.90e-04 | 4180.70 ms | 32.3% bf16 MFU | 124847 tok/s step 5493/19560 | loss 3.525759 (-0.37z)| norm 0.2666 (+0.12z)| lr 6.90e-04 | 4231.28 ms | 31.9% bf16 MFU | 124800 tok/s step 5494/19560 | loss 3.539091 (+0.01z)| norm 0.2511 (-0.62z)| lr 6.90e-04 | 4249.59 ms | 31.8% bf16 MFU | 124729 tok/s step 5495/19560 | loss 3.522876 (-0.46z)| norm 0.2603 (-0.18z)| lr 6.90e-04 | 4218.89 ms | 32.0% bf16 MFU | 124706 tok/s step 5496/19560 | loss 3.559555 (+0.60z)| norm 0.2478 (-0.76z)| lr 6.90e-04 | 4169.27 ms | 32.4% bf16 MFU | 124758 tok/s step 5497/19560 | loss 3.584395 (+1.30z)| norm 0.2417 (-1.04z)| lr 6.90e-04 | 4170.48 ms | 32.4% bf16 MFU | 124806 tok/s step 5498/19560 | loss 3.542437 (+0.09z)| norm 0.2593 (-0.19z)| lr 6.90e-04 | 4210.51 ms | 32.1% bf16 MFU | 124792 tok/s step 5499/19560 | loss 3.553396 (+0.43z)| norm 0.2435 (-0.95z)| lr 6.90e-04 | 4172.48 ms | 32.4% bf16 MFU | 124835 tok/s step 5500/19560 | loss 3.570433 (+0.93z)| norm 0.2511 (-0.56z)| lr 6.90e-04 | 4168.02 ms | 32.4% bf16 MFU | 124883 tok/s val loss 3.519167
HellaSwag: 2787/10042 = 0.277534 step 5501/19560 | loss 3.568386 (+0.85z)| norm 0.2518 (-0.54z)| lr 6.89e-04 | 4186.99 ms | 32.2% bf16 MFU | 124899 tok/s step 5502/19560 | loss 3.542474 (+0.09z)| norm 0.2255 (-1.82z)| lr 6.89e-04 | 4237.97 ms | 31.9% bf16 MFU | 124840 tok/s step 5503/19560 | loss 3.538556 (-0.02z)| norm 0.2551 (-0.36z)| lr 6.89e-04 | 4444.61 ms | 30.4% bf16 MFU | 124496 tok/s step 5504/19560 | loss 3.538248 (-0.02z)| norm 0.2360 (-1.30z)| lr 6.89e-04 | 4177.56 ms | 32.3% bf16 MFU | 124546 tok/s step 5505/19560 | loss 3.618828 (+2.30z)| norm 0.2625 (+0.02z)| lr 6.89e-04 | 4170.60 ms | 32.4% bf16 MFU | 124605 tok/s step 5506/19560 | loss
3.526687 (-0.38z)| norm 0.2551 (-0.35z)| lr 6.89e-04 | 4234.84 ms | 31.9% bf16 MFU | 124564 tok/s step 5507/19560 | loss 3.543533 (+0.09z)| norm 0.2602 (-0.11z)| lr 6.89e-04 | 4174.74 ms | 32.3% bf16 MFU | 124616 tok/s step 5508/19560 | loss 3.520997 (-0.59z)| norm 0.2590 (-0.17z)| lr 6.89e-04 | 4171.83 ms | 32.4% bf16 MFU | 124668 tok/s step 5509/19560 | loss 3.569791 (+0.86z)| norm 0.2741 (+0.57z)| lr 6.89e-04 | 4174.32 ms | 32.3% bf16 MFU | 124715 tok/s step 5510/19560 | loss 3.549057 (+0.25z)| norm 0.2780 (+0.77z)| lr 6.89e-04 | 4179.12 ms | 32.3% bf16 MFU | 124752 tok/s step 5511/19560 | loss 3.571199 (+0.91z)| norm 0.2622 (-0.01z)| lr 6.89e-04 | 4175.57 ms | 32.3% bf16 MFU | 124792 tok/s step 5512/19560 | loss 3.559776 (+0.56z)| norm 0.2637 (+0.07z)| lr 6.89e-04 | 4177.31 ms | 32.3% bf16 MFU | 124828 tok/s step 5513/19560 | loss 3.492332 (-1.47z)| norm 0.2558 (-0.32z)| lr 6.89e-04 | 4167.49 ms | 32.4% bf16 MFU | 124877 tok/s step 5514/19560 | loss 3.672950 (+3.73z)| norm 0.2626 (+0.05z)| lr 6.89e-04 | 4176.91 ms | 32.3% bf16 MFU | 124909 tok/s step 5515/19560 | loss 3.519491 (-0.65z)| norm 0.2596 (-0.11z)| lr 6.89e-04 | 4202.52 ms | 32.1% bf16 MFU | 124901 tok/s step 5516/19560 | loss 3.547172 (+0.14z)| norm 0.2856 (+1.25z)| lr 6.89e-04 | 4170.17 ms | 32.4% bf16 MFU | 124943 tok/s step 5517/19560 | loss 3.473099 (-1.94z)| norm 0.2924 (+1.59z)| lr 6.89e-04 | 4171.62 ms | 32.4% bf16 MFU | 124979 tok/s step 5518/19560 | loss 3.537100 (-0.15z)| norm 0.2550 (-0.35z)| lr 6.89e-04 | 4168.98 ms | 32.4% bf16 MFU | 125018 tok/s step 5519/19560 | loss 3.537750 (-0.13z)| norm 0.2741 (+0.67z)| lr 6.89e-04 | 4173.41 ms | 32.4% bf16 MFU | 125049 tok/s step 5520/19560 | loss 3.619654 (+2.15z)| norm 0.2600 (-0.07z)| lr 6.89e-04 | 4216.53 ms | 32.0% bf16 MFU | 125013 tok/s step 5521/19560 | loss 3.512487 (-0.85z)| norm 0.2885 (+1.44z)| lr 6.89e-04 | 4173.00 ms | 32.4% bf16 MFU | 125045 tok/s step 5522/19560 | loss 3.557903 (+0.42z)| norm 0.2734 (+0.63z)| lr 6.88e-04 | 4184.59 
ms | 32.3% bf16 MFU | 125057 tok/s step 5523/19560 | loss 3.575401 (+0.90z)| norm 0.2898 (+1.53z)| lr 6.88e-04 | 4208.43 ms | 32.1% bf16 MFU | 125033 tok/s step 5524/19560 | loss 3.505443 (-1.06z)| norm 0.3103 (+2.55z)| lr 6.88e-04 | 4169.53 ms | 32.4% bf16 MFU | 125069 tok/s step 5525/19560 | loss 3.569326 (+0.74z)| norm 0.2598 (-0.11z)| lr 6.88e-04 | 4176.54 ms | 32.3% bf16 MFU | 125092 tok/s step 5526/19560 | loss 3.509885 (-0.92z)| norm 0.2501 (-0.61z)| lr 6.88e-04 | 4187.66 ms | 32.2% bf16 MFU | 125097 tok/s step 5527/19560 | loss 3.547753 (+0.13z)| norm 0.2569 (-0.26z)| lr 6.88e-04 | 4168.02 ms | 32.4% bf16 MFU | 125132 tok/s step 5528/19560 | loss 3.618501 (+2.08z)| norm 0.2660 (+0.21z)| lr 6.88e-04 | 4176.56 ms | 32.3% bf16 MFU | 125152 tok/s step 5529/19560 | loss 3.513684 (-0.85z)| norm 0.2954 (+1.73z)| lr 6.88e-04 | 4170.51 ms | 32.4% bf16 MFU | 125180 tok/s step 5530/19560 | loss 3.500893 (-1.19z)| norm 0.2610 (-0.09z)| lr 6.88e-04 | 4393.69 ms | 30.7% bf16 MFU | 124887 tok/s step 5531/19560 | loss 3.541072 (-0.06z)| norm 0.2775 (+0.77z)| lr 6.88e-04 | 4172.40 ms | 32.4% bf16 MFU | 124926 tok/s step 5532/19560 | loss 3.565051 (+0.60z)| norm 0.2608 (-0.13z)| lr 6.88e-04 | 4169.35 ms | 32.4% bf16 MFU | 124967 tok/s step 5533/19560 | loss 3.533366 (-0.27z)| norm 0.2592 (-0.23z)| lr 6.88e-04 | 4167.38 ms | 32.4% bf16 MFU | 125009 tok/s step 5534/19560 | loss 3.511523 (-0.89z)| norm 0.2664 (+0.16z)| lr 6.88e-04 | 4171.53 ms | 32.4% bf16 MFU | 125042 tok/s step 5535/19560 | loss 3.636920 (+2.57z)| norm 0.2734 (+0.54z)| lr 6.88e-04 | 4170.39 ms | 32.4% bf16 MFU | 125076 tok/s step 5536/19560 | loss 3.477345 (-1.81z)| norm 0.2685 (+0.27z)| lr 6.88e-04 | 4162.61 ms | 32.4% bf16 MFU | 125120 tok/s step 5537/19560 | loss 3.617716 (+1.98z)| norm 0.2855 (+1.19z)| lr 6.88e-04 | 4180.30 ms | 32.3% bf16 MFU | 125135 tok/s step 5538/19560 | loss 3.564780 (+0.55z)| norm 0.2609 (-0.14z)| lr 6.88e-04 | 4263.42 ms | 31.7% bf16 MFU | 125027 tok/s step 5539/19560 | loss 
3.523854 (-0.56z)| norm 0.2846 (+1.21z)| lr 6.88e-04 | 4175.56 ms | 32.3% bf16 MFU | 125053 tok/s step 5540/19560 | loss 3.518929 (-0.68z)| norm 0.2523 (-0.61z)| lr 6.88e-04 | 4177.73 ms | 32.3% bf16 MFU | 125076 tok/s step 5541/19560 | loss 3.612073 (+1.79z)| norm 0.2797 (+0.96z)| lr 6.88e-04 | 4163.09 ms | 32.4% bf16 MFU | 125119 tok/s step 5542/19560 | loss 3.546263 (+0.04z)| norm 0.3043 (+2.30z)| lr 6.88e-04 | 4165.87 ms | 32.4% bf16 MFU | 125155 tok/s step 5543/19560 | loss 3.555654 (+0.28z)| norm 0.2480 (-0.85z)| lr 6.87e-04 | 4168.81 ms | 32.4% bf16 MFU | 125186 tok/s step 5544/19560 | loss 3.539369 (-0.15z)| norm 0.2737 (+0.58z)| lr 6.87e-04 | 4167.20 ms | 32.4% bf16 MFU | 125217 tok/s step 5545/19560 | loss 3.575705 (+0.84z)| norm 0.2920 (+1.58z)| lr 6.87e-04 | 4182.19 ms | 32.3% bf16 MFU | 125224 tok/s step 5546/19560 | loss 3.580027 (+0.95z)| norm 0.2753 (+0.65z)| lr 6.87e-04 | 4172.83 ms | 32.4% bf16 MFU | 125245 tok/s step 5547/19560 | loss 3.521600 (-0.62z)| norm 0.2646 (+0.05z)| lr 6.87e-04 | 4216.66 ms | 32.0% bf16 MFU | 125200 tok/s step 5548/19560 | loss 3.518766 (-0.70z)| norm 0.2820 (+1.01z)| lr 6.87e-04 | 4169.75 ms | 32.4% bf16 MFU | 125227 tok/s step 5549/19560 | loss 3.546464 (+0.04z)| norm 0.2705 (+0.36z)| lr 6.87e-04 | 4170.49 ms | 32.4% bf16 MFU | 125251 tok/s step 5550/19560 | loss 3.548441 (+0.08z)| norm 0.2599 (-0.24z)| lr 6.87e-04 | 4176.78 ms | 32.3% bf16 MFU | 125265 tok/s step 5551/19560 | loss 3.495151 (-1.35z)| norm 0.2657 (+0.09z)| lr 6.87e-04 | 4173.16 ms | 32.4% bf16 MFU | 125283 tok/s step 5552/19560 | loss 3.591220 (+1.27z)| norm 0.2277 (-2.01z)| lr 6.87e-04 | 4169.83 ms | 32.4% bf16 MFU | 125306 tok/s step 5553/19560 | loss 3.527799 (-0.45z)| norm 0.3288 (+3.45z)| lr 6.87e-04 | 4168.90 ms | 32.4% bf16 MFU | 125329 tok/s step 5554/19560 | loss 3.512597 (-0.87z)| norm 0.2526 (-0.62z)| lr 6.87e-04 | 4170.54 ms | 32.4% bf16 MFU | 125348 tok/s step 5555/19560 | loss 3.508274 (-0.99z)| norm 0.2369 (-1.45z)| lr 6.87e-04 | 4165.34 
ms | 32.4% bf16 MFU | 125374 tok/s step 5556/19560 | loss 3.526186 (-0.50z)| norm 0.2456 (-0.97z)| lr 6.87e-04 | 4164.39 ms | 32.4% bf16 MFU | 125400 tok/s step 5557/19560 | loss 3.554444 (+0.28z)| norm 0.2455 (-0.97z)| lr 6.87e-04 | 4172.60 ms | 32.4% bf16 MFU | 125413 tok/s step 5558/19560 | loss 3.573587 (+0.80z)| norm 0.2607 (-0.16z)| lr 6.87e-04 | 4175.87 ms | 32.3% bf16 MFU | 125419 tok/s step 5559/19560 | loss 3.482318 (-1.68z)| norm 0.2581 (-0.29z)| lr 6.87e-04 | 4160.31 ms | 32.5% bf16 MFU | 125450 tok/s step 5560/19560 | loss 3.519671 (-0.65z)| norm 0.2361 (-1.46z)| lr 6.87e-04 | 4168.41 ms | 32.4% bf16 MFU | 125466 tok/s step 5561/19560 | loss 3.534776 (-0.24z)| norm 0.2513 (-0.66z)| lr 6.87e-04 | 4168.10 ms | 32.4% bf16 MFU | 125482 tok/s step 5562/19560 | loss 3.544353 (+0.03z)| norm 0.2585 (-0.27z)| lr 6.87e-04 | 4245.42 ms | 31.8% bf16 MFU | 125383 tok/s step 5563/19560 | loss 3.557942 (+0.39z)| norm 0.2308 (-1.73z)| lr 6.87e-04 | 4167.73 ms | 32.4% bf16 MFU | 125403 tok/s step 5564/19560 | loss 3.501267 (-1.13z)| norm 0.2308 (-1.70z)| lr 6.87e-04 | 4169.02 ms | 32.4% bf16 MFU | 125421 tok/s step 5565/19560 | loss 3.492519 (-1.36z)| norm 0.2399 (-1.22z)| lr 6.86e-04 | 4170.75 ms | 32.4% bf16 MFU | 125435 tok/s step 5566/19560 | loss 3.546715 (+0.11z)| norm 0.2538 (-0.50z)| lr 6.86e-04 | 4171.79 ms | 32.4% bf16 MFU | 125447 tok/s step 5567/19560 | loss 3.540548 (-0.05z)| norm 0.2944 (+1.61z)| lr 6.86e-04 | 4156.96 ms | 32.5% bf16 MFU | 125481 tok/s step 5568/19560 | loss 3.646237 (+2.70z)| norm 0.2649 (+0.06z)| lr 6.86e-04 | 4173.35 ms | 32.4% bf16 MFU | 125488 tok/s step 5569/19560 | loss 3.507894 (-1.00z)| norm 0.2823 (+0.96z)| lr 6.86e-04 | 4181.87 ms | 32.3% bf16 MFU | 125483 tok/s step 5570/19560 | loss 3.526402 (-0.48z)| norm 0.2593 (-0.26z)| lr 6.86e-04 | 4173.11 ms | 32.4% bf16 MFU | 125490 tok/s step 5571/19560 | loss 3.589349 (+1.22z)| norm 0.2686 (+0.22z)| lr 6.86e-04 | 4207.56 ms | 32.1% bf16 MFU | 125446 tok/s step 5572/19560 | loss 
3.519962 (-0.67z)| norm 0.2524 (-0.64z)| lr 6.86e-04 | 4169.20 ms | 32.4% bf16 MFU | 125461 tok/s step 5573/19560 | loss 3.568067 (+0.63z)| norm 0.2591 (-0.27z)| lr 6.86e-04 | 4172.94 ms | 32.4% bf16 MFU | 125470 tok/s step 5574/19560 | loss 3.508416 (-0.97z)| norm 0.2460 (-0.97z)| lr 6.86e-04 | 4167.34 ms | 32.4% bf16 MFU | 125487 tok/s step 5575/19560 | loss 3.502410 (-1.14z)| norm 0.2743 (+0.54z)| lr 6.86e-04 | 4180.18 ms | 32.3% bf16 MFU | 125484 tok/s step 5576/19560 | loss 3.505359 (-1.06z)| norm 0.2641 (-0.01z)| lr 6.86e-04 | 4162.62 ms | 32.4% bf16 MFU | 125507 tok/s step 5577/19560 | loss 3.554682 (+0.32z)| norm 0.2445 (-1.06z)| lr 6.86e-04 | 4167.41 ms | 32.4% bf16 MFU | 125522 tok/s step 5578/19560 | loss 3.501454 (-1.15z)| norm 0.2607 (-0.18z)| lr 6.86e-04 | 4158.80 ms | 32.5% bf16 MFU | 125549 tok/s step 5579/19560 | loss 3.495896 (-1.29z)| norm 0.2665 (+0.13z)| lr 6.86e-04 | 4163.89 ms | 32.4% bf16 MFU | 125568 tok/s step 5580/19560 | loss 3.520502 (-0.61z)| norm 0.2291 (-1.86z)| lr 6.86e-04 | 4162.59 ms | 32.4% bf16 MFU | 125587 tok/s step 5581/19560 | loss 3.507210 (-0.97z)| norm 0.2487 (-0.80z)| lr 6.86e-04 | 4176.61 ms | 32.3% bf16 MFU | 125584 tok/s step 5582/19560 | loss 3.565505 (+0.64z)| norm 0.2634 (-0.02z)| lr 6.86e-04 | 4165.03 ms | 32.4% bf16 MFU | 125599 tok/s step 5583/19560 | loss 3.530552 (-0.32z)| norm 0.2527 (-0.58z)| lr 6.86e-04 | 4162.78 ms | 32.4% bf16 MFU | 125616 tok/s step 5584/19560 | loss 3.603490 (+1.66z)| norm 0.2427 (-1.10z)| lr 6.86e-04 | 4185.65 ms | 32.3% bf16 MFU | 125598 tok/s step 5585/19560 | loss 3.564948 (+0.60z)| norm 0.2693 (+0.30z)| lr 6.86e-04 | 4167.14 ms | 32.4% bf16 MFU | 125609 tok/s step 5586/19560 | loss 3.549051 (+0.17z)| norm 0.2699 (+0.35z)| lr 6.85e-04 | 4166.81 ms | 32.4% bf16 MFU | 125620 tok/s step 5587/19560 | loss 3.526655 (-0.44z)| norm 0.2445 (-1.00z)| lr 6.85e-04 | 4173.62 ms | 32.4% bf16 MFU | 125620 tok/s step 5588/19560 | loss 3.537674 (-0.13z)| norm 0.2600 (-0.17z)| lr 6.85e-04 | 4162.68 
ms | 32.4% bf16 MFU | 125636 tok/s step 5589/19560 | loss 3.538144 (-0.11z)| norm 0.2399 (-1.23z)| lr 6.85e-04 | 4181.44 ms | 32.3% bf16 MFU | 125624 tok/s step 5590/19560 | loss 3.444040 (-2.64z)| norm 0.2552 (-0.42z)| lr 6.85e-04 | 4162.62 ms | 32.4% bf16 MFU | 125640 tok/s step 5591/19560 | loss 3.527178 (-0.39z)| norm 0.2461 (-0.91z)| lr 6.85e-04 | 4171.92 ms | 32.4% bf16 MFU | 125642 tok/s step 5592/19560 | loss 3.533005 (-0.24z)| norm 0.2383 (-1.32z)| lr 6.85e-04 | 4166.13 ms | 32.4% bf16 MFU | 125652 tok/s step 5593/19560 | loss 3.514018 (-0.76z)| norm 0.2287 (-1.79z)| lr 6.85e-04 | 4169.33 ms | 32.4% bf16 MFU | 125657 tok/s step 5594/19560 | loss 3.551190 (+0.25z)| norm 0.2643 (+0.09z)| lr 6.85e-04 | 4164.73 ms | 32.4% bf16 MFU | 125668 tok/s step 5595/19560 | loss 3.566151 (+0.65z)| norm 0.2328 (-1.55z)| lr 6.85e-04 | 4165.07 ms | 32.4% bf16 MFU | 125679 tok/s step 5596/19560 | loss 3.511825 (-0.83z)| norm 0.2666 (+0.21z)| lr 6.85e-04 | 4171.68 ms | 32.4% bf16 MFU | 125679 tok/s step 5597/19560 | loss 3.536588 (-0.16z)| norm 0.2536 (-0.46z)| lr 6.85e-04 | 4196.29 ms | 32.2% bf16 MFU | 125642 tok/s step 5598/19560 | loss 3.575889 (+0.90z)| norm 0.2704 (+0.41z)| lr 6.85e-04 | 4162.53 ms | 32.4% bf16 MFU | 125657 tok/s step 5599/19560 | loss 3.547785 (+0.14z)| norm 0.2800 (+0.91z)| lr 6.85e-04 | 4161.24 ms | 32.4% bf16 MFU | 125674 tok/s step 5600/19560 | loss 3.622880 (+2.13z)| norm 0.2830 (+1.05z)| lr 6.85e-04 | 4158.41 ms | 32.5% bf16 MFU | 125694 tok/s step 5601/19560 | loss 3.531787 (-0.32z)| norm 0.2480 (-0.76z)| lr 6.85e-04 | 4214.73 ms | 32.0% bf16 MFU | 125629 tok/s step 5602/19560 | loss 3.532639 (-0.32z)| norm 0.2424 (-1.05z)| lr 6.85e-04 | 4200.12 ms | 32.1% bf16 MFU | 125589 tok/s step 5603/19560 | loss 3.554820 (+0.30z)| norm 0.2326 (-1.55z)| lr 6.85e-04 | 4161.81 ms | 32.4% bf16 MFU | 125609 tok/s step 5604/19560 | loss 3.526068 (-0.50z)| norm 0.2634 (+0.04z)| lr 6.85e-04 | 4167.60 ms | 32.4% bf16 MFU | 125618 tok/s step 5605/19560 | loss 
3.538900 (-0.14z)| norm 0.3128 (+2.54z)| lr 6.85e-04 | 4156.75 ms | 32.5% bf16 MFU | 125644 tok/s step 5606/19560 | loss 3.594981 (+1.38z)| norm 1.1869 (+10.96z)| lr 6.85e-04 | 4174.56 ms | 32.3% bf16 MFU | 125641 tok/s step 5607/19560 | loss 3.618163 (+1.97z)| norm 0.3545 (+1.00z)| lr 6.84e-04 | 4165.14 ms | 32.4% bf16 MFU | 125653 tok/s step 5608/19560 | loss 3.504675 (-1.08z)| norm 0.3665 (+1.13z)| lr 6.84e-04 | 4186.14 ms | 32.3% bf16 MFU | 125632 tok/s step 5609/19560 | loss 3.586877 (+1.12z)| norm 0.3280 (+0.66z)| lr 6.84e-04 | 4203.34 ms | 32.1% bf16 MFU | 125587 tok/s step 5610/19560 | loss 3.563439 (+0.51z)| norm 0.3351 (+0.75z)| lr 6.84e-04 | 4189.35 ms | 32.2% bf16 MFU | 125565 tok/s step 5611/19560 | loss 3.550712 (+0.17z)| norm 0.3033 (+0.37z)| lr 6.84e-04 | 4188.61 ms | 32.2% bf16 MFU | 125546 tok/s step 5612/19560 | loss 3.622047 (+2.08z)| norm 0.3266 (+0.63z)| lr 6.84e-04 | 4205.93 ms | 32.1% bf16 MFU | 125501 tok/s step 5613/19560 | loss 3.627336 (+2.16z)| norm 0.2626 (-0.12z)| lr 6.84e-04 | 4168.37 ms | 32.4% bf16 MFU | 125515 tok/s step 5614/19560 | loss 3.563622 (+0.48z)| norm 0.2647 (-0.09z)| lr 6.84e-04 | 4196.47 ms | 32.2% bf16 MFU | 125486 tok/s step 5615/19560 | loss 3.563291 (+0.47z)| norm 0.2490 (-0.28z)| lr 6.84e-04 | 4174.29 ms | 32.3% bf16 MFU | 125492 tok/s step 5616/19560 | loss 3.573252 (+0.72z)| norm 0.2358 (-0.43z)| lr 6.84e-04 | 4155.21 ms | 32.5% bf16 MFU | 125526 tok/s step 5617/19560 | loss 3.482239 (-1.65z)| norm 0.2607 (-0.14z)| lr 6.84e-04 | 4157.69 ms | 32.5% bf16 MFU | 125555 tok/s step 5618/19560 | loss 3.597757 (+1.34z)| norm 0.2386 (-0.39z)| lr 6.84e-04 | 4166.15 ms | 32.4% bf16 MFU | 125569 tok/s step 5619/19560 | loss 3.520386 (-0.66z)| norm 0.2518 (-0.24z)| lr 6.84e-04 | 4184.82 ms | 32.3% bf16 MFU | 125555 tok/s step 5620/19560 | loss 3.550261 (+0.11z)| norm 0.2865 (+0.17z)| lr 6.84e-04 | 4161.13 ms | 32.4% bf16 MFU | 125577 tok/s step 5621/19560 | loss 3.531301 (-0.39z)| norm 0.2694 (-0.03z)| lr 6.84e-04 | 4189.97 
ms | 32.2% bf16 MFU | 125554 tok/s step 5622/19560 | loss 3.517290 (-0.75z)| norm 0.2346 (-0.44z)| lr 6.84e-04 | 4190.16 ms | 32.2% bf16 MFU | 125533 tok/s step 5623/19560 | loss 3.562491 (+0.42z)| norm 0.2686 (-0.04z)| lr 6.84e-04 | 4163.60 ms | 32.4% bf16 MFU | 125552 tok/s step 5624/19560 | loss 3.558378 (+0.31z)| norm 0.2358 (-0.42z)| lr 6.84e-04 | 4170.51 ms | 32.4% bf16 MFU | 125560 tok/s step 5625/19560 | loss 3.518099 (-0.72z)| norm 0.2788 (+0.08z)| lr 6.84e-04 | 4169.15 ms | 32.4% bf16 MFU | 125570 tok/s step 5626/19560 | loss 3.536427 (-0.24z)| norm 0.2319 (-0.47z)| lr 6.84e-04 | 4168.32 ms | 32.4% bf16 MFU | 125581 tok/s step 5627/19560 | loss 3.547664 (+0.05z)| norm 0.2734 (+0.02z)| lr 6.84e-04 | 4169.33 ms | 32.4% bf16 MFU | 125589 tok/s step 5628/19560 | loss 3.544192 (-0.03z)| norm 0.2470 (-0.29z)| lr 6.83e-04 | 4161.46 ms | 32.4% bf16 MFU | 125609 tok/s step 5629/19560 | loss 3.548294 (+0.08z)| norm 0.2435 (-0.33z)| lr 6.83e-04 | 4159.27 ms | 32.5% bf16 MFU | 125631 tok/s step 5630/19560 | loss 3.642523 (+2.46z)| norm 0.2564 (-0.19z)| lr 6.83e-04 | 4175.88 ms | 32.3% bf16 MFU | 125627 tok/s step 5631/19560 | loss 3.542492 (-0.09z)| norm 0.2373 (-0.41z)| lr 6.83e-04 | 4275.09 ms | 31.6% bf16 MFU | 125478 tok/s step 5632/19560 | loss 3.518596 (-0.70z)| norm 0.2582 (-0.16z)| lr 6.83e-04 | 4170.00 ms | 32.4% bf16 MFU | 125490 tok/s step 5633/19560 | loss 3.520013 (-0.65z)| norm 0.2411 (-0.36z)| lr 6.83e-04 | 4170.70 ms | 32.4% bf16 MFU | 125501 tok/s step 5634/19560 | loss 3.582708 (+0.95z)| norm 0.2526 (-0.23z)| lr 6.83e-04 | 4192.78 ms | 32.2% bf16 MFU | 125478 tok/s step 5635/19560 | loss 3.542240 (-0.09z)| norm 0.2548 (-0.20z)| lr 6.83e-04 | 4160.49 ms | 32.5% bf16 MFU | 125505 tok/s step 5636/19560 | loss 3.531590 (-0.36z)| norm 0.2504 (-0.25z)| lr 6.83e-04 | 4612.45 ms | 29.3% bf16 MFU | 124913 tok/s step 5637/19560 | loss 3.484303 (-1.55z)| norm 0.2349 (-0.43z)| lr 6.83e-04 | 4588.83 ms | 29.4% bf16 MFU | 124380 tok/s step 5638/19560 | loss 
3.520625 (-0.62z)| norm 0.2446 (-0.31z)| lr 6.83e-04 | 4522.17 ms | 29.9% bf16 MFU | 123958 tok/s step 5639/19560 | loss 3.535215 (-0.24z)| norm 0.2374 (-0.40z)| lr 6.83e-04 | 4280.02 ms | 31.5% bf16 MFU | 123885 tok/s step 5640/19560 | loss 3.558447 (+0.35z)| norm 0.2620 (-0.11z)| lr 6.83e-04 | 4189.91 ms | 32.2% bf16 MFU | 123947 tok/s step 5641/19560 | loss 3.509487 (-0.90z)| norm 0.2928 (+0.25z)| lr 6.83e-04 | 4430.69 ms | 30.5% bf16 MFU | 123666 tok/s step 5642/19560 | loss 3.545471 (+0.05z)| norm 0.2780 (+0.08z)| lr 6.83e-04 | 4242.19 ms | 31.8% bf16 MFU | 123663 tok/s step 5643/19560 | loss 3.475465 (-1.80z)| norm 0.2690 (-0.03z)| lr 6.83e-04 | 4228.54 ms | 31.9% bf16 MFU | 123679 tok/s step 5644/19560 | loss 3.495523 (-1.25z)| norm 0.2671 (-0.05z)| lr 6.83e-04 | 4292.35 ms | 31.5% bf16 MFU | 123602 tok/s step 5645/19560 | loss 3.508277 (-0.93z)| norm 0.3042 (+0.39z)| lr 6.83e-04 | 4167.68 ms | 32.4% bf16 MFU | 123712 tok/s step 5646/19560 | loss 3.467818 (-1.96z)| norm 0.2792 (+0.09z)| lr 6.83e-04 | 4184.86 ms | 32.3% bf16 MFU | 123790 tok/s step 5647/19560 | loss 3.518604 (-0.62z)| norm 0.2763 (+0.05z)| lr 6.83e-04 | 4238.27 ms | 31.9% bf16 MFU | 123786 tok/s step 5648/19560 | loss 3.520707 (-0.56z)| norm 0.2633 (-0.10z)| lr 6.83e-04 | 4547.60 ms | 29.7% bf16 MFU | 123361 tok/s step 5649/19560 | loss 3.480840 (-1.60z)| norm 0.2341 (-0.44z)| lr 6.82e-04 | 4170.74 ms | 32.4% bf16 MFU | 123479 tok/s step 5650/19560 | loss 3.456134 (-2.18z)| norm 0.2552 (-0.19z)| lr 6.82e-04 | 4203.52 ms | 32.1% bf16 MFU | 123541 tok/s step 5651/19560 | loss 3.549092 (+0.22z)| norm 0.2729 (+0.02z)| lr 6.82e-04 | 4223.68 ms | 32.0% bf16 MFU | 123570 tok/s step 5652/19560 | loss 3.475557 (-1.66z)| norm 0.2439 (-0.31z)| lr 6.82e-04 | 4222.65 ms | 32.0% bf16 MFU | 123600 tok/s step 5653/19560 | loss 3.476667 (-1.60z)| norm 0.2408 (-0.35z)| lr 6.82e-04 | 4212.02 ms | 32.1% bf16 MFU | 123644 tok/s step 5654/19560 | loss 3.482579 (-1.44z)| norm 0.2446 (-0.30z)| lr 6.82e-04 | 4165.06 
ms | 32.4% bf16 MFU | 123755 tok/s step 5655/19560 | loss 3.498133 (-1.03z)| norm 0.2515 (-0.22z)| lr 6.82e-04 | 4183.19 ms | 32.3% bf16 MFU | 123834 tok/s step 5656/19560 | loss 3.536298 (-0.05z)| norm 0.2407 (-0.34z)| lr 6.82e-04 | 4224.74 ms | 32.0% bf16 MFU | 123847 tok/s step 5657/19560 | loss 3.560647 (+0.56z)| norm 0.2271 (-0.50z)| lr 6.82e-04 | 4208.82 ms | 32.1% bf16 MFU | 123883 tok/s step 5658/19560 | loss 3.466132 (-1.84z)| norm 0.2563 (-0.15z)| lr 6.82e-04 | 4215.96 ms | 32.0% bf16 MFU | 123907 tok/s step 5659/19560 | loss 3.620514 (+2.04z)| norm 0.2355 (-0.40z)| lr 6.82e-04 | 4173.29 ms | 32.4% bf16 MFU | 123993 tok/s step 5660/19560 | loss 3.489236 (-1.22z)| norm 0.2414 (-0.32z)| lr 6.82e-04 | 4224.69 ms | 32.0% bf16 MFU | 123999 tok/s step 5661/19560 | loss 3.483861 (-1.34z)| norm 0.2462 (-0.27z)| lr 6.82e-04 | 4186.69 ms | 32.2% bf16 MFU | 124060 tok/s step 5662/19560 | loss 3.531400 (-0.17z)| norm 0.2754 (+0.07z)| lr 6.82e-04 | 4173.56 ms | 32.4% bf16 MFU | 124138 tok/s step 5663/19560 | loss 3.533720 (-0.09z)| norm 0.2711 (+0.02z)| lr 6.82e-04 | 4233.57 ms | 31.9% bf16 MFU | 124123 tok/s step 5664/19560 | loss 3.459815 (-1.95z)| norm 0.2533 (-0.18z)| lr 6.82e-04 | 4215.94 ms | 32.0% bf16 MFU | 124135 tok/s step 5665/19560 | loss 3.506678 (-0.76z)| norm 0.3430 (+0.86z)| lr 6.82e-04 | 4171.65 ms | 32.4% bf16 MFU | 124212 tok/s step 5666/19560 | loss 3.413156 (-3.01z)| norm 0.2803 (+0.13z)| lr 6.82e-04 | 4181.41 ms | 32.3% bf16 MFU | 124271 tok/s step 5667/19560 | loss 3.512680 (-0.55z)| norm 0.2702 (+0.01z)| lr 6.82e-04 | 4165.83 ms | 32.4% bf16 MFU | 124350 tok/s step 5668/19560 | loss 3.570823 (+0.87z)| norm 0.2619 (-0.09z)| lr 6.82e-04 | 4180.70 ms | 32.3% bf16 MFU | 124403 tok/s step 5669/19560 | loss 3.551215 (+0.40z)| norm 0.2369 (-0.38z)| lr 6.81e-04 | 4220.16 ms | 32.0% bf16 MFU | 124394 tok/s step 5670/19560 | loss 3.533674 (-0.03z)| norm 0.2470 (-0.25z)| lr 6.81e-04 | 4170.49 ms | 32.4% bf16 MFU | 124460 tok/s step 5671/19560 | loss 
3.562841 (+0.69z)| norm 0.2723 (+0.04z)| lr 6.81e-04 | 4168.31 ms | 32.4% bf16 MFU | 124526 tok/s step 5672/19560 | loss 3.482071 (-1.30z)| norm 0.3063 (+0.43z)| lr 6.81e-04 | 4171.36 ms | 32.4% bf16 MFU | 124584 tok/s step 5673/19560 | loss 3.684272 (+3.52z)| norm 0.2282 (-0.47z)| lr 6.81e-04 | 4194.62 ms | 32.2% bf16 MFU | 124605 tok/s step 5674/19560 | loss 3.537745 (+0.06z)| norm 0.3040 (+0.41z)| lr 6.81e-04 | 4166.12 ms | 32.4% bf16 MFU | 124667 tok/s step 5675/19560 | loss 3.534070 (-0.03z)| norm 0.3305 (+0.71z)| lr 6.81e-04 | 4231.90 ms | 31.9% bf16 MFU | 124628 tok/s step 5676/19560 | loss 3.497137 (-0.90z)| norm 0.2702 (+0.01z)| lr 6.81e-04 | 4179.76 ms | 32.3% bf16 MFU | 124668 tok/s step 5677/19560 | loss 3.508307 (-0.62z)| norm 0.2797 (+0.12z)| lr 6.81e-04 | 4175.54 ms | 32.3% bf16 MFU | 124713 tok/s step 5678/19560 | loss 3.514297 (-0.48z)| norm 0.2862 (+0.19z)| lr 6.81e-04 | 4176.71 ms | 32.3% bf16 MFU | 124754 tok/s step 5679/19560 | loss 3.535128 (+0.01z)| norm 0.2720 (+0.03z)| lr 6.81e-04 | 4165.84 ms | 32.4% bf16 MFU | 124809 tok/s step 5680/19560 | loss 3.461429 (-1.71z)| norm 0.2539 (-0.19z)| lr 6.81e-04 | 4193.52 ms | 32.2% bf16 MFU | 124819 tok/s step 5681/19560 | loss 3.553275 (+0.46z)| norm 0.2742 (+0.06z)| lr 6.81e-04 | 4194.00 ms | 32.2% bf16 MFU | 124829 tok/s step 5682/19560 | loss 3.498037 (-0.84z)| norm 0.2661 (-0.04z)| lr 6.81e-04 | 4169.80 ms | 32.4% bf16 MFU | 124874 tok/s step 5683/19560 | loss 3.505634 (-0.66z)| norm 0.2679 (-0.02z)| lr 6.81e-04 | 4224.11 ms | 32.0% bf16 MFU | 124836 tok/s step 5684/19560 | loss 3.550875 (+0.40z)| norm 0.2321 (-0.44z)| lr 6.81e-04 | 4193.48 ms | 32.2% bf16 MFU | 124846 tok/s step 5685/19560 | loss 3.512488 (-0.50z)| norm 0.2513 (-0.21z)| lr 6.81e-04 | 4181.48 ms | 32.3% bf16 MFU | 124873 tok/s step 5686/19560 | loss 3.382173 (-3.38z)| norm 0.2490 (-0.24z)| lr 6.81e-04 | 4177.94 ms | 32.3% bf16 MFU | 124903 tok/s step 5687/19560 | loss 3.584896 (+1.17z)| norm 0.2447 (-0.29z)| lr 6.81e-04 | 4162.67 
ms | 32.4% bf16 MFU | 124956 tok/s step 5688/19560 | loss 3.494688 (-0.86z)| norm 0.2520 (-0.20z)| lr 6.81e-04 | 4172.04 ms | 32.4% bf16 MFU | 124991 tok/s step 5689/19560 | loss 3.475535 (-1.27z)| norm 0.2356 (-0.39z)| lr 6.81e-04 | 4195.87 ms | 32.2% bf16 MFU | 124989 tok/s step 5690/19560 | loss 3.533332 (+0.02z)| norm 0.2597 (-0.11z)| lr 6.80e-04 | 4179.03 ms | 32.3% bf16 MFU | 125013 tok/s step 5691/19560 | loss 3.514308 (-0.39z)| norm 0.2529 (-0.20z)| lr 6.80e-04 | 4172.42 ms | 32.4% bf16 MFU | 125045 tok/s step 5692/19560 | loss 3.499570 (-0.72z)| norm 0.2892 (+0.22z)| lr 6.80e-04 | 4227.00 ms | 31.9% bf16 MFU | 124994 tok/s step 5693/19560 | loss 3.492085 (-0.89z)| norm 0.2466 (-0.28z)| lr 6.80e-04 | 4168.99 ms | 32.4% bf16 MFU | 125033 tok/s step 5694/19560 | loss 3.503327 (-0.63z)| norm 0.2879 (+0.20z)| lr 6.80e-04 | 4173.36 ms | 32.4% bf16 MFU | 125062 tok/s step 5695/19560 | loss 3.557095 (+0.57z)| norm 0.2864 (+0.19z)| lr 6.80e-04 | 4178.79 ms | 32.3% bf16 MFU | 125082 tok/s step 5696/19560 | loss 3.489376 (-0.94z)| norm 0.2944 (+0.28z)| lr 6.80e-04 | 4173.08 ms | 32.4% bf16 MFU | 125110 tok/s step 5697/19560 | loss 3.510431 (-0.46z)| norm 0.3330 (+0.72z)| lr 6.80e-04 | 4173.76 ms | 32.3% bf16 MFU | 125135 tok/s step 5698/19560 | loss 3.477337 (-1.20z)| norm 0.2919 (+0.24z)| lr 6.80e-04 | 4196.84 ms | 32.2% bf16 MFU | 125125 tok/s step 5699/19560 | loss 3.507292 (-0.50z)| norm 0.2877 (+0.19z)| lr 6.80e-04 | 4194.15 ms | 32.2% bf16 MFU | 125119 tok/s step 5700/19560 | loss 3.616356 (+1.94z)| norm 0.3004 (+0.33z)| lr 6.80e-04 | 4169.92 ms | 32.4% bf16 MFU | 125149 tok/s step 5701/19560 | loss 3.522268 (-0.17z)| norm 0.2816 (+0.11z)| lr 6.80e-04 | 4188.00 ms | 32.2% bf16 MFU | 125151 tok/s step 5702/19560 | loss 3.518484 (-0.26z)| norm 0.2724 (+0.00z)| lr 6.80e-04 | 4185.18 ms | 32.3% bf16 MFU | 125157 tok/s step 5703/19560 | loss 3.537658 (+0.17z)| norm 0.2579 (-0.16z)| lr 6.80e-04 | 4166.68 ms | 32.4% bf16 MFU | 125191 tok/s step 5704/19560 | loss 
3.470466 (-1.33z)| norm 0.2518 (-0.24z)| lr 6.80e-04 | 4192.40 ms | 32.2% bf16 MFU | 125184 tok/s step 5705/19560 | loss 3.535604 (+0.13z)| norm 0.2682 (-0.04z)| lr 6.80e-04 | 4171.99 ms | 32.4% bf16 MFU | 125208 tok/s step 5706/19560 | loss 3.480333 (-1.10z)| norm 0.2549 (-0.20z)| lr 6.80e-04 | 4178.18 ms | 32.3% bf16 MFU | 125222 tok/s step 5707/19560 | loss 3.492891 (-0.82z)| norm 0.2666 (-0.06z)| lr 6.80e-04 | 4254.20 ms | 31.7% bf16 MFU | 125123 tok/s step 5708/19560 | loss 3.426818 (-2.24z)| norm 0.2445 (-0.32z)| lr 6.80e-04 | 4185.89 ms | 32.3% bf16 MFU | 125129 tok/s step 5709/19560 | loss 3.500288 (-0.63z)| norm 0.2499 (-0.26z)| lr 6.80e-04 | 4173.87 ms | 32.3% bf16 MFU | 125154 tok/s step 5710/19560 | loss 3.492249 (-0.79z)| norm 0.2619 (-0.12z)| lr 6.80e-04 | 4192.20 ms | 32.2% bf16 MFU | 125149 tok/s step 5711/19560 | loss 3.472158 (-1.21z)| norm 0.2895 (+0.20z)| lr 6.79e-04 | 4204.96 ms | 32.1% bf16 MFU | 125126 tok/s step 5712/19560 | loss 3.489435 (-0.82z)| norm 0.3094 (+0.43z)| lr 6.79e-04 | 4172.89 ms | 32.4% bf16 MFU | 125152 tok/s step 5713/19560 | loss 3.450474 (-1.65z)| norm 0.2371 (-0.42z)| lr 6.79e-04 | 4184.23 ms | 32.3% bf16 MFU | 125159 tok/s step 5714/19560 | loss 3.467567 (-1.25z)| norm 0.2781 (+0.06z)| lr 6.79e-04 | 4187.04 ms | 32.2% bf16 MFU | 125162 tok/s step 5715/19560 | loss 3.563648 (+0.82z)| norm 0.3028 (+0.35z)| lr 6.79e-04 | 4178.56 ms | 32.3% bf16 MFU | 125177 tok/s step 5716/19560 | loss 3.490541 (-0.75z)| norm 0.3278 (+0.63z)| lr 6.79e-04 | 4188.36 ms | 32.2% bf16 MFU | 125177 tok/s step 5717/19560 | loss 3.528006 (+0.06z)| norm 0.2835 (+0.11z)| lr 6.79e-04 | 4173.30 ms | 32.4% bf16 MFU | 125200 tok/s step 5718/19560 | loss 3.530189 (+0.10z)| norm 0.2559 (-0.21z)| lr 6.79e-04 | 4183.34 ms | 32.3% bf16 MFU | 125206 tok/s step 5719/19560 | loss 3.475097 (-1.10z)| norm 0.2607 (-0.16z)| lr 6.79e-04 | 4183.98 ms | 32.3% bf16 MFU | 125211 tok/s step 5720/19560 | loss 3.495305 (-0.65z)| norm 0.2655 (-0.10z)| lr 6.79e-04 | 4178.06 
ms | 32.3% bf16 MFU | 125225 tok/s step 5721/19560 | loss 3.450917 (-1.59z)| norm 0.2924 (+0.20z)| lr 6.79e-04 | 4183.75 ms | 32.3% bf16 MFU | 125230 tok/s step 5722/19560 | loss 3.519606 (-0.10z)| norm 0.2782 (+0.04z)| lr 6.79e-04 | 4181.79 ms | 32.3% bf16 MFU | 125237 tok/s step 5723/19560 | loss 3.566168 (+0.90z)| norm 0.2689 (-0.07z)| lr 6.79e-04 | 4196.82 ms | 32.2% bf16 MFU | 125221 tok/s step 5724/19560 | loss 3.526136 (+0.04z)| norm 0.2887 (+0.15z)| lr 6.79e-04 | 4175.31 ms | 32.3% bf16 MFU | 125239 tok/s step 5725/19560 | loss 3.542280 (+0.38z)| norm 0.2950 (+0.22z)| lr 6.79e-04 | 4179.75 ms | 32.3% bf16 MFU | 125248 tok/s step 5726/19560 | loss 3.553220 (+0.62z)| norm 0.2838 (+0.09z)| lr 6.79e-04 | 4164.67 ms | 32.4% bf16 MFU | 125281 tok/s step 5727/19560 | loss 3.462995 (-1.31z)| norm 0.2435 (-0.38z)| lr 6.79e-04 | 4201.39 ms | 32.1% bf16 MFU | 125256 tok/s step 5728/19560 | loss 3.501795 (-0.46z)| norm 0.2851 (+0.11z)| lr 6.79e-04 | 4174.86 ms | 32.3% bf16 MFU | 125272 tok/s step 5729/19560 | loss 3.534585 (+0.26z)| norm 0.2501 (-0.30z)| lr 6.79e-04 | 4183.29 ms | 32.3% bf16 MFU | 125275 tok/s step 5730/19560 | loss 3.544862 (+0.48z)| norm 0.2697 (-0.07z)| lr 6.79e-04 | 4177.10 ms | 32.3% bf16 MFU | 125287 tok/s step 5731/19560 | loss 3.488727 (-0.74z)| norm 0.2413 (-0.40z)| lr 6.79e-04 | 4185.11 ms | 32.3% bf16 MFU | 125286 tok/s step 5732/19560 | loss 3.514939 (-0.16z)| norm 0.2464 (-0.34z)| lr 6.78e-04 | 4181.32 ms | 32.3% bf16 MFU | 125292 tok/s step 5733/19560 | loss 3.518579 (-0.08z)| norm 0.2583 (-0.20z)| lr 6.78e-04 | 4176.38 ms | 32.3% bf16 MFU | 125304 tok/s step 5734/19560 | loss 3.470663 (-1.11z)| norm 0.2707 (+0.09z)| lr 6.78e-04 | 4179.76 ms | 32.3% bf16 MFU | 125310 tok/s step 5735/19560 | loss 3.518951 (-0.03z)| norm 0.2499 (-0.66z)| lr 6.78e-04 | 4180.75 ms | 32.3% bf16 MFU | 125315 tok/s step 5736/19560 | loss 3.465498 (-1.22z)| norm 0.2556 (-0.44z)| lr 6.78e-04 | 4189.35 ms | 32.2% bf16 MFU | 125307 tok/s step 5737/19560 | loss 
3.470918 (-1.08z)| norm 0.2576 (-0.34z)| lr 6.78e-04 | 4182.04 ms | 32.3% bf16 MFU | 125310 tok/s step 5738/19560 | loss 3.499192 (-0.44z)| norm 0.2409 (-1.02z)| lr 6.78e-04 | 4195.80 ms | 32.2% bf16 MFU | 125292 tok/s step 5739/19560 | loss 3.530290 (+0.26z)| norm 0.2465 (-0.78z)| lr 6.78e-04 | 4189.88 ms | 32.2% bf16 MFU | 125284 tok/s step 5740/19560 | loss 3.638507 (+2.67z)| norm 0.2630 (-0.06z)| lr 6.78e-04 | 4166.56 ms | 32.4% bf16 MFU | 125311 tok/s step 5741/19560 | loss 3.530136 (+0.28z)| norm 0.2521 (-0.53z)| lr 6.78e-04 | 4190.99 ms | 32.2% bf16 MFU | 125301 tok/s step 5742/19560 | loss 3.571644 (+1.22z)| norm 0.2433 (-0.90z)| lr 6.78e-04 | 4180.82 ms | 32.3% bf16 MFU | 125306 tok/s step 5743/19560 | loss 3.438373 (-1.78z)| norm 0.2632 (-0.04z)| lr 6.78e-04 | 4171.94 ms | 32.4% bf16 MFU | 125324 tok/s step 5744/19560 | loss 3.514193 (-0.05z)| norm 0.2345 (-1.29z)| lr 6.78e-04 | 4172.68 ms | 32.4% bf16 MFU | 125340 tok/s step 5745/19560 | loss 3.484830 (-0.72z)| norm 0.2642 (+0.00z)| lr 6.78e-04 | 4175.09 ms | 32.3% bf16 MFU | 125352 tok/s step 5746/19560 | loss 3.496619 (-0.44z)| norm 0.2613 (-0.14z)| lr 6.78e-04 | 4183.26 ms | 32.3% bf16 MFU | 125351 tok/s step 5747/19560 | loss 3.490442 (-0.58z)| norm 0.2445 (-0.87z)| lr 6.78e-04 | 4167.33 ms | 32.4% bf16 MFU | 125374 tok/s step 5748/19560 | loss 3.516754 (+0.04z)| norm 0.2638 (-0.02z)| lr 6.78e-04 | 4180.06 ms | 32.3% bf16 MFU | 125376 tok/s step 5749/19560 | loss 3.490499 (-0.56z)| norm 0.2632 (-0.04z)| lr 6.78e-04 | 4179.42 ms | 32.3% bf16 MFU | 125380 tok/s step 5750/19560 | loss 3.458975 (-1.27z)| norm 0.2561 (-0.36z)| lr 6.78e-04 | 4165.89 ms | 32.4% bf16 MFU | 125404 tok/s val loss 3.510865 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating 
HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 
740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2780/10042 = 0.276837 step 5751/19560 | loss 3.480041 (-0.78z)| norm 0.2552 (-0.39z)| lr 6.78e-04 | 4607.05 ms | 29.3% bf16 MFU | 124823 tok/s step 5752/19560 | loss 3.437034 (-1.73z)| norm 0.2642 (-0.01z)| lr 6.77e-04 | 4441.34 ms | 30.4% bf16 MFU | 124485 tok/s step 5753/19560 | loss 3.496174 (-0.38z)| norm 0.2380 (-1.15z)| lr 6.77e-04 | 4260.70 ms | 31.7% bf16 MFU | 
124413 tok/s step 5754/19560 | loss 3.508479 (-0.09z)| norm 0.2491 (-0.67z)| lr 6.77e-04 | 4329.41 ms | 31.2% bf16 MFU | 124247 tok/s step 5755/19560 | loss 3.530799 (+0.42z)| norm 0.2364 (-1.21z)| lr 6.77e-04 | 4240.01 ms | 31.8% bf16 MFU | 124218 tok/s step 5756/19560 | loss 3.428898 (-1.87z)| norm 0.2580 (-0.26z)| lr 6.77e-04 | 4168.39 ms | 32.4% bf16 MFU | 124296 tok/s step 5757/19560 | loss 3.467159 (-0.99z)| norm 0.2694 (+0.23z)| lr 6.77e-04 | 4315.93 ms | 31.3% bf16 MFU | 124155 tok/s step 5758/19560 | loss 3.487258 (-0.52z)| norm 0.2394 (-1.09z)| lr 6.77e-04 | 4161.93 ms | 32.4% bf16 MFU | 124245 tok/s step 5759/19560 | loss 3.538454 (+0.68z)| norm 0.2646 (+0.01z)| lr 6.77e-04 | 4202.27 ms | 32.1% bf16 MFU | 124271 tok/s step 5760/19560 | loss 3.490306 (-0.44z)| norm 0.2703 (+0.26z)| lr 6.77e-04 | 4169.44 ms | 32.4% bf16 MFU | 124345 tok/s step 5761/19560 | loss 3.495199 (-0.33z)| norm 0.2682 (+0.16z)| lr 6.77e-04 | 4198.77 ms | 32.2% bf16 MFU | 124371 tok/s step 5762/19560 | loss 3.521590 (+0.31z)| norm 0.2456 (-0.84z)| lr 6.77e-04 | 4172.37 ms | 32.4% bf16 MFU | 124435 tok/s step 5763/19560 | loss 3.451698 (-1.32z)| norm 0.2664 (+0.08z)| lr 6.77e-04 | 4166.25 ms | 32.4% bf16 MFU | 124506 tok/s step 5764/19560 | loss 3.533448 (+0.60z)| norm 0.2821 (+0.77z)| lr 6.77e-04 | 4294.00 ms | 31.4% bf16 MFU | 124385 tok/s step 5765/19560 | loss 3.476026 (-0.75z)| norm 0.2897 (+1.09z)| lr 6.77e-04 | 4207.67 ms | 32.1% bf16 MFU | 124396 tok/s step 5766/19560 | loss 3.436760 (-1.64z)| norm 0.2769 (+0.51z)| lr 6.77e-04 | 4224.55 ms | 32.0% bf16 MFU | 124382 tok/s step 5767/19560 | loss 3.483493 (-0.54z)| norm 0.2478 (-0.80z)| lr 6.77e-04 | 4184.71 ms | 32.3% bf16 MFU | 124427 tok/s step 5768/19560 | loss 3.535707 (+0.68z)| norm 0.2730 (+0.33z)| lr 6.77e-04 | 4181.27 ms | 32.3% bf16 MFU | 124475 tok/s step 5769/19560 | loss 3.516657 (+0.23z)| norm 0.2625 (-0.13z)| lr 6.77e-04 | 4172.91 ms | 32.4% bf16 MFU | 124533 tok/s step 5770/19560 | loss 3.505120 (-0.03z)| norm 
0.2650 (-0.02z)| lr 6.77e-04 | 4178.51 ms | 32.3% bf16 MFU | 124580 tok/s step 5771/19560 | loss 3.542454 (+0.83z)| norm 0.2572 (-0.36z)| lr 6.77e-04 | 4197.89 ms | 32.2% bf16 MFU | 124596 tok/s step 5772/19560 | loss 3.550071 (+1.00z)| norm 0.3036 (+1.70z)| lr 6.77e-04 | 4167.49 ms | 32.4% bf16 MFU | 124656 tok/s step 5773/19560 | loss 3.583622 (+1.74z)| norm 0.2654 (+0.01z)| lr 6.76e-04 | 4178.91 ms | 32.3% bf16 MFU | 124697 tok/s step 5774/19560 | loss 3.473289 (-0.80z)| norm 0.2638 (-0.06z)| lr 6.76e-04 | 4271.87 ms | 31.6% bf16 MFU | 124598 tok/s step 5775/19560 | loss 3.557588 (+1.13z)| norm 0.2786 (+0.61z)| lr 6.76e-04 | 4172.64 ms | 32.4% bf16 MFU | 124651 tok/s step 5776/19560 | loss 3.528425 (+0.46z)| norm 0.3085 (+1.92z)| lr 6.76e-04 | 4402.34 ms | 30.7% bf16 MFU | 124373 tok/s step 5777/19560 | loss 3.518939 (+0.24z)| norm 0.2902 (+1.08z)| lr 6.76e-04 | 4170.51 ms | 32.4% bf16 MFU | 124440 tok/s step 5778/19560 | loss 3.469132 (-0.91z)| norm 0.2742 (+0.36z)| lr 6.76e-04 | 4168.35 ms | 32.4% bf16 MFU | 124507 tok/s step 5779/19560 | loss 3.508148 (-0.01z)| norm 0.2584 (-0.34z)| lr 6.76e-04 | 4174.11 ms | 32.3% bf16 MFU | 124562 tok/s step 5780/19560 | loss 3.525830 (+0.39z)| norm 0.2724 (+0.28z)| lr 6.76e-04 | 4318.62 ms | 31.3% bf16 MFU | 124404 tok/s step 5781/19560 | loss 3.568530 (+1.36z)| norm 0.2936 (+1.21z)| lr 6.76e-04 | 4163.57 ms | 32.4% bf16 MFU | 124480 tok/s step 5782/19560 | loss 3.499599 (-0.23z)| norm 0.2842 (+0.77z)| lr 6.76e-04 | 4183.21 ms | 32.3% bf16 MFU | 124522 tok/s step 5783/19560 | loss 3.506863 (-0.07z)| norm 0.2560 (-0.49z)| lr 6.76e-04 | 4174.38 ms | 32.3% bf16 MFU | 124576 tok/s step 5784/19560 | loss 3.452523 (-1.30z)| norm 0.2744 (+0.32z)| lr 6.76e-04 | 4185.54 ms | 32.3% bf16 MFU | 124610 tok/s step 5785/19560 | loss 3.540528 (+0.73z)| norm 0.2854 (+0.81z)| lr 6.76e-04 | 4183.43 ms | 32.3% bf16 MFU | 124646 tok/s step 5786/19560 | loss 3.477273 (-0.73z)| norm 0.2882 (+0.93z)| lr 6.76e-04 | 4179.30 ms | 32.3% bf16 MFU | 
124686 tok/s step 5787/19560 | loss 3.530809 (+0.53z)| norm 0.2630 (-0.24z)| lr 6.76e-04 | 4168.61 ms | 32.4% bf16 MFU | 124740 tok/s step 5788/19560 | loss 3.561169 (+1.23z)| norm 0.2571 (-0.52z)| lr 6.76e-04 | 4161.39 ms | 32.4% bf16 MFU | 124803 tok/s step 5789/19560 | loss 3.537997 (+0.68z)| norm 0.2692 (+0.04z)| lr 6.76e-04 | 4178.87 ms | 32.3% bf16 MFU | 124836 tok/s step 5790/19560 | loss 3.485029 (-0.56z)| norm 0.2576 (-0.50z)| lr 6.76e-04 | 4179.83 ms | 32.3% bf16 MFU | 124866 tok/s step 5791/19560 | loss 3.412243 (-2.21z)| norm 0.2835 (+0.70z)| lr 6.76e-04 | 4188.16 ms | 32.2% bf16 MFU | 124881 tok/s step 5792/19560 | loss 3.503632 (-0.11z)| norm 0.2644 (-0.19z)| lr 6.76e-04 | 4183.12 ms | 32.3% bf16 MFU | 124904 tok/s step 5793/19560 | loss 3.431365 (-1.75z)| norm 0.2532 (-0.71z)| lr 6.75e-04 | 4171.06 ms | 32.4% bf16 MFU | 124944 tok/s step 5794/19560 | loss 3.590997 (+1.89z)| norm 0.2833 (+0.75z)| lr 6.75e-04 | 4654.30 ms | 29.0% bf16 MFU | 124329 tok/s step 5795/19560 | loss 3.541639 (+0.74z)| norm 0.3214 (+2.52z)| lr 6.75e-04 | 4175.51 ms | 32.3% bf16 MFU | 124391 tok/s step 5796/19560 | loss 3.467550 (-0.95z)| norm 0.2605 (-0.36z)| lr 6.75e-04 | 4171.79 ms | 32.4% bf16 MFU | 124455 tok/s step 5797/19560 | loss 3.477080 (-0.72z)| norm 0.2625 (-0.28z)| lr 6.75e-04 | 4300.50 ms | 31.4% bf16 MFU | 124328 tok/s step 5798/19560 | loss 3.489904 (-0.41z)| norm 0.2582 (-0.49z)| lr 6.75e-04 | 4182.07 ms | 32.3% bf16 MFU | 124380 tok/s step 5799/19560 | loss 3.584110 (+1.76z)| norm 0.2408 (-1.31z)| lr 6.75e-04 | 4182.80 ms | 32.3% bf16 MFU | 124428 tok/s step 5800/19560 | loss 3.506513 (-0.03z)| norm 0.2385 (-1.40z)| lr 6.75e-04 | 4167.36 ms | 32.4% bf16 MFU | 124497 tok/s step 5801/19560 | loss 3.509503 (+0.07z)| norm 0.2399 (-1.35z)| lr 6.75e-04 | 4165.91 ms | 32.4% bf16 MFU | 124565 tok/s step 5802/19560 | loss 3.471539 (-0.85z)| norm 0.2593 (-0.40z)| lr 6.75e-04 | 4432.48 ms | 30.5% bf16 MFU | 124250 tok/s step 5803/19560 | loss 3.446118 (-1.46z)| norm 
0.2473 (-0.98z)| lr 6.75e-04 | 4184.92 ms | 32.3% bf16 MFU | 124302 tok/s step 5804/19560 | loss 3.499747 (-0.14z)| norm 0.2399 (-1.34z)| lr 6.75e-04 | 4166.79 ms | 32.4% bf16 MFU | 124378 tok/s step 5805/19560 | loss 3.502294 (-0.08z)| norm 0.2701 (+0.18z)| lr 6.75e-04 | 4175.01 ms | 32.3% bf16 MFU | 124438 tok/s step 5806/19560 | loss 3.521351 (+0.39z)| norm 0.2762 (+0.49z)| lr 6.75e-04 | 4186.50 ms | 32.3% bf16 MFU | 124478 tok/s step 5807/19560 | loss 3.507033 (+0.04z)| norm 0.2371 (-1.45z)| lr 6.75e-04 | 4156.56 ms | 32.5% bf16 MFU | 124561 tok/s step 5808/19560 | loss 3.550584 (+1.10z)| norm 0.2965 (+1.49z)| lr 6.75e-04 | 4179.72 ms | 32.3% bf16 MFU | 124604 tok/s step 5809/19560 | loss 3.448346 (-1.39z)| norm 0.3366 (+3.31z)| lr 6.75e-04 | 4175.55 ms | 32.3% bf16 MFU | 124652 tok/s step 5810/19560 | loss 3.540010 (+0.85z)| norm 0.3165 (+2.29z)| lr 6.75e-04 | 4185.42 ms | 32.3% bf16 MFU | 124683 tok/s step 5811/19560 | loss 3.494683 (-0.26z)| norm 0.2704 (+0.14z)| lr 6.75e-04 | 4168.99 ms | 32.4% bf16 MFU | 124737 tok/s step 5812/19560 | loss 3.470864 (-0.83z)| norm 0.2624 (-0.25z)| lr 6.75e-04 | 4171.00 ms | 32.4% bf16 MFU | 124785 tok/s step 5813/19560 | loss 3.516931 (+0.30z)| norm 0.2562 (-0.54z)| lr 6.74e-04 | 4197.52 ms | 32.2% bf16 MFU | 124791 tok/s step 5814/19560 | loss 3.519265 (+0.34z)| norm 0.3001 (+1.50z)| lr 6.74e-04 | 4169.91 ms | 32.4% bf16 MFU | 124838 tok/s step 5815/19560 | loss 3.531376 (+0.67z)| norm 0.2745 (+0.29z)| lr 6.74e-04 | 4393.33 ms | 30.7% bf16 MFU | 124563 tok/s step 5816/19560 | loss 3.566857 (+1.55z)| norm 0.2929 (+1.14z)| lr 6.74e-04 | 4176.56 ms | 32.3% bf16 MFU | 124611 tok/s step 5817/19560 | loss 3.552948 (+1.18z)| norm 0.3028 (+1.58z)| lr 6.74e-04 | 4174.43 ms | 32.3% bf16 MFU | 124660 tok/s step 5818/19560 | loss 3.477743 (-0.72z)| norm 0.2691 (-0.00z)| lr 6.74e-04 | 4166.97 ms | 32.4% bf16 MFU | 124718 tok/s step 5819/19560 | loss 3.562373 (+1.41z)| norm 0.2570 (-0.58z)| lr 6.74e-04 | 4302.72 ms | 31.4% bf16 MFU | 
124575 tok/s step 5820/19560 | loss 3.497588 (-0.23z)| norm 0.2606 (-0.40z)| lr 6.74e-04 | 4164.80 ms | 32.4% bf16 MFU | 124641 tok/s step 5821/19560 | loss 3.462272 (-1.11z)| norm 0.2536 (-0.73z)| lr 6.74e-04 | 4171.56 ms | 32.4% bf16 MFU | 124693 tok/s step 5822/19560 | loss 3.540894 (+0.86z)| norm 0.2590 (-0.46z)| lr 6.74e-04 | 4187.30 ms | 32.2% bf16 MFU | 124718 tok/s step 5823/19560 | loss 3.399801 (-2.59z)| norm 0.2486 (-0.94z)| lr 6.74e-04 | 4173.44 ms | 32.4% bf16 MFU | 124764 tok/s step 5824/19560 | loss 3.508398 (+0.07z)| norm 0.2451 (-1.09z)| lr 6.74e-04 | 4165.58 ms | 32.4% bf16 MFU | 124819 tok/s step 5825/19560 | loss 3.439933 (-1.58z)| norm 0.2453 (-1.09z)| lr 6.74e-04 | 4194.23 ms | 32.2% bf16 MFU | 124828 tok/s step 5826/19560 | loss 3.535130 (+0.72z)| norm 0.2415 (-1.25z)| lr 6.74e-04 | 4166.60 ms | 32.4% bf16 MFU | 124878 tok/s step 5827/19560 | loss 3.503788 (-0.04z)| norm 0.2713 (+0.21z)| lr 6.74e-04 | 5132.72 ms | 26.3% bf16 MFU | 123741 tok/s step 5828/19560 | loss 3.574804 (+1.73z)| norm 0.2493 (-0.85z)| lr 6.74e-04 | 4574.88 ms | 29.5% bf16 MFU | 123284 tok/s step 5829/19560 | loss 3.521705 (+0.41z)| norm 0.2476 (-0.93z)| lr 6.74e-04 | 4868.79 ms | 27.7% bf16 MFU | 122504 tok/s step 5830/19560 | loss 3.564211 (+1.44z)| norm 0.2803 (+0.69z)| lr 6.74e-04 | 4500.81 ms | 30.0% bf16 MFU | 122203 tok/s step 5831/19560 | loss 3.479162 (-0.63z)| norm 0.2651 (-0.06z)| lr 6.74e-04 | 4873.05 ms | 27.7% bf16 MFU | 121473 tok/s step 5832/19560 | loss 3.515393 (+0.25z)| norm 0.2517 (-0.73z)| lr 6.74e-04 | 4295.81 ms | 31.4% bf16 MFU | 121501 tok/s step 5833/19560 | loss 3.485781 (-0.47z)| norm 0.2495 (-0.82z)| lr 6.74e-04 | 4510.10 ms | 29.9% bf16 MFU | 121239 tok/s step 5834/19560 | loss 3.461202 (-1.07z)| norm 0.2266 (-1.92z)| lr 6.73e-04 | 4163.45 ms | 32.4% bf16 MFU | 121473 tok/s step 5835/19560 | loss 3.575829 (+1.71z)| norm 0.2282 (-1.80z)| lr 6.73e-04 | 4337.95 ms | 31.1% bf16 MFU | 121443 tok/s step 5836/19560 | loss 3.555198 (+1.20z)| norm 
0.2415 (-1.16z)| lr 6.73e-04 | 4246.54 ms | 31.8% bf16 MFU | 121544 tok/s step 5837/19560 | loss 3.546130 (+0.96z)| norm 0.2160 (-2.32z)| lr 6.73e-04 | 4221.56 ms | 32.0% bf16 MFU | 121676 tok/s step 5838/19560 | loss 3.490802 (-0.39z)| norm 0.2222 (-1.99z)| lr 6.73e-04 | 4169.62 ms | 32.4% bf16 MFU | 121879 tok/s step 5839/19560 | loss 3.486995 (-0.49z)| norm 0.2447 (-0.93z)| lr 6.73e-04 | 4376.38 ms | 30.9% bf16 MFU | 121775 tok/s step 5840/19560 | loss 3.469915 (-0.90z)| norm 0.2253 (-1.80z)| lr 6.73e-04 | 4276.28 ms | 31.6% bf16 MFU | 121817 tok/s step 5841/19560 | loss 3.507236 (+0.00z)| norm 0.2433 (-0.97z)| lr 6.73e-04 | 4220.94 ms | 32.0% bf16 MFU | 121936 tok/s step 5842/19560 | loss 3.500569 (-0.17z)| norm 0.2598 (-0.20z)| lr 6.73e-04 | 4163.80 ms | 32.4% bf16 MFU | 122135 tok/s step 5843/19560 | loss 3.452330 (-1.34z)| norm 0.2411 (-1.06z)| lr 6.73e-04 | 4226.50 ms | 31.9% bf16 MFU | 122231 tok/s step 5844/19560 | loss 3.497571 (-0.22z)| norm 0.2509 (-0.59z)| lr 6.73e-04 | 4167.48 ms | 32.4% bf16 MFU | 122410 tok/s step 5845/19560 | loss 3.549845 (+1.06z)| norm 0.2532 (-0.46z)| lr 6.73e-04 | 4355.85 ms | 31.0% bf16 MFU | 122307 tok/s step 5846/19560 | loss 3.527425 (+0.51z)| norm 0.2507 (-0.59z)| lr 6.73e-04 | 4173.11 ms | 32.4% bf16 MFU | 122474 tok/s step 5847/19560 | loss 3.505733 (-0.03z)| norm 0.2646 (+0.09z)| lr 6.73e-04 | 4301.93 ms | 31.4% bf16 MFU | 122444 tok/s step 5848/19560 | loss 3.527540 (+0.50z)| norm 0.2476 (-0.73z)| lr 6.73e-04 | 6333.66 ms | 21.3% bf16 MFU | 120460 tok/s step 5849/19560 | loss 3.505027 (-0.07z)| norm 0.2488 (-0.66z)| lr 6.73e-04 | 4210.57 ms | 32.1% bf16 MFU | 120663 tok/s step 5850/19560 | loss 3.465067 (-1.05z)| norm 0.2392 (-1.11z)| lr 6.73e-04 | 4172.85 ms | 32.4% bf16 MFU | 120912 tok/s step 5851/19560 | loss 3.534954 (+0.70z)| norm 0.2572 (-0.23z)| lr 6.73e-04 | 4162.02 ms | 32.4% bf16 MFU | 121165 tok/s step 5852/19560 | loss 3.487915 (-0.47z)| norm 0.2571 (-0.22z)| lr 6.73e-04 | 4161.56 ms | 32.4% bf16 MFU | 
121406 tok/s step 5853/19560 | loss 3.484807 (-0.54z)| norm 0.2391 (-1.09z)| lr 6.73e-04 | 4163.44 ms | 32.4% bf16 MFU | 121632 tok/s step 5854/19560 | loss 3.511951 (+0.15z)| norm 0.2436 (-0.86z)| lr 6.72e-04 | 4244.25 ms | 31.8% bf16 MFU | 121727 tok/s step 5855/19560 | loss 3.493214 (-0.33z)| norm 0.2408 (-1.00z)| lr 6.72e-04 | 4176.27 ms | 32.3% bf16 MFU | 121917 tok/s step 5856/19560 | loss 3.586521 (+1.98z)| norm 0.2292 (-1.54z)| lr 6.72e-04 | 4257.06 ms | 31.7% bf16 MFU | 121979 tok/s step 5857/19560 | loss 3.555982 (+1.21z)| norm 0.2790 (+0.91z)| lr 6.72e-04 | 4167.85 ms | 32.4% bf16 MFU | 122170 tok/s step 5858/19560 | loss 3.557569 (+1.25z)| norm 0.2675 (+0.34z)| lr 6.72e-04 | 4164.19 ms | 32.4% bf16 MFU | 122357 tok/s step 5859/19560 | loss 3.552057 (+1.09z)| norm 0.2840 (+1.14z)| lr 6.72e-04 | 4158.94 ms | 32.5% bf16 MFU | 122542 tok/s step 5860/19560 | loss 3.485913 (-0.53z)| norm 0.2497 (-0.55z)| lr 6.72e-04 | 4184.68 ms | 32.3% bf16 MFU | 122679 tok/s step 5861/19560 | loss 3.492213 (-0.37z)| norm 0.2755 (+0.71z)| lr 6.72e-04 | 4215.84 ms | 32.0% bf16 MFU | 122764 tok/s step 5862/19560 | loss 3.463822 (-1.06z)| norm 0.2550 (-0.30z)| lr 6.72e-04 | 4173.30 ms | 32.4% bf16 MFU | 122907 tok/s step 5863/19560 | loss 3.483755 (-0.57z)| norm 0.2230 (-1.84z)| lr 6.72e-04 | 4181.34 ms | 32.3% bf16 MFU | 123031 tok/s step 5864/19560 | loss 3.449607 (-1.39z)| norm 0.2417 (-0.92z)| lr 6.72e-04 | 4181.11 ms | 32.3% bf16 MFU | 123149 tok/s step 5865/19560 | loss 3.464556 (-1.03z)| norm 0.2398 (-1.00z)| lr 6.72e-04 | 4198.03 ms | 32.2% bf16 MFU | 123236 tok/s step 5866/19560 | loss 3.482359 (-0.59z)| norm 0.2366 (-1.15z)| lr 6.72e-04 | 4180.48 ms | 32.3% bf16 MFU | 123345 tok/s step 5867/19560 | loss 3.464361 (-1.01z)| norm 0.2290 (-1.50z)| lr 6.72e-04 | 4173.69 ms | 32.3% bf16 MFU | 123458 tok/s step 5868/19560 | loss 3.525395 (+0.51z)| norm 0.2331 (-1.29z)| lr 6.72e-04 | 4177.17 ms | 32.3% bf16 MFU | 123561 tok/s step 5869/19560 | loss 3.497186 (-0.19z)| norm 
0.2466 (-0.64z)| lr 6.72e-04 | 4192.58 ms | 32.2% bf16 MFU | 123636 tok/s step 5870/19560 | loss 3.511919 (+0.19z)| norm 0.2575 (-0.13z)| lr 6.72e-04 | 4160.88 ms | 32.4% bf16 MFU | 123754 tok/s step 5871/19560 | loss 3.510469 (+0.14z)| norm 0.2623 (+0.10z)| lr 6.72e-04 | 4188.80 ms | 32.2% bf16 MFU | 123825 tok/s step 5872/19560 | loss 3.537347 (+0.83z)| norm 0.2632 (+0.13z)| lr 6.72e-04 | 4180.20 ms | 32.3% bf16 MFU | 123904 tok/s step 5873/19560 | loss 3.551845 (+1.19z)| norm 0.2519 (-0.40z)| lr 6.72e-04 | 4184.10 ms | 32.3% bf16 MFU | 123974 tok/s step 5874/19560 | loss 3.550695 (+1.14z)| norm 0.2589 (-0.07z)| lr 6.71e-04 | 4173.19 ms | 32.4% bf16 MFU | 124057 tok/s step 5875/19560 | loss 3.490734 (-0.39z)| norm 0.2847 (+1.15z)| lr 6.71e-04 | 4189.55 ms | 32.2% bf16 MFU | 124112 tok/s step 5876/19560 | loss 3.494222 (-0.30z)| norm 0.2810 (+0.97z)| lr 6.71e-04 | 4190.53 ms | 32.2% bf16 MFU | 124162 tok/s step 5877/19560 | loss 3.491428 (-0.37z)| norm 0.2411 (-0.92z)| lr 6.71e-04 | 4181.01 ms | 32.3% bf16 MFU | 124223 tok/s step 5878/19560 | loss 3.506217 (-0.00z)| norm 0.2291 (-1.47z)| lr 6.71e-04 | 4171.32 ms | 32.4% bf16 MFU | 124297 tok/s step 5879/19560 | loss 3.489730 (-0.43z)| norm 0.2768 (+0.77z)| lr 6.71e-04 | 4199.13 ms | 32.2% bf16 MFU | 124325 tok/s step 5880/19560 | loss 3.481883 (-0.65z)| norm 0.2467 (-0.64z)| lr 6.71e-04 | 4190.38 ms | 32.2% bf16 MFU | 124364 tok/s step 5881/19560 | loss 3.538719 (+0.82z)| norm 0.2516 (-0.42z)| lr 6.71e-04 | 4178.77 ms | 32.3% bf16 MFU | 124419 tok/s step 5882/19560 | loss 3.504975 (-0.05z)| norm 0.2682 (+0.36z)| lr 6.71e-04 | 4197.85 ms | 32.2% bf16 MFU | 124443 tok/s step 5883/19560 | loss 3.513208 (+0.16z)| norm 0.2693 (+0.40z)| lr 6.71e-04 | 4181.77 ms | 32.3% bf16 MFU | 124490 tok/s step 5884/19560 | loss 3.470127 (-0.98z)| norm 0.3028 (+1.94z)| lr 6.71e-04 | 4190.44 ms | 32.2% bf16 MFU | 124521 tok/s step 5885/19560 | loss 3.478275 (-0.77z)| norm 0.2489 (-0.57z)| lr 6.71e-04 | 4182.34 ms | 32.3% bf16 MFU | 
124563 tok/s step 5886/19560 | loss 3.475143 (-0.85z)| norm 0.2418 (-0.90z)| lr 6.71e-04 | 4183.61 ms | 32.3% bf16 MFU | 124601 tok/s step 5887/19560 | loss 3.486907 (-0.53z)| norm 0.2422 (-0.87z)| lr 6.71e-04 | 4181.17 ms | 32.3% bf16 MFU | 124640 tok/s step 5888/19560 | loss 3.476004 (-0.81z)| norm 0.2447 (-0.75z)| lr 6.71e-04 | 4198.91 ms | 32.2% bf16 MFU | 124651 tok/s step 5889/19560 | loss 3.440832 (-1.71z)| norm 0.2615 (+0.04z)| lr 6.71e-04 | 4188.58 ms | 32.2% bf16 MFU | 124677 tok/s step 5890/19560 | loss 3.473814 (-0.84z)| norm 0.2417 (-0.88z)| lr 6.71e-04 | 4194.27 ms | 32.2% bf16 MFU | 124693 tok/s step 5891/19560 | loss 3.568574 (+1.61z)| norm 0.2411 (-0.89z)| lr 6.71e-04 | 4176.64 ms | 32.3% bf16 MFU | 124735 tok/s step 5892/19560 | loss 3.515918 (+0.24z)| norm 0.2392 (-0.97z)| lr 6.71e-04 | 4184.52 ms | 32.3% bf16 MFU | 124763 tok/s step 5893/19560 | loss 3.485346 (-0.56z)| norm 0.2260 (-1.55z)| lr 6.71e-04 | 4181.63 ms | 32.3% bf16 MFU | 124794 tok/s step 5894/19560 | loss 3.462912 (-1.16z)| norm 0.2534 (-0.28z)| lr 6.70e-04 | 4179.14 ms | 32.3% bf16 MFU | 124827 tok/s step 5895/19560 | loss 3.524799 (+0.46z)| norm 0.2499 (-0.44z)| lr 6.70e-04 | 4185.76 ms | 32.3% bf16 MFU | 124848 tok/s step 5896/19560 | loss 3.523594 (+0.43z)| norm 0.2709 (+0.53z)| lr 6.70e-04 | 4179.28 ms | 32.3% bf16 MFU | 124878 tok/s step 5897/19560 | loss 3.510506 (+0.09z)| norm 0.3069 (+2.14z)| lr 6.70e-04 | 4181.71 ms | 32.3% bf16 MFU | 124903 tok/s step 5898/19560 | loss 3.568824 (+1.60z)| norm 0.3380 (+3.37z)| lr 6.70e-04 | 4189.94 ms | 32.2% bf16 MFU | 124915 tok/s step 5899/19560 | loss 3.526659 (+0.50z)| norm 0.3319 (+2.97z)| lr 6.70e-04 | 4191.54 ms | 32.2% bf16 MFU | 124923 tok/s step 5900/19560 | loss 3.551352 (+1.15z)| norm 0.2713 (+0.45z)| lr 6.70e-04 | 4179.30 ms | 32.3% bf16 MFU | 124949 tok/s step 5901/19560 | loss 3.516294 (+0.25z)| norm 0.2682 (+0.32z)| lr 6.70e-04 | 4179.81 ms | 32.3% bf16 MFU | 124973 tok/s step 5902/19560 | loss 3.489403 (-0.47z)| norm 
0.2733 (+0.53z)| lr 6.70e-04 | 4189.82 ms | 32.2% bf16 MFU | 124981 tok/s step 5903/19560 | loss 3.474780 (-0.85z)| norm 0.2712 (+0.44z)| lr 6.70e-04 | 4169.28 ms | 32.4% bf16 MFU | 125020 tok/s step 5904/19560 | loss 3.527575 (+0.56z)| norm 0.2599 (-0.02z)| lr 6.70e-04 | 4181.03 ms | 32.3% bf16 MFU | 125039 tok/s step 5905/19560 | loss 3.505522 (-0.02z)| norm 0.2280 (-1.37z)| lr 6.70e-04 | 4183.96 ms | 32.3% bf16 MFU | 125052 tok/s step 5906/19560 | loss 3.531381 (+0.66z)| norm 0.2848 (+1.07z)| lr 6.70e-04 | 4173.89 ms | 32.3% bf16 MFU | 125080 tok/s step 5907/19560 | loss 3.529076 (+0.59z)| norm 0.2585 (-0.06z)| lr 6.70e-04 | 4181.23 ms | 32.3% bf16 MFU | 125096 tok/s step 5908/19560 | loss 3.525250 (+0.49z)| norm 0.2605 (+0.03z)| lr 6.70e-04 | 4229.35 ms | 31.9% bf16 MFU | 125039 tok/s step 5909/19560 | loss 3.527755 (+0.57z)| norm 0.2813 (+0.93z)| lr 6.70e-04 | 4196.99 ms | 32.2% bf16 MFU | 125033 tok/s step 5910/19560 | loss 3.532854 (+0.70z)| norm 0.2720 (+0.53z)| lr 6.70e-04 | 4187.81 ms | 32.2% bf16 MFU | 125041 tok/s step 5911/19560 | loss 3.524424 (+0.47z)| norm 0.2595 (-0.01z)| lr 6.70e-04 | 4180.73 ms | 32.3% bf16 MFU | 125059 tok/s step 5912/19560 | loss 3.465439 (-1.13z)| norm 0.2664 (+0.30z)| lr 6.70e-04 | 4184.30 ms | 32.3% bf16 MFU | 125071 tok/s step 5913/19560 | loss 3.544393 (+1.00z)| norm 0.2601 (+0.03z)| lr 6.70e-04 | 4177.75 ms | 32.3% bf16 MFU | 125093 tok/s step 5914/19560 | loss 3.486492 (-0.56z)| norm 0.2536 (-0.24z)| lr 6.69e-04 | 4256.58 ms | 31.7% bf16 MFU | 124997 tok/s step 5915/19560 | loss 3.579760 (+1.92z)| norm 0.2522 (-0.30z)| lr 6.69e-04 | 4179.55 ms | 32.3% bf16 MFU | 125019 tok/s step 5916/19560 | loss 3.503573 (-0.10z)| norm 0.2449 (-0.62z)| lr 6.69e-04 | 4228.91 ms | 31.9% bf16 MFU | 124967 tok/s step 5917/19560 | loss 3.513813 (+0.18z)| norm 0.2678 (+0.39z)| lr 6.69e-04 | 4178.78 ms | 32.3% bf16 MFU | 124992 tok/s step 5918/19560 | loss 3.464900 (-1.13z)| norm 0.2574 (-0.07z)| lr 6.69e-04 | 4174.77 ms | 32.3% bf16 MFU | 
125021 tok/s step 5919/19560 | loss 3.474543 (-0.90z)| norm 0.2646 (+0.25z)| lr 6.69e-04 | 4182.84 ms | 32.3% bf16 MFU | 125037 tok/s step 5920/19560 | loss 3.557312 (+1.35z)| norm 0.2639 (+0.23z)| lr 6.69e-04 | 4174.69 ms | 32.3% bf16 MFU | 125065 tok/s step 5921/19560 | loss 3.471495 (-1.01z)| norm 0.2633 (+0.20z)| lr 6.69e-04 | 4181.81 ms | 32.3% bf16 MFU | 125080 tok/s step 5922/19560 | loss 3.486198 (-0.59z)| norm 0.2758 (+0.75z)| lr 6.69e-04 | 4181.96 ms | 32.3% bf16 MFU | 125095 tok/s step 5923/19560 | loss 3.494021 (-0.36z)| norm 0.2573 (-0.05z)| lr 6.69e-04 | 4176.44 ms | 32.3% bf16 MFU | 125117 tok/s step 5924/19560 | loss 3.566398 (+1.65z)| norm 0.2454 (-0.58z)| lr 6.69e-04 | 4188.43 ms | 32.2% bf16 MFU | 125120 tok/s step 5925/19560 | loss 3.496752 (-0.31z)| norm 0.2579 (-0.01z)| lr 6.69e-04 | 4197.50 ms | 32.2% bf16 MFU | 125109 tok/s step 5926/19560 | loss 3.495887 (-0.34z)| norm 0.2858 (+1.24z)| lr 6.69e-04 | 4181.30 ms | 32.3% bf16 MFU | 125123 tok/s step 5927/19560 | loss 3.476773 (-0.87z)| norm 0.2504 (-0.36z)| lr 6.69e-04 | 4216.77 ms | 32.0% bf16 MFU | 125083 tok/s step 5928/19560 | loss 3.477828 (-0.83z)| norm 0.2480 (-0.48z)| lr 6.69e-04 | 4186.53 ms | 32.3% bf16 MFU | 125091 tok/s step 5929/19560 | loss 3.527796 (+0.60z)| norm 0.2529 (-0.26z)| lr 6.69e-04 | 4179.65 ms | 32.3% bf16 MFU | 125108 tok/s step 5930/19560 | loss 3.637380 (+3.52z)| norm 0.2654 (+0.30z)| lr 6.69e-04 | 4206.81 ms | 32.1% bf16 MFU | 125084 tok/s step 5931/19560 | loss 3.486187 (-0.62z)| norm 0.2535 (-0.24z)| lr 6.69e-04 | 4175.11 ms | 32.3% bf16 MFU | 125109 tok/s step 5932/19560 | loss 3.542256 (+0.92z)| norm 0.2730 (+0.64z)| lr 6.69e-04 | 4167.10 ms | 32.4% bf16 MFU | 125144 tok/s step 5933/19560 | loss 3.455895 (-1.43z)| norm 0.2651 (+0.28z)| lr 6.69e-04 | 4181.23 ms | 32.3% bf16 MFU | 125156 tok/s step 5934/19560 | loss 3.496079 (-0.33z)| norm 0.2627 (+0.18z)| lr 6.68e-04 | 4234.90 ms | 31.9% bf16 MFU | 125089 tok/s step 5935/19560 | loss 3.479846 (-0.77z)| norm 
0.2433 (-0.71z)| lr 6.68e-04 | 4176.26 ms | 32.3% bf16 MFU | 125111 tok/s step 5936/19560 | loss 3.470570 (-1.00z)| norm 0.2475 (-0.51z)| lr 6.68e-04 | 4179.09 ms | 32.3% bf16 MFU | 125128 tok/s step 5937/19560 | loss 3.569804 (+1.67z)| norm 0.2560 (-0.09z)| lr 6.68e-04 | 4163.08 ms | 32.4% bf16 MFU | 125169 tok/s step 5938/19560 | loss 3.499765 (-0.23z)| norm 0.2463 (-0.56z)| lr 6.68e-04 | 4181.14 ms | 32.3% bf16 MFU | 125180 tok/s step 5939/19560 | loss 3.518632 (+0.28z)| norm 0.2690 (+0.59z)| lr 6.68e-04 | 4201.46 ms | 32.1% bf16 MFU | 125160 tok/s step 5940/19560 | loss 3.489881 (-0.51z)| norm 0.2878 (+1.52z)| lr 6.68e-04 | 4173.44 ms | 32.4% bf16 MFU | 125184 tok/s step 5941/19560 | loss 3.504143 (-0.12z)| norm 0.2605 (+0.15z)| lr 6.68e-04 | 4165.09 ms | 32.4% bf16 MFU | 125218 tok/s step 5942/19560 | loss 3.463268 (-1.21z)| norm 0.2441 (-0.66z)| lr 6.68e-04 | 4196.03 ms | 32.2% bf16 MFU | 125205 tok/s step 5943/19560 | loss 3.524518 (+0.45z)| norm 0.2531 (-0.19z)| lr 6.68e-04 | 4252.66 ms | 31.7% bf16 MFU | 125109 tok/s step 5944/19560 | loss 3.532186 (+0.67z)| norm 0.2395 (-0.87z)| lr 6.68e-04 | 4174.05 ms | 32.3% bf16 MFU | 125134 tok/s step 5945/19560 | loss 3.499500 (-0.21z)| norm 0.2656 (+0.49z)| lr 6.68e-04 | 4179.22 ms | 32.3% bf16 MFU | 125150 tok/s step 5946/19560 | loss 3.560194 (+1.44z)| norm 0.2540 (-0.11z)| lr 6.68e-04 | 4168.16 ms | 32.4% bf16 MFU | 125181 tok/s step 5947/19560 | loss 3.480097 (-0.75z)| norm 0.2695 (+0.70z)| lr 6.68e-04 | 4186.25 ms | 32.3% bf16 MFU | 125184 tok/s step 5948/19560 | loss 3.505604 (-0.05z)| norm 0.2836 (+1.42z)| lr 6.68e-04 | 4179.31 ms | 32.3% bf16 MFU | 125198 tok/s step 5949/19560 | loss 3.512826 (+0.14z)| norm 0.2579 (+0.07z)| lr 6.68e-04 | 4184.42 ms | 32.3% bf16 MFU | 125202 tok/s step 5950/19560 | loss 3.519186 (+0.33z)| norm 0.2587 (+0.12z)| lr 6.68e-04 | 4172.92 ms | 32.4% bf16 MFU | 125224 tok/s step 5951/19560 | loss 3.537817 (+0.84z)| norm 0.2579 (+0.07z)| lr 6.68e-04 | 4196.61 ms | 32.2% bf16 MFU | 
125210 tok/s step 5952/19560 | loss 3.499357 (-0.26z)| norm 0.2416 (-0.78z)| lr 6.68e-04 | 4183.29 ms | 32.3% bf16 MFU | 125216 tok/s step 5953/19560 | loss 3.461647 (-1.37z)| norm 0.2565 (-0.00z)| lr 6.68e-04 | 4182.02 ms | 32.3% bf16 MFU | 125223 tok/s step 5954/19560 | loss 3.466909 (-1.19z)| norm 0.2675 (+0.56z)| lr 6.67e-04 | 4170.58 ms | 32.4% bf16 MFU | 125248 tok/s step 5955/19560 | loss 3.455663 (-1.50z)| norm 0.2640 (+0.38z)| lr 6.67e-04 | 4204.44 ms | 32.1% bf16 MFU | 125220 tok/s step 5956/19560 | loss 3.481246 (-0.75z)| norm 0.2600 (+0.17z)| lr 6.67e-04 | 4185.21 ms | 32.3% bf16 MFU | 125223 tok/s step 5957/19560 | loss 3.467818 (-1.12z)| norm 0.2791 (+1.15z)| lr 6.67e-04 | 4180.34 ms | 32.3% bf16 MFU | 125232 tok/s step 5958/19560 | loss 3.510972 (+0.14z)| norm 0.2728 (+0.83z)| lr 6.67e-04 | 4173.06 ms | 32.4% bf16 MFU | 125253 tok/s step 5959/19560 | loss 3.512628 (+0.18z)| norm 0.2819 (+1.29z)| lr 6.67e-04 | 4187.55 ms | 32.2% bf16 MFU | 125250 tok/s step 5960/19560 | loss 3.518976 (+0.37z)| norm 0.2974 (+2.04z)| lr 6.67e-04 | 4167.52 ms | 32.4% bf16 MFU | 125278 tok/s step 5961/19560 | loss 3.484561 (-0.64z)| norm 0.2724 (+0.75z)| lr 6.67e-04 | 4212.52 ms | 32.1% bf16 MFU | 125237 tok/s step 5962/19560 | loss 3.584684 (+2.24z)| norm 0.2696 (+0.60z)| lr 6.67e-04 | 4179.74 ms | 32.3% bf16 MFU | 125247 tok/s step 5963/19560 | loss 3.559518 (+1.52z)| norm 0.3015 (+2.19z)| lr 6.67e-04 | 4175.19 ms | 32.3% bf16 MFU | 125263 tok/s step 5964/19560 | loss 3.533117 (+0.76z)| norm 0.2472 (-0.58z)| lr 6.67e-04 | 4171.89 ms | 32.4% bf16 MFU | 125283 tok/s step 5965/19560 | loss 3.479748 (-0.79z)| norm 0.2708 (+0.61z)| lr 6.67e-04 | 4179.26 ms | 32.3% bf16 MFU | 125292 tok/s step 5966/19560 | loss 3.536314 (+0.86z)| norm 0.2635 (+0.22z)| lr 6.67e-04 | 4174.61 ms | 32.3% bf16 MFU | 125307 tok/s step 5967/19560 | loss 3.470777 (-1.05z)| norm 0.2732 (+0.72z)| lr 6.67e-04 | 4184.46 ms | 32.3% bf16 MFU | 125306 tok/s step 5968/19560 | loss 3.471662 (-1.03z)| norm 
0.2533 (-0.35z)| lr 6.67e-04 | 4188.40 ms | 32.2% bf16 MFU | 125300 tok/s step 5969/19560 | loss 3.483243 (-0.68z)| norm 0.2685 (+0.46z)| lr 6.67e-04 | 4164.50 ms | 32.4% bf16 MFU | 125329 tok/s step 5970/19560 | loss 3.422940 (-2.37z)| norm 0.2505 (-0.50z)| lr 6.67e-04 | 4362.03 ms | 31.0% bf16 MFU | 125073 tok/s step 5971/19560 | loss 3.500168 (-0.18z)| norm 0.2732 (+0.70z)| lr 6.67e-04 | 4186.57 ms | 32.3% bf16 MFU | 125080 tok/s step 5972/19560 | loss 3.443690 (-1.77z)| norm 0.2810 (+1.10z)| lr 6.67e-04 | 4188.18 ms | 32.2% bf16 MFU | 125086 tok/s step 5973/19560 | loss 3.598932 (+2.58z)| norm 0.2625 (+0.11z)| lr 6.67e-04 | 4271.25 ms | 31.6% bf16 MFU | 124969 tok/s step 5974/19560 | loss 3.486004 (-0.56z)| norm 0.2589 (-0.09z)| lr 6.66e-04 | 4161.61 ms | 32.4% bf16 MFU | 125019 tok/s step 5975/19560 | loss 3.549267 (+1.19z)| norm 0.2435 (-0.90z)| lr 6.66e-04 | 4171.69 ms | 32.4% bf16 MFU | 125052 tok/s step 5976/19560 | loss 3.510280 (+0.11z)| norm 0.3024 (+2.18z)| lr 6.66e-04 | 4173.20 ms | 32.4% bf16 MFU | 125081 tok/s step 5977/19560 | loss 3.505135 (-0.03z)| norm 0.2835 (+1.17z)| lr 6.66e-04 | 4182.35 ms | 32.3% bf16 MFU | 125095 tok/s step 5978/19560 | loss 3.492358 (-0.39z)| norm 0.2502 (-0.57z)| lr 6.66e-04 | 4175.45 ms | 32.3% bf16 MFU | 125119 tok/s step 5979/19560 | loss 3.567386 (+1.68z)| norm 0.2785 (+0.90z)| lr 6.66e-04 | 4180.83 ms | 32.3% bf16 MFU | 125133 tok/s step 5980/19560 | loss 3.498176 (-0.24z)| norm 0.2614 (+0.00z)| lr 6.66e-04 | 4167.35 ms | 32.4% bf16 MFU | 125167 tok/s step 5981/19560 | loss 3.454255 (-1.44z)| norm 0.2412 (-1.06z)| lr 6.66e-04 | 4175.47 ms | 32.3% bf16 MFU | 125186 tok/s step 5982/19560 | loss 3.512957 (+0.18z)| norm 0.2680 (+0.34z)| lr 6.66e-04 | 4192.69 ms | 32.2% bf16 MFU | 125179 tok/s step 5983/19560 | loss 3.534290 (+0.75z)| norm 0.2540 (-0.40z)| lr 6.66e-04 | 4178.20 ms | 32.3% bf16 MFU | 125195 tok/s step 5984/19560 | loss 3.462085 (-1.22z)| norm 0.2560 (-0.32z)| lr 6.66e-04 | 4181.10 ms | 32.3% bf16 MFU | 
125205 tok/s step 5985/19560 | loss 3.491282 (-0.40z)| norm 0.2485 (-0.70z)| lr 6.66e-04 | 4174.61 ms | 32.3% bf16 MFU | 125224 tok/s step 5986/19560 | loss 3.533458 (+0.80z)| norm 0.2554 (-0.33z)| lr 6.66e-04 | 4181.57 ms | 32.3% bf16 MFU | 125232 tok/s step 5987/19560 | loss 3.555722 (+1.42z)| norm 0.2716 (+0.55z)| lr 6.66e-04 | 4179.85 ms | 32.3% bf16 MFU | 125242 tok/s step 5988/19560 | loss 3.529469 (+0.67z)| norm 0.2441 (-0.93z)| lr 6.66e-04 | 4220.27 ms | 32.0% bf16 MFU | 125191 tok/s step 5989/19560 | loss 3.499338 (-0.18z)| norm 0.2488 (-0.66z)| lr 6.66e-04 | 4179.89 ms | 32.3% bf16 MFU | 125203 tok/s step 5990/19560 | loss 3.469904 (-1.01z)| norm 0.2442 (-0.90z)| lr 6.66e-04 | 4170.50 ms | 32.4% bf16 MFU | 125229 tok/s step 5991/19560 | loss 3.508703 (+0.08z)| norm 0.2234 (-2.02z)| lr 6.66e-04 | 4193.81 ms | 32.2% bf16 MFU | 125218 tok/s step 5992/19560 | loss 3.483921 (-0.63z)| norm 0.2494 (-0.63z)| lr 6.66e-04 | 4175.97 ms | 32.3% bf16 MFU | 125234 tok/s step 5993/19560 | loss 3.459974 (-1.31z)| norm 0.2678 (+0.35z)| lr 6.65e-04 | 4178.81 ms | 32.3% bf16 MFU | 125246 tok/s step 5994/19560 | loss 3.539612 (+0.94z)| norm 0.2527 (-0.48z)| lr 6.65e-04 | 4179.66 ms | 32.3% bf16 MFU | 125256 tok/s step 5995/19560 | loss 3.538558 (+0.89z)| norm 0.2570 (-0.26z)| lr 6.65e-04 | 4178.53 ms | 32.3% bf16 MFU | 125266 tok/s step 5996/19560 | loss 3.502466 (-0.13z)| norm 0.2623 (+0.02z)| lr 6.65e-04 | 4181.23 ms | 32.3% bf16 MFU | 125273 tok/s step 5997/19560 | loss 3.455155 (-1.45z)| norm 0.2634 (+0.07z)| lr 6.65e-04 | 4167.97 ms | 32.4% bf16 MFU | 125298 tok/s step 5998/19560 | loss 3.567337 (+1.68z)| norm 0.2633 (+0.07z)| lr 6.65e-04 | 4175.63 ms | 32.3% bf16 MFU | 125311 tok/s step 5999/19560 | loss 3.513588 (+0.18z)| norm 0.2619 (-0.02z)| lr 6.65e-04 | 4223.15 ms | 32.0% bf16 MFU | 125253 tok/s step 6000/19560 | loss 3.450753 (-1.54z)| norm 0.3005 (+2.08z)| lr 6.65e-04 | 4177.59 ms | 32.3% bf16 MFU | 125266 tok/s val loss 3.496021 evaluating HellaSwag: 0/1256 
evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 
650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2818/10042 = 0.280621 step 6001/19560 | loss 3.451196 (-1.51z)| norm 0.2718 (+0.51z)| lr 
6.65e-04 | 4455.05 ms | 30.3% bf16 MFU | 124886 tok/s step 6002/19560 | loss 3.479228 (-0.72z)| norm 0.2893 (+1.43z)| lr 6.65e-04 | 4246.59 ms | 31.8% bf16 MFU | 124815 tok/s step 6003/19560 | loss 3.543555 (+1.05z)| norm 0.2606 (-0.11z)| lr 6.65e-04 | 4253.23 ms | 31.7% bf16 MFU | 124738 tok/s step 6004/19560 | loss 3.509288 (+0.10z)| norm 0.2940 (+1.69z)| lr 6.65e-04 | 4262.16 ms | 31.7% bf16 MFU | 124651 tok/s step 6005/19560 | loss 3.500550 (-0.14z)| norm 0.2836 (+1.11z)| lr 6.65e-04 | 4288.15 ms | 31.5% bf16 MFU | 124532 tok/s step 6006/19560 | loss 3.532848 (+0.75z)| norm 0.2865 (+1.26z)| lr 6.65e-04 | 4390.31 ms | 30.8% bf16 MFU | 124276 tok/s step 6007/19560 | loss 3.494343 (-0.32z)| norm 0.2833 (+1.07z)| lr 6.65e-04 | 4173.35 ms | 32.4% bf16 MFU | 124344 tok/s step 6008/19560 | loss 3.485169 (-0.57z)| norm 0.2852 (+1.16z)| lr 6.65e-04 | 4202.13 ms | 32.1% bf16 MFU | 124365 tok/s step 6009/19560 | loss 3.467362 (-1.05z)| norm 0.2567 (-0.39z)| lr 6.65e-04 | 4197.28 ms | 32.2% bf16 MFU | 124392 tok/s step 6010/19560 | loss 3.504279 (-0.03z)| norm 0.2342 (-1.59z)| lr 6.65e-04 | 4175.83 ms | 32.3% bf16 MFU | 124451 tok/s step 6011/19560 | loss 3.483162 (-0.61z)| norm 0.2589 (-0.25z)| lr 6.65e-04 | 4357.74 ms | 31.0% bf16 MFU | 124244 tok/s step 6012/19560 | loss 3.542481 (+1.01z)| norm 0.2971 (+1.81z)| lr 6.65e-04 | 4250.36 ms | 31.8% bf16 MFU | 124199 tok/s step 6013/19560 | loss 3.525190 (+0.53z)| norm 0.2647 (+0.06z)| lr 6.64e-04 | 4264.01 ms | 31.7% bf16 MFU | 124137 tok/s step 6014/19560 | loss 3.501683 (-0.13z)| norm 0.2688 (+0.27z)| lr 6.64e-04 | 4259.70 ms | 31.7% bf16 MFU | 124084 tok/s step 6015/19560 | loss 3.470165 (-0.99z)| norm 0.2438 (-1.10z)| lr 6.64e-04 | 4166.85 ms | 32.4% bf16 MFU | 124171 tok/s step 6016/19560 | loss 3.533775 (+0.75z)| norm 0.2450 (-1.03z)| lr 6.64e-04 | 4173.94 ms | 32.3% bf16 MFU | 124243 tok/s step 6017/19560 | loss 3.487572 (-0.54z)| norm 0.2574 (-0.35z)| lr 6.64e-04 | 4408.37 ms | 30.6% bf16 MFU | 123977 tok/s step 
6018/19560 | loss 3.481595 (-0.71z)| norm 0.2604 (-0.19z)| lr 6.64e-04 | 4744.46 ms | 28.5% bf16 MFU | 123304 tok/s step 6019/19560 | loss 3.435342 (-1.97z)| norm 0.2512 (-0.71z)| lr 6.64e-04 | 4488.95 ms | 30.1% bf16 MFU | 122978 tok/s step 6020/19560 | loss 3.482227 (-0.66z)| norm 0.2584 (-0.32z)| lr 6.64e-04 | 4225.68 ms | 32.0% bf16 MFU | 123033 tok/s step 6021/19560 | loss 3.480605 (-0.70z)| norm 0.2489 (-0.87z)| lr 6.64e-04 | 4347.42 ms | 31.1% bf16 MFU | 122911 tok/s step 6022/19560 | loss 3.488360 (-0.49z)| norm 0.2516 (-0.72z)| lr 6.64e-04 | 4264.02 ms | 31.7% bf16 MFU | 122913 tok/s step 6023/19560 | loss 3.526655 (+0.58z)| norm 0.2786 (+0.79z)| lr 6.64e-04 | 4239.78 ms | 31.8% bf16 MFU | 122951 tok/s step 6024/19560 | loss 3.483934 (-0.61z)| norm 0.2508 (-0.77z)| lr 6.64e-04 | 4243.95 ms | 31.8% bf16 MFU | 122980 tok/s step 6025/19560 | loss 3.498220 (-0.20z)| norm 0.2320 (-1.81z)| lr 6.64e-04 | 4164.90 ms | 32.4% bf16 MFU | 123125 tok/s step 6026/19560 | loss 3.443430 (-1.71z)| norm 0.2420 (-1.28z)| lr 6.64e-04 | 4233.64 ms | 31.9% bf16 MFU | 123161 tok/s step 6027/19560 | loss 3.582279 (+2.13z)| norm 0.2511 (-0.75z)| lr 6.64e-04 | 4208.26 ms | 32.1% bf16 MFU | 123232 tok/s step 6028/19560 | loss 3.532650 (+0.77z)| norm 0.5845 (+9.92z)| lr 6.64e-04 | 4396.53 ms | 30.7% bf16 MFU | 123033 tok/s step 6029/19560 | loss 3.529582 (+0.68z)| norm 0.3735 (+3.21z)| lr 6.64e-04 | 4388.92 ms | 30.8% bf16 MFU | 122854 tok/s step 6030/19560 | loss 3.530835 (+0.71z)| norm 0.3421 (+2.21z)| lr 6.64e-04 | 4345.92 ms | 31.1% bf16 MFU | 122743 tok/s step 6031/19560 | loss 3.497396 (-0.22z)| norm 0.2912 (+0.72z)| lr 6.64e-04 | 4170.50 ms | 32.4% bf16 MFU | 122892 tok/s step 6032/19560 | loss 3.458117 (-1.29z)| norm 0.2968 (+0.87z)| lr 6.64e-04 | 4159.62 ms | 32.5% bf16 MFU | 123049 tok/s step 6033/19560 | loss 3.464551 (-1.09z)| norm 0.2838 (+0.49z)| lr 6.63e-04 | 4239.97 ms | 31.8% bf16 MFU | 123080 tok/s step 6034/19560 | loss 3.518217 (+0.38z)| norm 0.2810 (+0.41z)| lr 
6.63e-04 | 4206.19 ms | 32.1% bf16 MFU | 123158 tok/s step 6035/19560 | loss 3.476180 (-0.76z)| norm 0.2596 (-0.22z)| lr 6.63e-04 | 4162.95 ms | 32.4% bf16 MFU | 123297 tok/s step 6036/19560 | loss 3.453646 (-1.36z)| norm 0.2948 (+0.80z)| lr 6.63e-04 | 4173.31 ms | 32.4% bf16 MFU | 123414 tok/s step 6037/19560 | loss 3.502885 (-0.01z)| norm 0.2797 (+0.36z)| lr 6.63e-04 | 4169.49 ms | 32.4% bf16 MFU | 123530 tok/s step 6038/19560 | loss 3.478352 (-0.67z)| norm 0.2645 (-0.08z)| lr 6.63e-04 | 4206.86 ms | 32.1% bf16 MFU | 123585 tok/s step 6039/19560 | loss 3.527630 (+0.68z)| norm 0.2810 (+0.39z)| lr 6.63e-04 | 4168.26 ms | 32.4% bf16 MFU | 123695 tok/s step 6040/19560 | loss 3.597000 (+2.49z)| norm 0.2582 (-0.27z)| lr 6.63e-04 | 4172.10 ms | 32.4% bf16 MFU | 123793 tok/s step 6041/19560 | loss 3.466391 (-0.99z)| norm 0.2667 (-0.02z)| lr 6.63e-04 | 4228.67 ms | 31.9% bf16 MFU | 123803 tok/s step 6042/19560 | loss 3.542944 (+1.04z)| norm 0.2493 (-0.53z)| lr 6.63e-04 | 4166.50 ms | 32.4% bf16 MFU | 123905 tok/s step 6043/19560 | loss 3.464686 (-1.03z)| norm 0.2736 (+0.17z)| lr 6.63e-04 | 4200.00 ms | 32.1% bf16 MFU | 123951 tok/s step 6044/19560 | loss 3.428733 (-1.95z)| norm 0.2553 (-0.36z)| lr 6.63e-04 | 4174.16 ms | 32.3% bf16 MFU | 124033 tok/s step 6045/19560 | loss 3.435884 (-1.73z)| norm 0.2714 (+0.11z)| lr 6.63e-04 | 4261.78 ms | 31.7% bf16 MFU | 123983 tok/s step 6046/19560 | loss 3.521988 (+0.52z)| norm 0.2613 (-0.19z)| lr 6.63e-04 | 4162.20 ms | 32.4% bf16 MFU | 124082 tok/s step 6047/19560 | loss 3.446481 (-1.45z)| norm 0.2244 (-1.25z)| lr 6.63e-04 | 4304.60 ms | 31.4% bf16 MFU | 123968 tok/s step 6048/19560 | loss 3.480018 (-0.56z)| norm 0.2669 (-0.02z)| lr 6.63e-04 | 4159.73 ms | 32.5% bf16 MFU | 124071 tok/s step 6049/19560 | loss 3.582199 (+2.08z)| norm 0.2387 (-0.83z)| lr 6.63e-04 | 4171.83 ms | 32.4% bf16 MFU | 124151 tok/s step 6050/19560 | loss 3.523049 (+0.54z)| norm 0.2704 (+0.09z)| lr 6.63e-04 | 4164.34 ms | 32.4% bf16 MFU | 124239 tok/s step 
6051/19560 | loss 3.488500 (-0.36z)| norm 0.2589 (-0.24z)| lr 6.63e-04 | 4169.78 ms | 32.4% bf16 MFU | 124314 tok/s step 6052/19560 | loss 3.569802 (+1.75z)| norm 0.2270 (-1.16z)| lr 6.62e-04 | 4169.05 ms | 32.4% bf16 MFU | 124386 tok/s step 6053/19560 | loss 3.566898 (+1.64z)| norm 0.2599 (-0.21z)| lr 6.62e-04 | 4171.33 ms | 32.4% bf16 MFU | 124451 tok/s step 6054/19560 | loss 3.448297 (-1.38z)| norm 0.2466 (-0.58z)| lr 6.62e-04 | 4176.42 ms | 32.3% bf16 MFU | 124505 tok/s step 6055/19560 | loss 3.488342 (-0.37z)| norm 0.2389 (-0.80z)| lr 6.62e-04 | 4171.47 ms | 32.4% bf16 MFU | 124564 tok/s step 6056/19560 | loss 3.435791 (-1.68z)| norm 0.2765 (+0.27z)| lr 6.62e-04 | 4168.51 ms | 32.4% bf16 MFU | 124624 tok/s step 6057/19560 | loss 3.497318 (-0.12z)| norm 0.2390 (-0.80z)| lr 6.62e-04 | 4167.66 ms | 32.4% bf16 MFU | 124683 tok/s step 6058/19560 | loss 3.450463 (-1.32z)| norm 0.2405 (-0.75z)| lr 6.62e-04 | 4179.87 ms | 32.3% bf16 MFU | 124721 tok/s step 6059/19560 | loss 3.612236 (+2.82z)| norm 0.2459 (-0.59z)| lr 6.62e-04 | 4197.50 ms | 32.2% bf16 MFU | 124730 tok/s step 6060/19560 | loss 3.476525 (-0.63z)| norm 0.2713 (+0.13z)| lr 6.62e-04 | 4176.49 ms | 32.3% bf16 MFU | 124770 tok/s step 6061/19560 | loss 3.521278 (+0.50z)| norm 0.2703 (+0.11z)| lr 6.62e-04 | 4170.42 ms | 32.4% bf16 MFU | 124817 tok/s step 6062/19560 | loss 3.530393 (+0.73z)| norm 0.2409 (-0.73z)| lr 6.62e-04 | 4183.46 ms | 32.3% bf16 MFU | 124843 tok/s step 6063/19560 | loss 3.501930 (-0.00z)| norm 0.2566 (-0.29z)| lr 6.62e-04 | 4167.93 ms | 32.4% bf16 MFU | 124890 tok/s step 6064/19560 | loss 3.501392 (-0.02z)| norm 0.2731 (+0.18z)| lr 6.62e-04 | 4169.02 ms | 32.4% bf16 MFU | 124933 tok/s step 6065/19560 | loss 3.540290 (+0.99z)| norm 0.2746 (+0.22z)| lr 6.62e-04 | 4175.73 ms | 32.3% bf16 MFU | 124965 tok/s step 6066/19560 | loss 3.488118 (-0.36z)| norm 0.2498 (-0.49z)| lr 6.62e-04 | 4165.86 ms | 32.4% bf16 MFU | 125009 tok/s step 6067/19560 | loss 3.534725 (+0.84z)| norm 0.2467 (-0.57z)| lr 
6.62e-04 | 4167.39 ms | 32.4% bf16 MFU | 125049 tok/s step 6068/19560 | loss 3.489759 (-0.32z)| norm 0.2469 (-0.56z)| lr 6.62e-04 | 4167.31 ms | 32.4% bf16 MFU | 125087 tok/s step 6069/19560 | loss 3.491975 (-0.26z)| norm 0.2341 (-0.92z)| lr 6.62e-04 | 4160.38 ms | 32.5% bf16 MFU | 125134 tok/s step 6070/19560 | loss 3.517990 (+0.40z)| norm 0.2574 (-0.26z)| lr 6.62e-04 | 4174.04 ms | 32.3% bf16 MFU | 125157 tok/s step 6071/19560 | loss 3.499213 (-0.08z)| norm 0.2570 (-0.27z)| lr 6.62e-04 | 4173.12 ms | 32.4% bf16 MFU | 125181 tok/s step 6072/19560 | loss 3.468619 (-0.86z)| norm 0.2532 (-0.38z)| lr 6.61e-04 | 4167.27 ms | 32.4% bf16 MFU | 125213 tok/s step 6073/19560 | loss 3.554479 (+1.35z)| norm 0.2456 (-0.59z)| lr 6.61e-04 | 4195.66 ms | 32.2% bf16 MFU | 125200 tok/s step 6074/19560 | loss 3.489012 (-0.33z)| norm 0.2617 (-0.13z)| lr 6.61e-04 | 4168.77 ms | 32.4% bf16 MFU | 125228 tok/s step 6075/19560 | loss 3.453476 (-1.24z)| norm 0.2404 (-0.73z)| lr 6.61e-04 | 4168.98 ms | 32.4% bf16 MFU | 125255 tok/s step 6076/19560 | loss 3.403606 (-2.45z)| norm 0.2604 (-0.16z)| lr 6.61e-04 | 4167.11 ms | 32.4% bf16 MFU | 125283 tok/s step 6077/19560 | loss 3.449030 (-1.28z)| norm 0.2367 (-0.83z)| lr 6.61e-04 | 4175.79 ms | 32.3% bf16 MFU | 125296 tok/s step 6078/19560 | loss 3.517609 (+0.44z)| norm 0.2532 (-0.36z)| lr 6.61e-04 | 4174.15 ms | 32.3% bf16 MFU | 125312 tok/s step 6079/19560 | loss 3.517382 (+0.44z)| norm 0.2798 (+0.39z)| lr 6.61e-04 | 4166.64 ms | 32.4% bf16 MFU | 125338 tok/s step 6080/19560 | loss 3.495785 (-0.10z)| norm 0.2851 (+0.54z)| lr 6.61e-04 | 4167.53 ms | 32.4% bf16 MFU | 125361 tok/s step 6081/19560 | loss 3.416016 (-2.08z)| norm 0.3103 (+1.24z)| lr 6.61e-04 | 4174.28 ms | 32.3% bf16 MFU | 125373 tok/s step 6082/19560 | loss 3.534847 (+0.86z)| norm 0.2653 (-0.04z)| lr 6.61e-04 | 4164.41 ms | 32.4% bf16 MFU | 125399 tok/s step 6083/19560 | loss 3.493318 (-0.18z)| norm 0.2973 (+0.86z)| lr 6.61e-04 | 4171.58 ms | 32.4% bf16 MFU | 125413 tok/s step 
6084/19560 | loss 3.458478 (-1.04z)| norm 0.3470 (+2.20z)| lr 6.61e-04 | 4165.66 ms | 32.4% bf16 MFU | 125435 tok/s step 6085/19560 | loss 3.502678 (+0.05z)| norm 0.3187 (+1.40z)| lr 6.61e-04 | 4166.64 ms | 32.4% bf16 MFU | 125455 tok/s step 6086/19560 | loss 3.472884 (-0.68z)| norm 0.2738 (+0.16z)| lr 6.61e-04 | 4165.40 ms | 32.4% bf16 MFU | 125476 tok/s step 6087/19560 | loss 3.478050 (-0.54z)| norm 0.3124 (+1.21z)| lr 6.61e-04 | 4163.75 ms | 32.4% bf16 MFU | 125498 tok/s step 6088/19560 | loss 3.516857 (+0.42z)| norm 0.2923 (+0.66z)| lr 6.61e-04 | 4172.86 ms | 32.4% bf16 MFU | 125505 tok/s step 6089/19560 | loss 3.535740 (+0.88z)| norm 0.2697 (+0.04z)| lr 6.61e-04 | 4171.13 ms | 32.4% bf16 MFU | 125515 tok/s step 6090/19560 | loss 3.469573 (-0.75z)| norm 0.2629 (-0.14z)| lr 6.61e-04 | 4159.45 ms | 32.5% bf16 MFU | 125541 tok/s step 6091/19560 | loss 3.541320 (+1.07z)| norm 0.2279 (-1.08z)| lr 6.60e-04 | 4169.49 ms | 32.4% bf16 MFU | 125551 tok/s step 6092/19560 | loss 3.472136 (-0.68z)| norm 0.2545 (-0.36z)| lr 6.60e-04 | 4176.59 ms | 32.3% bf16 MFU | 125550 tok/s step 6093/19560 | loss 3.501728 (+0.07z)| norm 0.2503 (-0.47z)| lr 6.60e-04 | 4180.54 ms | 32.3% bf16 MFU | 125543 tok/s step 6094/19560 | loss 3.548694 (+1.26z)| norm 0.2344 (-0.89z)| lr 6.60e-04 | 4168.98 ms | 32.4% bf16 MFU | 125554 tok/s step 6095/19560 | loss 3.466403 (-0.82z)| norm 0.2553 (-0.32z)| lr 6.60e-04 | 4169.32 ms | 32.4% bf16 MFU | 125564 tok/s step 6096/19560 | loss 3.488696 (-0.26z)| norm 0.2417 (-0.69z)| lr 6.60e-04 | 4165.59 ms | 32.4% bf16 MFU | 125579 tok/s step 6097/19560 | loss 3.412998 (-2.13z)| norm 0.2381 (-0.77z)| lr 6.60e-04 | 4168.68 ms | 32.4% bf16 MFU | 125588 tok/s step 6098/19560 | loss 3.510296 (+0.28z)| norm 0.2379 (-0.78z)| lr 6.60e-04 | 4163.21 ms | 32.4% bf16 MFU | 125606 tok/s step 6099/19560 | loss 3.509756 (+0.26z)| norm 0.2744 (+0.21z)| lr 6.60e-04 | 4170.42 ms | 32.4% bf16 MFU | 125611 tok/s step 6100/19560 | loss 3.503401 (+0.09z)| norm 0.2465 (-0.54z)| lr 
6.60e-04 | 4172.33 ms | 32.4% bf16 MFU | 125613 tok/s step 6101/19560 | loss 3.444896 (-1.39z)| norm 0.2586 (-0.21z)| lr 6.60e-04 | 4179.88 ms | 32.3% bf16 MFU | 125604 tok/s step 6102/19560 | loss 3.568971 (+1.78z)| norm 0.2641 (-0.06z)| lr 6.60e-04 | 4166.53 ms | 32.4% bf16 MFU | 125616 tok/s step 6103/19560 | loss 3.481533 (-0.44z)| norm 0.2603 (-0.17z)| lr 6.60e-04 | 4162.04 ms | 32.4% bf16 MFU | 125633 tok/s step 6104/19560 | loss 3.454985 (-1.11z)| norm 0.2564 (-0.26z)| lr 6.60e-04 | 4180.75 ms | 32.3% bf16 MFU | 125622 tok/s step 6105/19560 | loss 3.499109 (+0.02z)| norm 0.2464 (-0.53z)| lr 6.60e-04 | 4163.36 ms | 32.4% bf16 MFU | 125637 tok/s step 6106/19560 | loss 3.488885 (-0.24z)| norm 0.2821 (+0.44z)| lr 6.60e-04 | 4213.09 ms | 32.0% bf16 MFU | 125578 tok/s step 6107/19560 | loss 3.473643 (-0.62z)| norm 0.3441 (+2.08z)| lr 6.60e-04 | 4164.77 ms | 32.4% bf16 MFU | 125593 tok/s step 6108/19560 | loss 3.474789 (-0.58z)| norm 0.3137 (+1.24z)| lr 6.60e-04 | 4171.00 ms | 32.4% bf16 MFU | 125598 tok/s step 6109/19560 | loss 3.529832 (+0.83z)| norm 0.2740 (+0.18z)| lr 6.60e-04 | 4169.35 ms | 32.4% bf16 MFU | 125606 tok/s step 6110/19560 | loss 3.445051 (-1.35z)| norm 0.2568 (-0.28z)| lr 6.60e-04 | 4180.52 ms | 32.3% bf16 MFU | 125596 tok/s step 6111/19560 | loss 3.475543 (-0.55z)| norm 0.2728 (+0.15z)| lr 6.59e-04 | 4167.23 ms | 32.4% bf16 MFU | 125607 tok/s step 6112/19560 | loss 3.482804 (-0.37z)| norm 0.2422 (-0.66z)| lr 6.59e-04 | 4166.56 ms | 32.4% bf16 MFU | 125618 tok/s step 6113/19560 | loss 3.454110 (-1.10z)| norm 0.2482 (-0.51z)| lr 6.59e-04 | 4167.74 ms | 32.4% bf16 MFU | 125627 tok/s step 6114/19560 | loss 3.463413 (-0.85z)| norm 0.2432 (-0.64z)| lr 6.59e-04 | 4164.91 ms | 32.4% bf16 MFU | 125640 tok/s step 6115/19560 | loss 3.481114 (-0.38z)| norm 0.2533 (-0.36z)| lr 6.59e-04 | 4172.37 ms | 32.4% bf16 MFU | 125641 tok/s step 6116/19560 | loss 3.502449 (+0.18z)| norm 0.2375 (-0.78z)| lr 6.59e-04 | 4170.24 ms | 32.4% bf16 MFU | 125645 tok/s step 
6117/19560 | loss 3.493090 (-0.06z)| norm 0.2500 (-0.45z)| lr 6.59e-04 | 4157.28 ms | 32.5% bf16 MFU | 125668 tok/s step 6118/19560 | loss 3.505428 (+0.25z)| norm 0.2896 (+0.59z)| lr 6.59e-04 | 4163.87 ms | 32.4% bf16 MFU | 125680 tok/s step 6119/19560 | loss 3.502774 (+0.19z)| norm 0.2624 (-0.14z)| lr 6.59e-04 | 4178.46 ms | 32.3% bf16 MFU | 125670 tok/s step 6120/19560 | loss 3.480827 (-0.39z)| norm 0.2770 (+0.25z)| lr 6.59e-04 | 4170.55 ms | 32.4% bf16 MFU | 125672 tok/s step 6121/19560 | loss 3.503699 (+0.20z)| norm 0.2748 (+0.19z)| lr 6.59e-04 | 4167.14 ms | 32.4% bf16 MFU | 125679 tok/s step 6122/19560 | loss 3.483685 (-0.31z)| norm 0.2478 (-0.53z)| lr 6.59e-04 | 4166.33 ms | 32.4% bf16 MFU | 125687 tok/s step 6123/19560 | loss 3.491318 (-0.10z)| norm 0.2542 (-0.36z)| lr 6.59e-04 | 4169.44 ms | 32.4% bf16 MFU | 125690 tok/s step 6124/19560 | loss 3.473175 (-0.57z)| norm 0.2522 (-0.41z)| lr 6.59e-04 | 4170.14 ms | 32.4% bf16 MFU | 125692 tok/s step 6125/19560 | loss 3.513884 (+0.49z)| norm 0.2476 (-0.53z)| lr 6.59e-04 | 4164.62 ms | 32.4% bf16 MFU | 125702 tok/s step 6126/19560 | loss 3.507249 (+0.33z)| norm 0.2861 (+0.49z)| lr 6.59e-04 | 4176.16 ms | 32.3% bf16 MFU | 125694 tok/s step 6127/19560 | loss 3.446682 (-1.28z)| norm 0.2667 (-0.03z)| lr 6.59e-04 | 4178.68 ms | 32.3% bf16 MFU | 125683 tok/s step 6128/19560 | loss 3.490849 (-0.10z)| norm 0.2646 (-0.08z)| lr 6.59e-04 | 4170.84 ms | 32.4% bf16 MFU | 125684 tok/s step 6129/19560 | loss 3.485840 (-0.25z)| norm 0.2646 (-0.08z)| lr 6.59e-04 | 4167.25 ms | 32.4% bf16 MFU | 125690 tok/s step 6130/19560 | loss 3.451477 (-1.16z)| norm 0.2483 (-0.50z)| lr 6.58e-04 | 4166.22 ms | 32.4% bf16 MFU | 125698 tok/s step 6131/19560 | loss 3.561560 (+1.78z)| norm 0.2657 (-0.04z)| lr 6.58e-04 | 4166.05 ms | 32.4% bf16 MFU | 125705 tok/s step 6132/19560 | loss 3.455614 (-1.04z)| norm 0.2458 (-0.56z)| lr 6.58e-04 | 4172.93 ms | 32.4% bf16 MFU | 125702 tok/s step 6133/19560 | loss 3.473980 (-0.54z)| norm 0.2517 (-0.40z)| lr 
6.58e-04 | 4166.54 ms | 32.4% bf16 MFU | 125708 tok/s step 6134/19560 | loss 3.470662 (-0.62z)| norm 0.2443 (-0.59z)| lr 6.58e-04 | 4167.79 ms | 32.4% bf16 MFU | 125713 tok/s step 6135/19560 | loss 3.462966 (-0.81z)| norm 0.2465 (-0.52z)| lr 6.58e-04 | 4178.16 ms | 32.3% bf16 MFU | 125701 tok/s step 6136/19560 | loss 3.509827 (+0.43z)| norm 0.2468 (-0.50z)| lr 6.58e-04 | 4175.32 ms | 32.3% bf16 MFU | 125695 tok/s step 6137/19560 | loss 3.521960 (+0.74z)| norm 0.2328 (-0.87z)| lr 6.58e-04 | 4173.58 ms | 32.4% bf16 MFU | 125691 tok/s step 6138/19560 | loss 3.455348 (-1.02z)| norm 0.2378 (-0.74z)| lr 6.58e-04 | 4168.88 ms | 32.4% bf16 MFU | 125695 tok/s step 6139/19560 | loss 3.488101 (-0.15z)| norm 0.2649 (-0.02z)| lr 6.58e-04 | 4170.81 ms | 32.4% bf16 MFU | 125695 tok/s step 6140/19560 | loss 3.560723 (+1.76z)| norm 0.2500 (-0.41z)| lr 6.58e-04 | 4166.67 ms | 32.4% bf16 MFU | 125702 tok/s step 6141/19560 | loss 3.458594 (-0.92z)| norm 0.2611 (-0.11z)| lr 6.58e-04 | 4168.80 ms | 32.4% bf16 MFU | 125705 tok/s step 6142/19560 | loss 3.487721 (-0.15z)| norm 0.2937 (+0.76z)| lr 6.58e-04 | 4168.32 ms | 32.4% bf16 MFU | 125709 tok/s step 6143/19560 | loss 3.465270 (-0.74z)| norm 0.2768 (+0.30z)| lr 6.58e-04 | 4164.74 ms | 32.4% bf16 MFU | 125718 tok/s step 6144/19560 | loss 3.474629 (-0.48z)| norm 0.2539 (-0.31z)| lr 6.58e-04 | 4159.56 ms | 32.5% bf16 MFU | 125734 tok/s step 6145/19560 | loss 3.537622 (+1.17z)| norm 0.2451 (-0.55z)| lr 6.58e-04 | 4171.93 ms | 32.4% bf16 MFU | 125731 tok/s step 6146/19560 | loss 3.450375 (-1.11z)| norm 0.2460 (-0.52z)| lr 6.58e-04 | 4167.76 ms | 32.4% bf16 MFU | 125734 tok/s step 6147/19560 | loss 3.517045 (+0.62z)| norm 0.2618 (-0.10z)| lr 6.58e-04 | 4183.26 ms | 32.3% bf16 MFU | 125714 tok/s step 6148/19560 | loss 3.496005 (+0.06z)| norm 0.2474 (-0.48z)| lr 6.58e-04 | 4172.20 ms | 32.4% bf16 MFU | 125711 tok/s step 6149/19560 | loss 3.510091 (+0.43z)| norm 0.2380 (-0.73z)| lr 6.58e-04 | 4175.49 ms | 32.3% bf16 MFU | 125704 tok/s step 
6150/19560 | loss 3.450720 (-1.13z)| norm 0.2492 (-0.43z)| lr 6.57e-04 | 4166.27 ms | 32.4% bf16 MFU | 125711 tok/s step 6151/19560 | loss 3.476388 (-0.44z)| norm 0.2790 (+0.36z)| lr 6.57e-04 | 4168.66 ms | 32.4% bf16 MFU | 125714 tok/s step 6152/19560 | loss 3.401692 (-2.34z)| norm 0.2732 (+0.21z)| lr 6.57e-04 | 4176.40 ms | 32.3% bf16 MFU | 125705 tok/s step 6153/19560 | loss 3.492550 (-0.00z)| norm 0.2427 (-0.61z)| lr 6.57e-04 | 4167.63 ms | 32.4% bf16 MFU | 125709 tok/s step 6154/19560 | loss 3.504489 (+0.30z)| norm 0.2808 (+0.40z)| lr 6.57e-04 | 4162.39 ms | 32.4% bf16 MFU | 125722 tok/s step 6155/19560 | loss 3.528296 (+0.94z)| norm 0.2921 (+0.69z)| lr 6.57e-04 | 4164.67 ms | 32.4% bf16 MFU | 125730 tok/s step 6156/19560 | loss 3.531624 (+1.03z)| norm 0.2756 (+0.48z)| lr 6.57e-04 | 4167.72 ms | 32.4% bf16 MFU | 125734 tok/s step 6157/19560 | loss 3.554195 (+1.61z)| norm 0.2960 (+1.43z)| lr 6.57e-04 | 4170.62 ms | 32.4% bf16 MFU | 125732 tok/s step 6158/19560 | loss 3.492571 (+0.00z)| norm 0.3147 (+2.32z)| lr 6.57e-04 | 4164.03 ms | 32.4% bf16 MFU | 125741 tok/s step 6159/19560 | loss 3.453005 (-1.03z)| norm 0.2624 (-0.02z)| lr 6.57e-04 | 4172.02 ms | 32.4% bf16 MFU | 125738 tok/s step 6160/19560 | loss 3.609090 (+2.93z)| norm 0.3147 (+2.31z)| lr 6.57e-04 | 4172.37 ms | 32.4% bf16 MFU | 125734 tok/s step 6161/19560 | loss 3.522757 (+0.73z)| norm 0.2613 (-0.06z)| lr 6.57e-04 | 4170.67 ms | 32.4% bf16 MFU | 125732 tok/s step 6162/19560 | loss 3.485279 (-0.21z)| norm 0.2801 (+0.78z)| lr 6.57e-04 | 4175.47 ms | 32.3% bf16 MFU | 125724 tok/s step 6163/19560 | loss 3.544753 (+1.28z)| norm 0.2692 (+0.29z)| lr 6.57e-04 | 4159.57 ms | 32.5% bf16 MFU | 125740 tok/s step 6164/19560 | loss 3.502372 (+0.20z)| norm 0.2440 (-0.83z)| lr 6.57e-04 | 4162.42 ms | 32.4% bf16 MFU | 125751 tok/s step 6165/19560 | loss 3.461210 (-0.83z)| norm 0.2458 (-0.73z)| lr 6.57e-04 | 4208.01 ms | 32.1% bf16 MFU | 125693 tok/s step 6166/19560 | loss 3.462482 (-0.80z)| norm 0.2259 (-1.60z)| lr 
6.57e-04 | 4171.03 ms | 32.4% bf16 MFU | 125693 tok/s step 6167/19560 | loss 3.483537 (-0.26z)| norm 0.2274 (-1.51z)| lr 6.57e-04 | 4156.21 ms | 32.5% bf16 MFU | 125716 tok/s step 6168/19560 | loss 3.510730 (+0.46z)| norm 0.2406 (-0.91z)| lr 6.57e-04 | 4173.14 ms | 32.4% bf16 MFU | 125712 tok/s step 6169/19560 | loss 3.487523 (-0.15z)| norm 0.2711 (+0.43z)| lr 6.56e-04 | 4163.47 ms | 32.4% bf16 MFU | 125722 tok/s step 6170/19560 | loss 3.467045 (-0.67z)| norm 0.2761 (+0.64z)| lr 6.56e-04 | 4169.18 ms | 32.4% bf16 MFU | 125724 tok/s step 6171/19560 | loss 3.465348 (-0.71z)| norm 0.2486 (-0.56z)| lr 6.56e-04 | 4163.76 ms | 32.4% bf16 MFU | 125733 tok/s step 6172/19560 | loss 3.445816 (-1.23z)| norm 0.2718 (+0.46z)| lr 6.56e-04 | 4173.09 ms | 32.4% bf16 MFU | 125729 tok/s step 6173/19560 | loss 3.434091 (-1.54z)| norm 0.2612 (-0.01z)| lr 6.56e-04 | 4168.29 ms | 32.4% bf16 MFU | 125731 tok/s step 6174/19560 | loss 3.529830 (+0.97z)| norm 0.2466 (-0.65z)| lr 6.56e-04 | 4179.71 ms | 32.3% bf16 MFU | 125716 tok/s step 6175/19560 | loss 3.560818 (+1.75z)| norm 0.2600 (-0.07z)| lr 6.56e-04 | 4173.30 ms | 32.4% bf16 MFU | 125712 tok/s step 6176/19560 | loss 3.475711 (-0.47z)| norm 0.2305 (-1.36z)| lr 6.56e-04 | 4159.05 ms | 32.5% bf16 MFU | 125729 tok/s step 6177/19560 | loss 3.518622 (+0.68z)| norm 0.2639 (+0.11z)| lr 6.56e-04 | 4165.50 ms | 32.4% bf16 MFU | 125736 tok/s step 6178/19560 | loss 3.504024 (+0.29z)| norm 0.2567 (-0.21z)| lr 6.56e-04 | 4166.68 ms | 32.4% bf16 MFU | 125741 tok/s step 6179/19560 | loss 3.490676 (-0.06z)| norm 0.2659 (+0.20z)| lr 6.56e-04 | 4165.67 ms | 32.4% bf16 MFU | 125747 tok/s step 6180/19560 | loss 3.457677 (-0.93z)| norm 0.2263 (-1.56z)| lr 6.56e-04 | 4170.95 ms | 32.4% bf16 MFU | 125744 tok/s step 6181/19560 | loss 3.520957 (+0.80z)| norm 0.2725 (+0.49z)| lr 6.56e-04 | 4162.18 ms | 32.4% bf16 MFU | 125755 tok/s step 6182/19560 | loss 3.500132 (+0.22z)| norm 0.2661 (+0.20z)| lr 6.56e-04 | 4181.65 ms | 32.3% bf16 MFU | 125737 tok/s step 
6183/19560 | loss 3.514273 (+0.60z)| norm 0.2405 (-0.94z)| lr 6.56e-04 | 4163.30 ms | 32.4% bf16 MFU | 125746 tok/s step 6184/19560 | loss 3.455462 (-1.02z)| norm 0.2737 (+0.54z)| lr 6.56e-04 | 4164.87 ms | 32.4% bf16 MFU | 125753 tok/s step 6185/19560 | loss 3.426988 (-1.77z)| norm 0.2627 (+0.04z)| lr 6.56e-04 | 4168.79 ms | 32.4% bf16 MFU | 125754 tok/s step 6186/19560 | loss 3.476649 (-0.43z)| norm 0.2566 (-0.24z)| lr 6.56e-04 | 4172.86 ms | 32.4% bf16 MFU | 125748 tok/s step 6187/19560 | loss 3.513179 (+0.62z)| norm 0.2665 (+0.20z)| lr 6.56e-04 | 4163.07 ms | 32.4% bf16 MFU | 125758 tok/s step 6188/19560 | loss 3.489664 (-0.05z)| norm 0.2480 (-0.62z)| lr 6.55e-04 | 4159.97 ms | 32.5% bf16 MFU | 125771 tok/s step 6189/19560 | loss 3.496567 (+0.15z)| norm 0.2267 (-1.55z)| lr 6.55e-04 | 4163.93 ms | 32.4% bf16 MFU | 125778 tok/s step 6190/19560 | loss 3.579165 (+2.46z)| norm 0.2703 (+0.38z)| lr 6.55e-04 | 4171.10 ms | 32.4% bf16 MFU | 125774 tok/s step 6191/19560 | loss 3.485874 (-0.16z)| norm 0.2565 (-0.24z)| lr 6.55e-04 | 4162.46 ms | 32.4% bf16 MFU | 125783 tok/s step 6192/19560 | loss 3.468531 (-0.64z)| norm 0.2452 (-0.73z)| lr 6.55e-04 | 4167.85 ms | 32.4% bf16 MFU | 125784 tok/s step 6193/19560 | loss 3.566191 (+2.08z)| norm 0.2604 (-0.05z)| lr 6.55e-04 | 4166.77 ms | 32.4% bf16 MFU | 125786 tok/s step 6194/19560 | loss 3.470951 (-0.57z)| norm 0.2676 (+0.27z)| lr 6.55e-04 | 4182.56 ms | 32.3% bf16 MFU | 125764 tok/s step 6195/19560 | loss 3.464662 (-0.73z)| norm 0.2617 (-0.00z)| lr 6.55e-04 | 4163.85 ms | 32.4% bf16 MFU | 125772 tok/s step 6196/19560 | loss 3.506720 (+0.44z)| norm 0.2760 (+0.63z)| lr 6.55e-04 | 4169.77 ms | 32.4% bf16 MFU | 125770 tok/s step 6197/19560 | loss 3.438635 (-1.44z)| norm 0.2880 (+1.14z)| lr 6.55e-04 | 4165.74 ms | 32.4% bf16 MFU | 125774 tok/s step 6198/19560 | loss 3.446226 (-1.21z)| norm 0.2769 (+0.64z)| lr 6.55e-04 | 4168.24 ms | 32.4% bf16 MFU | 125775 tok/s step 6199/19560 | loss 3.480601 (-0.25z)| norm 0.2398 (-1.00z)| lr 
6.55e-04 | 4161.06 ms | 32.4% bf16 MFU | 125786 tok/s step 6200/19560 | loss 3.512272 (+0.61z)| norm 0.2810 (+0.82z)| lr 6.55e-04 | 4159.61 ms | 32.5% bf16 MFU | 125799 tok/s step 6201/19560 | loss 3.550808 (+1.67z)| norm 0.2709 (+0.36z)| lr 6.55e-04 | 4170.79 ms | 32.4% bf16 MFU | 125794 tok/s step 6202/19560 | loss 3.483271 (-0.19z)| norm 0.2631 (+0.01z)| lr 6.55e-04 | 4168.99 ms | 32.4% bf16 MFU | 125792 tok/s step 6203/19560 | loss 3.443372 (-1.28z)| norm 0.2467 (-0.72z)| lr 6.55e-04 | 4158.54 ms | 32.5% bf16 MFU | 125806 tok/s step 6204/19560 | loss 3.466898 (-0.66z)| norm 0.2583 (-0.20z)| lr 6.55e-04 | 4174.43 ms | 32.3% bf16 MFU | 125796 tok/s step 6205/19560 | loss 3.563869 (+2.02z)| norm 0.2423 (-0.92z)| lr 6.55e-04 | 4171.69 ms | 32.4% bf16 MFU | 125790 tok/s step 6206/19560 | loss 3.488466 (-0.08z)| norm 0.2504 (-0.56z)| lr 6.55e-04 | 4173.43 ms | 32.4% bf16 MFU | 125782 tok/s step 6207/19560 | loss 3.462162 (-0.80z)| norm 0.2480 (-0.65z)| lr 6.54e-04 | 4158.07 ms | 32.5% bf16 MFU | 125797 tok/s step 6208/19560 | loss 3.513274 (+0.62z)| norm 0.2656 (+0.14z)| lr 6.54e-04 | 4555.04 ms | 29.6% bf16 MFU | 125262 tok/s step 6209/19560 | loss 3.543828 (+1.46z)| norm 0.3029 (+1.82z)| lr 6.54e-04 | 4671.71 ms | 28.9% bf16 MFU | 124610 tok/s step 6210/19560 | loss 3.499437 (+0.22z)| norm 0.2583 (-0.18z)| lr 6.54e-04 | 4676.51 ms | 28.9% bf16 MFU | 123985 tok/s step 6211/19560 | loss 3.523243 (+0.88z)| norm 0.2614 (-0.03z)| lr 6.54e-04 | 4598.73 ms | 29.4% bf16 MFU | 123486 tok/s step 6212/19560 | loss 3.433565 (-1.62z)| norm 0.2475 (-0.67z)| lr 6.54e-04 | 4318.32 ms | 31.3% bf16 MFU | 123383 tok/s step 6213/19560 | loss 3.501189 (+0.27z)| norm 0.2295 (-1.54z)| lr 6.54e-04 | 4167.42 ms | 32.4% bf16 MFU | 123504 tok/s step 6214/19560 | loss 3.538738 (+1.29z)| norm 0.2658 (+0.26z)| lr 6.54e-04 | 4261.69 ms | 31.7% bf16 MFU | 123480 tok/s step 6215/19560 | loss 3.536099 (+1.20z)| norm 0.2342 (-1.30z)| lr 6.54e-04 | 4266.75 ms | 31.6% bf16 MFU | 123450 tok/s step 
6216/19560 | loss 3.590589 (+2.62z)| norm 0.2461 (-0.68z)| lr 6.54e-04 | 4436.21 ms | 30.4% bf16 MFU | 123186 tok/s step 6217/19560 | loss 3.537384 (+1.19z)| norm 0.2497 (-0.50z)| lr 6.54e-04 | 4229.63 ms | 31.9% bf16 MFU | 123225 tok/s step 6218/19560 | loss 3.537933 (+1.19z)| norm 0.2610 (+0.08z)| lr 6.54e-04 | 4243.10 ms | 31.8% bf16 MFU | 123242 tok/s step 6219/19560 | loss 3.518659 (+0.68z)| norm 0.2526 (-0.36z)| lr 6.54e-04 | 4238.15 ms | 31.9% bf16 MFU | 123265 tok/s step 6220/19560 | loss 3.505378 (+0.31z)| norm 0.2640 (+0.22z)| lr 6.54e-04 | 4223.47 ms | 32.0% bf16 MFU | 123309 tok/s step 6221/19560 | loss 3.541742 (+1.28z)| norm 0.2213 (-1.94z)| lr 6.54e-04 | 4158.32 ms | 32.5% bf16 MFU | 123447 tok/s step 6222/19560 | loss 3.507513 (+0.37z)| norm 0.2563 (-0.17z)| lr 6.54e-04 | 4267.73 ms | 31.6% bf16 MFU | 123417 tok/s step 6223/19560 | loss 3.519654 (+0.69z)| norm 0.2791 (+0.98z)| lr 6.54e-04 | 4163.39 ms | 32.4% bf16 MFU | 123543 tok/s step 6224/19560 | loss 3.565875 (+1.89z)| norm 0.3090 (+2.43z)| lr 6.54e-04 | 4198.27 ms | 32.2% bf16 MFU | 123610 tok/s step 6225/19560 | loss 3.511813 (+0.44z)| norm 0.2829 (+1.11z)| lr 6.54e-04 | 4222.40 ms | 32.0% bf16 MFU | 123638 tok/s step 6226/19560 | loss 3.483573 (-0.32z)| norm 0.2505 (-0.51z)| lr 6.53e-04 | 4234.71 ms | 31.9% bf16 MFU | 123646 tok/s step 6227/19560 | loss 3.490984 (-0.12z)| norm 0.2647 (+0.20z)| lr 6.53e-04 | 4175.79 ms | 32.3% bf16 MFU | 123742 tok/s step 6228/19560 | loss 3.480117 (-0.41z)| norm 0.2750 (+0.71z)| lr 6.53e-04 | 4272.21 ms | 31.6% bf16 MFU | 123691 tok/s step 6229/19560 | loss 3.497111 (+0.05z)| norm 0.2577 (-0.16z)| lr 6.53e-04 | 4257.73 ms | 31.7% bf16 MFU | 123663 tok/s step 6230/19560 | loss 3.483122 (-0.32z)| norm 0.2399 (-1.04z)| lr 6.53e-04 | 4218.19 ms | 32.0% bf16 MFU | 123694 tok/s step 6231/19560 | loss 3.465748 (-0.80z)| norm 0.2529 (-0.39z)| lr 6.53e-04 | 4165.44 ms | 32.4% bf16 MFU | 123803 tok/s step 6232/19560 | loss 3.473938 (-0.58z)| norm 0.2544 (-0.31z)| lr 
6.53e-04 | 4170.51 ms | 32.4% bf16 MFU | 123899 tok/s step 6233/19560 | loss 3.509395 (+0.40z)| norm 0.2560 (-0.23z)| lr 6.53e-04 | 4224.83 ms | 32.0% bf16 MFU | 123908 tok/s step 6234/19560 | loss 3.532277 (+1.03z)| norm 0.2585 (-0.10z)| lr 6.53e-04 | 4163.81 ms | 32.4% bf16 MFU | 124009 tok/s step 6235/19560 | loss 3.518793 (+0.64z)| norm 0.2489 (-0.59z)| lr 6.53e-04 | 4252.20 ms | 31.8% bf16 MFU | 123973 tok/s step 6236/19560 | loss 3.439967 (-1.52z)| norm 0.2623 (+0.16z)| lr 6.53e-04 | 4169.65 ms | 32.4% bf16 MFU | 124062 tok/s step 6237/19560 | loss 3.525177 (+0.82z)| norm 0.2185 (-2.22z)| lr 6.53e-04 | 4156.96 ms | 32.5% bf16 MFU | 124165 tok/s step 6238/19560 | loss 3.541761 (+1.26z)| norm 0.2922 (+1.79z)| lr 6.53e-04 | 4171.46 ms | 32.4% bf16 MFU | 124241 tok/s step 6239/19560 | loss 3.508545 (+0.34z)| norm 0.2606 (+0.08z)| lr 6.53e-04 | 4168.75 ms | 32.4% bf16 MFU | 124317 tok/s step 6240/19560 | loss 3.495942 (-0.01z)| norm 0.2284 (-1.65z)| lr 6.53e-04 | 4164.68 ms | 32.4% bf16 MFU | 124396 tok/s step 6241/19560 | loss 3.534675 (+1.04z)| norm 0.2774 (+0.98z)| lr 6.53e-04 | 4163.30 ms | 32.4% bf16 MFU | 124472 tok/s step 6242/19560 | loss 3.471758 (-0.70z)| norm 0.2567 (-0.15z)| lr 6.53e-04 | 4167.36 ms | 32.4% bf16 MFU | 124539 tok/s step 6243/19560 | loss 3.575263 (+2.10z)| norm 0.2356 (-1.27z)| lr 6.53e-04 | 4165.89 ms | 32.4% bf16 MFU | 124605 tok/s step 6244/19560 | loss 3.534353 (+0.98z)| norm 0.2628 (+0.18z)| lr 6.53e-04 | 4165.16 ms | 32.4% bf16 MFU | 124668 tok/s step 6245/19560 | loss 3.497590 (-0.01z)| norm 0.2592 (-0.02z)| lr 6.52e-04 | 4161.75 ms | 32.4% bf16 MFU | 124734 tok/s step 6246/19560 | loss 3.549788 (+1.38z)| norm 0.2893 (+1.61z)| lr 6.52e-04 | 4156.35 ms | 32.5% bf16 MFU | 124804 tok/s step 6247/19560 | loss 3.541469 (+1.14z)| norm 0.2565 (-0.16z)| lr 6.52e-04 | 4174.72 ms | 32.3% bf16 MFU | 124843 tok/s step 6248/19560 | loss 3.550770 (+1.37z)| norm 0.2491 (-0.55z)| lr 6.52e-04 | 4253.45 ms | 31.7% bf16 MFU | 124764 tok/s step 
6249/19560 | loss 3.497063 (-0.06z)| norm 0.2719 (+0.69z)| lr 6.52e-04 | 4168.68 ms | 32.4% bf16 MFU | 124814 tok/s step 6250/19560 | loss 3.469687 (-0.78z)| norm 0.2493 (-0.54z)| lr 6.52e-04 | 4168.54 ms | 32.4% bf16 MFU | 124862 tok/s val loss 3.487836
HellaSwag: 2828/10042 = 0.281617 step 6251/19560 | loss 3.524477 (+0.66z)| norm 0.2730 (+0.74z)| lr 6.52e-04 | 4167.58 ms | 32.4% bf16 MFU | 124909 tok/s step 6252/19560 | loss 3.601072 (+2.60z)| norm 0.2664 (+0.37z)| lr 6.52e-04 | 4499.14 ms | 30.0% bf16 MFU | 124490 tok/s step 6253/19560 | loss 3.547642 (+1.21z)| norm 0.2792 (+1.05z)| lr 6.52e-04 | 4230.56 ms | 31.9% bf16 MFU | 124462 tok/s step 6254/19560 | loss 3.462753 (-0.96z)| norm 0.2834 (+1.28z)| lr 6.52e-04 | 4162.70 ms | 32.4% bf16 MFU | 124537 tok/s step 6255/19560 | loss 3.505776 (+0.13z)| norm 0.3379 (+3.94z)| lr 6.52e-04 | 4196.83 ms | 32.2% bf16 MFU | 124556 tok/s step 6256/19560 | loss 3.519055 (+0.47z)| norm 0.3291 (+3.31z)| lr 6.52e-04 | 4197.55 ms | 32.2% bf16 MFU | 124573 tok/s step 6257/19560 | loss 3.511083 (+0.25z)| norm 0.2913 (+1.46z)| lr 6.52e-04 | 4339.39 ms | 31.1% bf16 MFU | 124386 tok/s step 6258/19560 | loss 3.464149 (-0.96z)| norm 0.2828 (+1.03z)| lr 6.52e-04 | 4256.12 ms | 31.7% bf16 MFU | 124326 tok/s step 6259/19560 | loss 3.530180 (+0.76z)| norm 0.2691 (+0.38z)| lr 6.52e-04 | 4159.33 ms | 32.5% bf16 MFU | 124412 tok/s step 6260/19560 | loss 3.512653 (+0.29z)| norm 0.2530 (-0.40z)| lr 6.52e-04 | 4158.02 ms | 32.5% bf16 MFU | 124496 tok/s step 6261/19560 | loss 3.504610 (+0.08z)| norm 0.2637 (+0.11z)| lr 6.52e-04 | 4168.25 ms | 32.4% bf16 MFU | 124560 tok/s step 6262/19560 | loss 3.535585 (+0.87z)| norm 0.2373 (-1.15z)| lr 6.52e-04 | 4159.60 ms | 32.5% bf16 MFU | 124634 tok/s step 6263/19560 | loss 3.479986 (-0.59z)| norm 0.2540 (-0.35z)| lr 6.52e-04 | 4280.96 ms | 31.5% bf16 MFU | 124526 tok/s step 6264/19560 | loss 3.467915 (-0.89z)| norm 0.2300 (-1.49z)| lr 6.52e-04 | 4178.24 ms | 32.3% bf16 MFU | 124574 tok/s step 6265/19560 | loss 3.480633 (-0.55z)| norm 0.2245 (-1.74z)| lr 6.51e-04 | 
4164.90 ms | 32.4% bf16 MFU | 124639 tok/s step 6266/19560 | loss 3.503880 (+0.05z)| norm 0.2495 (-0.56z)| lr 6.51e-04 | 4167.63 ms | 32.4% bf16 MFU | 124697 tok/s step 6267/19560 | loss 3.517157 (+0.39z)| norm 0.2248 (-1.70z)| lr 6.51e-04 | 4215.02 ms | 32.0% bf16 MFU | 124682 tok/s step 6268/19560 | loss 3.485319 (-0.44z)| norm 0.2307 (-1.41z)| lr 6.51e-04 | 4160.31 ms | 32.5% bf16 MFU | 124749 tok/s step 6269/19560 | loss 3.503377 (+0.03z)| norm 0.2479 (-0.60z)| lr 6.51e-04 | 4267.06 ms | 31.6% bf16 MFU | 124655 tok/s step 6270/19560 | loss 3.518555 (+0.43z)| norm 0.2639 (+0.16z)| lr 6.51e-04 | 4163.80 ms | 32.4% bf16 MFU | 124718 tok/s step 6271/19560 | loss 3.535848 (+0.88z)| norm 0.2541 (-0.30z)| lr 6.51e-04 | 4165.26 ms | 32.4% bf16 MFU | 124775 tok/s step 6272/19560 | loss 3.486996 (-0.43z)| norm 0.2607 (+0.01z)| lr 6.51e-04 | 4169.99 ms | 32.4% bf16 MFU | 124823 tok/s step 6273/19560 | loss 3.482876 (-0.53z)| norm 0.2554 (-0.24z)| lr 6.51e-04 | 4163.33 ms | 32.4% bf16 MFU | 124878 tok/s step 6274/19560 | loss 3.518424 (+0.41z)| norm 0.2966 (+1.67z)| lr 6.51e-04 | 4168.98 ms | 32.4% bf16 MFU | 124922 tok/s step 6275/19560 | loss 3.490732 (-0.33z)| norm 0.2457 (-0.70z)| lr 6.51e-04 | 4192.83 ms | 32.2% bf16 MFU | 124928 tok/s step 6276/19560 | loss 3.459435 (-1.16z)| norm 0.2592 (-0.07z)| lr 6.51e-04 | 4222.23 ms | 32.0% bf16 MFU | 124891 tok/s step 6277/19560 | loss 3.457848 (-1.19z)| norm 0.2420 (-0.88z)| lr 6.51e-04 | 4165.43 ms | 32.4% bf16 MFU | 124940 tok/s step 6278/19560 | loss 3.483545 (-0.51z)| norm 0.2610 (+0.00z)| lr 6.51e-04 | 4173.40 ms | 32.4% bf16 MFU | 124974 tok/s step 6279/19560 | loss 3.472381 (-0.81z)| norm 0.2617 (+0.04z)| lr 6.51e-04 | 4160.39 ms | 32.5% bf16 MFU | 125026 tok/s step 6280/19560 | loss 3.496396 (-0.19z)| norm 0.2294 (-1.45z)| lr 6.51e-04 | 4185.39 ms | 32.3% bf16 MFU | 125038 tok/s step 6281/19560 | loss 3.543724 (+1.11z)| norm 0.2692 (+0.40z)| lr 6.51e-04 | 4173.28 ms | 32.4% bf16 MFU | 125068 tok/s step 6282/19560 | 
loss 3.463932 (-1.08z)| norm 0.2786 (+0.84z)| lr 6.51e-04 | 4170.21 ms | 32.4% bf16 MFU | 125100 tok/s step 6283/19560 | loss 3.497809 (-0.14z)| norm 0.2838 (+1.09z)| lr 6.50e-04 | 4167.99 ms | 32.4% bf16 MFU | 125135 tok/s step 6284/19560 | loss 3.514958 (+0.33z)| norm 0.2638 (+0.15z)| lr 6.50e-04 | 4165.43 ms | 32.4% bf16 MFU | 125171 tok/s step 6285/19560 | loss 3.499927 (-0.07z)| norm 0.2357 (-1.15z)| lr 6.50e-04 | 4224.17 ms | 32.0% bf16 MFU | 125119 tok/s step 6286/19560 | loss 3.515491 (+0.36z)| norm 0.2425 (-0.82z)| lr 6.50e-04 | 4166.40 ms | 32.4% bf16 MFU | 125155 tok/s step 6287/19560 | loss 3.497028 (-0.17z)| norm 0.2396 (-0.95z)| lr 6.50e-04 | 4228.53 ms | 31.9% bf16 MFU | 125096 tok/s step 6288/19560 | loss 3.513870 (+0.34z)| norm 0.2679 (+0.45z)| lr 6.50e-04 | 4192.50 ms | 32.2% bf16 MFU | 125094 tok/s step 6289/19560 | loss 3.515109 (+0.38z)| norm 0.2409 (-0.89z)| lr 6.50e-04 | 4158.30 ms | 32.5% bf16 MFU | 125144 tok/s step 6290/19560 | loss 3.491404 (-0.31z)| norm 0.2406 (-0.89z)| lr 6.50e-04 | 4209.48 ms | 32.1% bf16 MFU | 125114 tok/s step 6291/19560 | loss 3.612889 (+3.10z)| norm 0.2676 (+0.46z)| lr 6.50e-04 | 4191.69 ms | 32.2% bf16 MFU | 125112 tok/s step 6292/19560 | loss 3.531350 (+0.80z)| norm 0.2468 (-0.58z)| lr 6.50e-04 | 4169.96 ms | 32.4% bf16 MFU | 125143 tok/s step 6293/19560 | loss 3.514588 (+0.32z)| norm 0.2623 (+0.19z)| lr 6.50e-04 | 4163.58 ms | 32.4% bf16 MFU | 125182 tok/s step 6294/19560 | loss 3.447346 (-1.57z)| norm 0.2632 (+0.22z)| lr 6.50e-04 | 4168.43 ms | 32.4% bf16 MFU | 125212 tok/s step 6295/19560 | loss 3.506348 (+0.08z)| norm 0.2255 (-1.68z)| lr 6.50e-04 | 4163.96 ms | 32.4% bf16 MFU | 125247 tok/s step 6296/19560 | loss 3.499736 (-0.10z)| norm 0.2351 (-1.19z)| lr 6.50e-04 | 4158.46 ms | 32.5% bf16 MFU | 125288 tok/s step 6297/19560 | loss 3.500141 (-0.09z)| norm 0.2192 (-1.95z)| lr 6.50e-04 | 4164.54 ms | 32.4% bf16 MFU | 125318 tok/s step 6298/19560 | loss 3.523928 (+0.56z)| norm 0.3966 (+5.84z)| lr 6.50e-04 | 
4160.38 ms | 32.5% bf16 MFU | 125353 tok/s step 6299/19560 | loss 3.503265 (-0.03z)| norm 0.2884 (+1.21z)| lr 6.50e-04 | 4204.46 ms | 32.1% bf16 MFU | 125321 tok/s step 6300/19560 | loss 3.526477 (+0.62z)| norm 0.2756 (+0.67z)| lr 6.50e-04 | 4161.23 ms | 32.4% bf16 MFU | 125354 tok/s step 6301/19560 | loss 3.542497 (+1.06z)| norm 0.2781 (+0.77z)| lr 6.50e-04 | 4168.34 ms | 32.4% bf16 MFU | 125375 tok/s step 6302/19560 | loss 3.479646 (-0.74z)| norm 0.2849 (+1.04z)| lr 6.49e-04 | 4162.53 ms | 32.4% bf16 MFU | 125404 tok/s step 6303/19560 | loss 3.513523 (+0.25z)| norm 0.2664 (+0.26z)| lr 6.49e-04 | 4171.08 ms | 32.4% bf16 MFU | 125419 tok/s step 6304/19560 | loss 3.510847 (+0.17z)| norm 0.2809 (+0.86z)| lr 6.49e-04 | 4166.92 ms | 32.4% bf16 MFU | 125439 tok/s step 6305/19560 | loss 3.508191 (+0.09z)| norm 0.2270 (-1.40z)| lr 6.49e-04 | 4176.12 ms | 32.3% bf16 MFU | 125444 tok/s step 6306/19560 | loss 3.470287 (-1.01z)| norm 0.2672 (+0.28z)| lr 6.49e-04 | 4175.88 ms | 32.3% bf16 MFU | 125450 tok/s step 6307/19560 | loss 3.531089 (+0.76z)| norm 0.2323 (-1.16z)| lr 6.49e-04 | 4162.02 ms | 32.4% bf16 MFU | 125476 tok/s step 6308/19560 | loss 3.542285 (+1.07z)| norm 0.2867 (+1.09z)| lr 6.49e-04 | 4191.68 ms | 32.2% bf16 MFU | 125456 tok/s step 6309/19560 | loss 3.478616 (-0.78z)| norm 0.2517 (-0.36z)| lr 6.49e-04 | 4167.59 ms | 32.4% bf16 MFU | 125473 tok/s step 6310/19560 | loss 3.565113 (+1.71z)| norm 0.2434 (-0.70z)| lr 6.49e-04 | 4165.63 ms | 32.4% bf16 MFU | 125492 tok/s step 6311/19560 | loss 3.500359 (-0.16z)| norm 0.2880 (+1.14z)| lr 6.49e-04 | 4162.01 ms | 32.4% bf16 MFU | 125516 tok/s step 6312/19560 | loss 3.461221 (-1.29z)| norm 0.2838 (+0.96z)| lr 6.49e-04 | 4283.29 ms | 31.5% bf16 MFU | 125361 tok/s step 6313/19560 | loss 3.487051 (-0.57z)| norm 0.2677 (+0.29z)| lr 6.49e-04 | 4164.01 ms | 32.4% bf16 MFU | 125388 tok/s step 6314/19560 | loss 3.470426 (-1.06z)| norm 0.2527 (-0.33z)| lr 6.49e-04 | 4266.34 ms | 31.6% bf16 MFU | 125263 tok/s step 6315/19560 | 
loss 3.499122 (-0.21z)| norm 0.2740 (+0.55z)| lr 6.49e-04 | 4427.41 ms | 30.5% bf16 MFU | 124921 tok/s step 6316/19560 | loss 3.491771 (-0.43z)| norm 0.2460 (-0.61z)| lr 6.49e-04 | 4163.16 ms | 32.4% bf16 MFU | 124972 tok/s step 6317/19560 | loss 3.425741 (-2.31z)| norm 0.2662 (+0.21z)| lr 6.49e-04 | 4181.01 ms | 32.3% bf16 MFU | 124993 tok/s step 6318/19560 | loss 3.490236 (-0.43z)| norm 0.2824 (+0.89z)| lr 6.49e-04 | 4361.77 ms | 31.0% bf16 MFU | 124753 tok/s step 6319/19560 | loss 3.469540 (-1.03z)| norm 0.2500 (-0.46z)| lr 6.49e-04 | 4165.55 ms | 32.4% bf16 MFU | 124809 tok/s step 6320/19560 | loss 3.532351 (+0.79z)| norm 0.2370 (-1.00z)| lr 6.49e-04 | 4166.64 ms | 32.4% bf16 MFU | 124860 tok/s step 6321/19560 | loss 3.465471 (-1.16z)| norm 0.2444 (-0.69z)| lr 6.48e-04 | 4163.56 ms | 32.4% bf16 MFU | 124913 tok/s step 6322/19560 | loss 3.548692 (+1.28z)| norm 0.2505 (-0.42z)| lr 6.48e-04 | 4194.13 ms | 32.2% bf16 MFU | 124918 tok/s step 6323/19560 | loss 3.484216 (-0.63z)| norm 0.2547 (-0.25z)| lr 6.48e-04 | 4163.18 ms | 32.4% bf16 MFU | 124968 tok/s step 6324/19560 | loss 3.504002 (-0.04z)| norm 0.2324 (-1.16z)| lr 6.48e-04 | 4205.44 ms | 32.1% bf16 MFU | 124953 tok/s step 6325/19560 | loss 3.448720 (-1.68z)| norm 0.2302 (-1.23z)| lr 6.48e-04 | 4175.28 ms | 32.3% bf16 MFU | 124984 tok/s step 6326/19560 | loss 3.512967 (+0.21z)| norm 0.2221 (-1.53z)| lr 6.48e-04 | 4184.41 ms | 32.3% bf16 MFU | 125000 tok/s step 6327/19560 | loss 3.479922 (-0.78z)| norm 0.2349 (-1.01z)| lr 6.48e-04 | 4178.91 ms | 32.3% bf16 MFU | 125023 tok/s step 6328/19560 | loss 3.519127 (+0.39z)| norm 0.2510 (-0.34z)| lr 6.48e-04 | 4160.20 ms | 32.5% bf16 MFU | 125073 tok/s step 6329/19560 | loss 3.476759 (-0.86z)| norm 0.2628 (+0.15z)| lr 6.48e-04 | 4156.97 ms | 32.5% bf16 MFU | 125125 tok/s step 6330/19560 | loss 3.451220 (-1.61z)| norm 0.2369 (-0.90z)| lr 6.48e-04 | 4165.38 ms | 32.4% bf16 MFU | 125163 tok/s step 6331/19560 | loss 3.545591 (+1.19z)| norm 0.2604 (+0.05z)| lr 6.48e-04 | 
4162.24 ms | 32.4% bf16 MFU | 125203 tok/s step 6332/19560 | loss 3.487921 (-0.55z)| norm 0.2846 (+1.03z)| lr 6.48e-04 | 4171.97 ms | 32.4% bf16 MFU | 125226 tok/s step 6333/19560 | loss 3.491062 (-0.44z)| norm 0.2583 (-0.05z)| lr 6.48e-04 | 4172.78 ms | 32.4% bf16 MFU | 125247 tok/s step 6334/19560 | loss 3.571358 (+1.97z)| norm 0.2320 (-1.11z)| lr 6.48e-04 | 4167.35 ms | 32.4% bf16 MFU | 125275 tok/s step 6335/19560 | loss 3.446160 (-1.80z)| norm 0.2557 (-0.15z)| lr 6.48e-04 | 4162.70 ms | 32.4% bf16 MFU | 125309 tok/s step 6336/19560 | loss 3.475932 (-0.89z)| norm 0.2737 (+0.58z)| lr 6.48e-04 | 4168.62 ms | 32.4% bf16 MFU | 125332 tok/s step 6337/19560 | loss 3.532030 (+0.79z)| norm 0.2448 (-0.58z)| lr 6.48e-04 | 4160.16 ms | 32.5% bf16 MFU | 125366 tok/s step 6338/19560 | loss 3.436933 (-2.02z)| norm 0.2432 (-0.64z)| lr 6.48e-04 | 4170.70 ms | 32.4% bf16 MFU | 125383 tok/s step 6339/19560 | loss 3.509594 (+0.13z)| norm 0.2576 (-0.05z)| lr 6.48e-04 | 4167.07 ms | 32.4% bf16 MFU | 125405 tok/s step 6340/19560 | loss 3.500216 (-0.16z)| norm 0.2446 (-0.58z)| lr 6.47e-04 | 4170.04 ms | 32.4% bf16 MFU | 125421 tok/s step 6341/19560 | loss 3.551044 (+1.35z)| norm 0.2495 (-0.39z)| lr 6.47e-04 | 4168.52 ms | 32.4% bf16 MFU | 125439 tok/s step 6342/19560 | loss 3.486561 (-0.57z)| norm 0.2761 (+0.70z)| lr 6.47e-04 | 4165.42 ms | 32.4% bf16 MFU | 125460 tok/s step 6343/19560 | loss 3.474839 (-0.91z)| norm 0.2889 (+1.21z)| lr 6.47e-04 | 4167.52 ms | 32.4% bf16 MFU | 125477 tok/s step 6344/19560 | loss 3.505630 (+0.04z)| norm 0.2697 (+0.42z)| lr 6.47e-04 | 4158.92 ms | 32.5% bf16 MFU | 125507 tok/s step 6345/19560 | loss 3.476063 (-0.86z)| norm 0.2813 (+0.88z)| lr 6.47e-04 | 4170.08 ms | 32.4% bf16 MFU | 125518 tok/s step 6346/19560 | loss 3.480156 (-0.72z)| norm 0.2649 (+0.20z)| lr 6.47e-04 | 4313.25 ms | 31.3% bf16 MFU | 125319 tok/s step 6347/19560 | loss 3.502212 (-0.04z)| norm 0.2630 (+0.12z)| lr 6.47e-04 | 4163.35 ms | 32.4% bf16 MFU | 125350 tok/s step 6348/19560 | 
loss 3.519553 (+0.50z)| norm 0.3042 (+1.78z)| lr 6.47e-04 | 4181.15 ms | 32.3% bf16 MFU | 125352 tok/s step 6349/19560 | loss 3.482682 (-0.63z)| norm 0.2624 (+0.07z)| lr 6.47e-04 | 4165.00 ms | 32.4% bf16 MFU | 125378 tok/s step 6350/19560 | loss 3.504514 (+0.05z)| norm 0.2740 (+0.54z)| lr 6.47e-04 | 4165.30 ms | 32.4% bf16 MFU | 125403 tok/s step 6351/19560 | loss 3.518329 (+0.47z)| norm 0.2770 (+0.66z)| lr 6.47e-04 | 4173.88 ms | 32.3% bf16 MFU | 125413 tok/s step 6352/19560 | loss 3.562180 (+1.84z)| norm 0.2645 (+0.17z)| lr 6.47e-04 | 4166.39 ms | 32.4% bf16 MFU | 125435 tok/s step 6353/19560 | loss 3.466750 (-1.11z)| norm 0.2844 (+1.00z)| lr 6.47e-04 | 4167.68 ms | 32.4% bf16 MFU | 125453 tok/s step 6354/19560 | loss 3.493647 (-0.28z)| norm 0.2636 (+0.13z)| lr 6.47e-04 | 4166.84 ms | 32.4% bf16 MFU | 125471 tok/s step 6355/19560 | loss 3.615772 (+3.31z)| norm 0.2594 (-0.04z)| lr 6.47e-04 | 4169.10 ms | 32.4% bf16 MFU | 125486 tok/s step 6356/19560 | loss 3.474208 (-0.87z)| norm 0.2926 (+1.32z)| lr 6.47e-04 | 4162.50 ms | 32.4% bf16 MFU | 125509 tok/s step 6357/19560 | loss 3.491636 (-0.35z)| norm 0.2710 (+0.43z)| lr 6.47e-04 | 4158.57 ms | 32.5% bf16 MFU | 125537 tok/s step 6358/19560 | loss 3.454124 (-1.44z)| norm 0.2384 (-0.92z)| lr 6.47e-04 | 4162.40 ms | 32.4% bf16 MFU | 125558 tok/s step 6359/19560 | loss 3.520787 (+0.50z)| norm 0.2663 (+0.23z)| lr 6.46e-04 | 4171.94 ms | 32.4% bf16 MFU | 125564 tok/s step 6360/19560 | loss 3.581733 (+2.23z)| norm 0.2617 (+0.04z)| lr 6.46e-04 | 4159.05 ms | 32.5% bf16 MFU | 125589 tok/s step 6361/19560 | loss 3.508935 (+0.12z)| norm 0.2384 (-0.92z)| lr 6.46e-04 | 4172.81 ms | 32.4% bf16 MFU | 125592 tok/s step 6362/19560 | loss 3.526321 (+0.63z)| norm 0.2711 (+0.42z)| lr 6.46e-04 | 4167.43 ms | 32.4% bf16 MFU | 125602 tok/s step 6363/19560 | loss 3.408463 (-2.68z)| norm 0.2619 (+0.04z)| lr 6.46e-04 | 4178.71 ms | 32.3% bf16 MFU | 125595 tok/s step 6364/19560 | loss 3.558317 (+1.52z)| norm 0.2639 (+0.12z)| lr 6.46e-04 | 
4173.12 ms | 32.4% bf16 MFU | 125597 tok/s step 6365/19560 | loss 3.480710 (-0.67z)| norm 0.2573 (-0.16z)| lr 6.46e-04 | 4163.95 ms | 32.4% bf16 MFU | 125613 tok/s step 6366/19560 | loss 3.456561 (-1.33z)| norm 0.2734 (+0.51z)| lr 6.46e-04 | 4177.09 ms | 32.3% bf16 MFU | 125608 tok/s step 6367/19560 | loss 3.533858 (+0.84z)| norm 0.2749 (+0.57z)| lr 6.46e-04 | 4174.62 ms | 32.3% bf16 MFU | 125607 tok/s step 6368/19560 | loss 3.598490 (+2.57z)| norm 0.2770 (+0.65z)| lr 6.46e-04 | 4169.64 ms | 32.4% bf16 MFU | 125614 tok/s step 6369/19560 | loss 3.502886 (-0.04z)| norm 0.2763 (+0.62z)| lr 6.46e-04 | 4164.13 ms | 32.4% bf16 MFU | 125628 tok/s step 6370/19560 | loss 3.472166 (-0.88z)| norm 0.2723 (+0.45z)| lr 6.46e-04 | 4170.87 ms | 32.4% bf16 MFU | 125632 tok/s step 6371/19560 | loss 3.527848 (+0.66z)| norm 0.2475 (-0.60z)| lr 6.46e-04 | 4195.27 ms | 32.2% bf16 MFU | 125599 tok/s step 6372/19560 | loss 3.492093 (-0.32z)| norm 0.3116 (+2.05z)| lr 6.46e-04 | 4161.37 ms | 32.4% bf16 MFU | 125619 tok/s step 6373/19560 | loss 3.583294 (+2.16z)| norm 0.2907 (+1.17z)| lr 6.46e-04 | 4173.87 ms | 32.3% bf16 MFU | 125618 tok/s step 6374/19560 | loss 3.432747 (-1.92z)| norm 0.2555 (-0.27z)| lr 6.46e-04 | 4284.52 ms | 31.5% bf16 MFU | 125456 tok/s step 6375/19560 | loss 3.543180 (+1.08z)| norm 0.2446 (-0.72z)| lr 6.46e-04 | 4170.48 ms | 32.4% bf16 MFU | 125469 tok/s step 6376/19560 | loss 3.464108 (-1.05z)| norm 0.2372 (-1.02z)| lr 6.46e-04 | 4158.72 ms | 32.5% bf16 MFU | 125499 tok/s step 6377/19560 | loss 3.444430 (-1.56z)| norm 0.2321 (-1.21z)| lr 6.46e-04 | 4166.49 ms | 32.4% bf16 MFU | 125516 tok/s step 6378/19560 | loss 3.531463 (+0.77z)| norm 0.2513 (-0.42z)| lr 6.45e-04 | 4162.17 ms | 32.4% bf16 MFU | 125538 tok/s step 6379/19560 | loss 3.495033 (-0.20z)| norm 0.2365 (-1.02z)| lr 6.45e-04 | 4174.41 ms | 32.3% bf16 MFU | 125541 tok/s step 6380/19560 | loss 3.571071 (+1.88z)| norm 0.2438 (-0.71z)| lr 6.45e-04 | 4163.84 ms | 32.4% bf16 MFU | 125560 tok/s step 6381/19560 | 
loss 3.475344 (-0.73z)| norm 0.2636 (+0.10z)| lr 6.45e-04 | 4166.46 ms | 32.4% bf16 MFU | 125573 tok/s step 6382/19560 | loss 3.513756 (+0.32z)| norm 0.2257 (-1.42z)| lr 6.45e-04 | 4164.55 ms | 32.4% bf16 MFU | 125589 tok/s step 6383/19560 | loss 3.520216 (+0.49z)| norm 0.2380 (-0.92z)| lr 6.45e-04 | 4215.81 ms | 32.0% bf16 MFU | 125528 tok/s step 6384/19560 | loss 3.571033 (+1.86z)| norm 0.2540 (-0.23z)| lr 6.45e-04 | 4173.21 ms | 32.4% bf16 MFU | 125533 tok/s step 6385/19560 | loss 3.502868 (+0.01z)| norm 0.2452 (-0.61z)| lr 6.45e-04 | 4167.85 ms | 32.4% bf16 MFU | 125546 tok/s step 6386/19560 | loss 3.445325 (-1.55z)| norm 0.2243 (-1.50z)| lr 6.45e-04 | 4160.07 ms | 32.5% bf16 MFU | 125570 tok/s step 6387/19560 | loss 3.515414 (+0.35z)| norm 0.2683 (+0.43z)| lr 6.45e-04 | 4198.25 ms | 32.2% bf16 MFU | 125536 tok/s step 6388/19560 | loss 3.501450 (-0.02z)| norm 0.2493 (-0.40z)| lr 6.45e-04 | 4159.84 ms | 32.5% bf16 MFU | 125561 tok/s step 6389/19560 | loss 3.447720 (-1.46z)| norm 0.2386 (-0.86z)| lr 6.45e-04 | 4218.41 ms | 32.0% bf16 MFU | 125497 tok/s step 6390/19560 | loss 3.522571 (+0.56z)| norm 0.2397 (-0.81z)| lr 6.45e-04 | 4210.97 ms | 32.1% bf16 MFU | 125448 tok/s step 6391/19560 | loss 3.491063 (-0.29z)| norm 0.2404 (-0.78z)| lr 6.45e-04 | 4162.70 ms | 32.4% bf16 MFU | 125473 tok/s step 6392/19560 | loss 3.531300 (+0.78z)| norm 0.2388 (-0.85z)| lr 6.45e-04 | 4167.85 ms | 32.4% bf16 MFU | 125489 tok/s step 6393/19560 | loss 3.495449 (-0.19z)| norm 0.2503 (-0.36z)| lr 6.45e-04 | 4218.86 ms | 32.0% bf16 MFU | 125428 tok/s step 6394/19560 | loss 3.483480 (-0.51z)| norm 0.2448 (-0.60z)| lr 6.45e-04 | 4163.33 ms | 32.4% bf16 MFU | 125453 tok/s step 6395/19560 | loss 3.509030 (+0.18z)| norm 0.2682 (+0.42z)| lr 6.45e-04 | 4228.88 ms | 31.9% bf16 MFU | 125379 tok/s step 6396/19560 | loss 3.474036 (-0.76z)| norm 0.2610 (+0.09z)| lr 6.44e-04 | 4222.97 ms | 32.0% bf16 MFU | 125318 tok/s step 6397/19560 | loss 3.497125 (-0.13z)| norm 0.2370 (-0.98z)| lr 6.44e-04 | 
4164.35 ms | 32.4% bf16 MFU | 125347 tok/s step 6398/19560 | loss 3.440032 (-1.64z)| norm 0.2506 (-0.36z)| lr 6.44e-04 | 4195.09 ms | 32.2% bf16 MFU | 125328 tok/s step 6399/19560 | loss 3.463815 (-0.99z)| norm 0.2668 (+0.35z)| lr 6.44e-04 | 4809.25 ms | 28.1% bf16 MFU | 124513 tok/s step 6400/19560 | loss 3.456046 (-1.19z)| norm 0.2486 (-0.45z)| lr 6.44e-04 | 4670.85 ms | 28.9% bf16 MFU | 123899 tok/s step 6401/19560 | loss 3.474460 (-0.69z)| norm 0.2350 (-1.05z)| lr 6.44e-04 | 4415.27 ms | 30.6% bf16 MFU | 123642 tok/s step 6402/19560 | loss 3.471526 (-0.76z)| norm 0.2335 (-1.10z)| lr 6.44e-04 | 4269.30 ms | 31.6% bf16 MFU | 123600 tok/s step 6403/19560 | loss 3.503210 (+0.08z)| norm 0.2413 (-0.75z)| lr 6.44e-04 | 4199.69 ms | 32.1% bf16 MFU | 123662 tok/s step 6404/19560 | loss 3.419070 (-2.12z)| norm 0.2881 (+1.32z)| lr 6.44e-04 | 4237.42 ms | 31.9% bf16 MFU | 123665 tok/s step 6405/19560 | loss 3.450325 (-1.30z)| norm 0.2600 (+0.07z)| lr 6.44e-04 | 4349.81 ms | 31.0% bf16 MFU | 123508 tok/s step 6406/19560 | loss 3.458610 (-1.07z)| norm 0.2296 (-1.26z)| lr 6.44e-04 | 4177.40 ms | 32.3% bf16 MFU | 123608 tok/s step 6407/19560 | loss 3.464887 (-0.91z)| norm 0.2674 (+0.41z)| lr 6.44e-04 | 4212.83 ms | 32.0% bf16 MFU | 123650 tok/s step 6408/19560 | loss 3.541791 (+1.08z)| norm 0.2512 (-0.32z)| lr 6.44e-04 | 4551.64 ms | 29.7% bf16 MFU | 123227 tok/s step 6409/19560 | loss 3.433361 (-1.69z)| norm 0.2571 (-0.05z)| lr 6.44e-04 | 4209.99 ms | 32.1% bf16 MFU | 123293 tok/s step 6410/19560 | loss 3.508456 (+0.23z)| norm 0.2699 (+0.52z)| lr 6.44e-04 | 4216.83 ms | 32.0% bf16 MFU | 123345 tok/s step 6411/19560 | loss 3.438939 (-1.54z)| norm 0.2912 (+1.46z)| lr 6.44e-04 | 4239.01 ms | 31.9% bf16 MFU | 123361 tok/s step 6412/19560 | loss 3.487298 (-0.30z)| norm 0.2564 (-0.08z)| lr 6.44e-04 | 4160.32 ms | 32.5% bf16 MFU | 123494 tok/s step 6413/19560 | loss 3.489334 (-0.24z)| norm 0.2496 (-0.39z)| lr 6.44e-04 | 4543.85 ms | 29.7% bf16 MFU | 123089 tok/s step 6414/19560 | 
loss 3.504262 (+0.14z)| norm 0.2477 (-0.47z)| lr 6.44e-04 | 4280.12 ms | 31.5% bf16 MFU | 123059 tok/s step 6415/19560 | loss 3.482696 (-0.41z)| norm 0.2447 (-0.61z)| lr 6.43e-04 | 4173.59 ms | 32.4% bf16 MFU | 123187 tok/s step 6416/19560 | loss 3.505960 (+0.19z)| norm 0.2653 (+0.31z)| lr 6.43e-04 | 4164.50 ms | 32.4% bf16 MFU | 123323 tok/s step 6417/19560 | loss 3.429894 (-1.72z)| norm 0.2423 (-0.72z)| lr 6.43e-04 | 4184.25 ms | 32.3% bf16 MFU | 123421 tok/s step 6418/19560 | loss 3.451483 (-1.16z)| norm 0.2406 (-0.79z)| lr 6.43e-04 | 4224.29 ms | 32.0% bf16 MFU | 123456 tok/s step 6419/19560 | loss 3.478272 (-0.47z)| norm 0.2292 (-1.28z)| lr 6.43e-04 | 4177.92 ms | 32.3% bf16 MFU | 123558 tok/s step 6420/19560 | loss 3.457466 (-1.00z)| norm 0.2370 (-0.93z)| lr 6.43e-04 | 4170.42 ms | 32.4% bf16 MFU | 123666 tok/s step 6421/19560 | loss 3.445173 (-1.30z)| norm 0.2343 (-1.04z)| lr 6.43e-04 | 4172.74 ms | 32.4% bf16 MFU | 123765 tok/s step 6422/19560 | loss 3.501620 (+0.15z)| norm 0.2356 (-0.97z)| lr 6.43e-04 | 4227.07 ms | 31.9% bf16 MFU | 123778 tok/s step 6423/19560 | loss 3.526280 (+0.79z)| norm 0.2319 (-1.14z)| lr 6.43e-04 | 4176.65 ms | 32.3% bf16 MFU | 123865 tok/s step 6424/19560 | loss 3.498071 (+0.05z)| norm 0.2357 (-0.97z)| lr 6.43e-04 | 4167.93 ms | 32.4% bf16 MFU | 123962 tok/s step 6425/19560 | loss 3.481584 (-0.37z)| norm 0.2414 (-0.73z)| lr 6.43e-04 | 4180.52 ms | 32.3% bf16 MFU | 124034 tok/s step 6426/19560 | loss 3.493279 (-0.06z)| norm 0.2669 (+0.54z)| lr 6.43e-04 | 4178.70 ms | 32.3% bf16 MFU | 124106 tok/s step 6427/19560 | loss 3.493398 (-0.06z)| norm 0.2542 (-0.12z)| lr 6.43e-04 | 4225.14 ms | 32.0% bf16 MFU | 124105 tok/s step 6428/19560 | loss 3.435915 (-1.52z)| norm 0.2677 (+0.60z)| lr 6.43e-04 | 4177.70 ms | 32.3% bf16 MFU | 124175 tok/s step 6429/19560 | loss 3.503153 (+0.22z)| norm 0.2647 (+0.45z)| lr 6.43e-04 | 4182.72 ms | 32.3% bf16 MFU | 124233 tok/s step 6430/19560 | loss 3.463662 (-0.80z)| norm 0.2426 (-0.73z)| lr 6.43e-04 | 
4192.02 ms | 32.2% bf16 MFU | 124275 tok/s step 6431/19560 | loss 3.463792 (-0.78z)| norm 0.2621 (+0.33z)| lr 6.43e-04 | 4177.08 ms | 32.3% bf16 MFU | 124337 tok/s step 6432/19560 | loss 3.490942 (-0.07z)| norm 0.2612 (+0.29z)| lr 6.43e-04 | 4199.48 ms | 32.2% bf16 MFU | 124362 tok/s step 6433/19560 | loss 3.454768 (-1.00z)| norm 0.2717 (+0.86z)| lr 6.43e-04 | 4222.30 ms | 32.0% bf16 MFU | 124353 tok/s step 6434/19560 | loss 3.396274 (-2.44z)| norm 0.2545 (-0.09z)| lr 6.42e-04 | 4253.70 ms | 31.7% bf16 MFU | 124298 tok/s step 6435/19560 | loss 3.484686 (-0.20z)| norm 0.2904 (+1.86z)| lr 6.42e-04 | 4198.24 ms | 32.2% bf16 MFU | 124327 tok/s step 6436/19560 | loss 3.447689 (-1.12z)| norm 0.2548 (-0.09z)| lr 6.42e-04 | 4176.66 ms | 32.3% bf16 MFU | 124387 tok/s step 6437/19560 | loss 3.454965 (-0.92z)| norm 0.2425 (-0.76z)| lr 6.42e-04 | 4180.33 ms | 32.3% bf16 MFU | 124439 tok/s step 6438/19560 | loss 3.482161 (-0.22z)| norm 0.2572 (+0.05z)| lr 6.42e-04 | 4178.07 ms | 32.3% bf16 MFU | 124491 tok/s step 6439/19560 | loss 3.509884 (+0.48z)| norm 0.2347 (-1.19z)| lr 6.42e-04 | 4210.77 ms | 32.1% bf16 MFU | 124492 tok/s step 6440/19560 | loss 3.456673 (-0.88z)| norm 0.2657 (+0.56z)| lr 6.42e-04 | 4168.19 ms | 32.4% bf16 MFU | 124557 tok/s step 6441/19560 | loss 3.484689 (-0.16z)| norm 0.2639 (+0.46z)| lr 6.42e-04 | 4181.29 ms | 32.3% bf16 MFU | 124598 tok/s step 6442/19560 | loss 3.454924 (-0.92z)| norm 0.2503 (-0.31z)| lr 6.42e-04 | 4164.17 ms | 32.4% bf16 MFU | 124664 tok/s step 6443/19560 | loss 3.470344 (-0.52z)| norm 0.2391 (-0.93z)| lr 6.42e-04 | 4178.80 ms | 32.3% bf16 MFU | 124704 tok/s step 6444/19560 | loss 3.440367 (-1.26z)| norm 0.2828 (+1.52z)| lr 6.42e-04 | 4160.76 ms | 32.5% bf16 MFU | 124769 tok/s step 6445/19560 | loss 3.506151 (+0.39z)| norm 0.2710 (+0.85z)| lr 6.42e-04 | 4220.45 ms | 32.0% bf16 MFU | 124742 tok/s step 6446/19560 | loss 3.456427 (-0.87z)| norm 0.2515 (-0.23z)| lr 6.42e-04 | 4220.83 ms | 32.0% bf16 MFU | 124715 tok/s step 6447/19560 | 
loss 3.471113 (-0.50z)| norm 0.2500 (-0.31z)| lr 6.42e-04 | 4180.58 ms | 32.3% bf16 MFU | 124750 tok/s step 6448/19560 | loss 3.431825 (-1.47z)| norm 0.2742 (+1.04z)| lr 6.42e-04 | 4174.82 ms | 32.3% bf16 MFU | 124792 tok/s step 6449/19560 | loss 3.461711 (-0.71z)| norm 0.2707 (+0.83z)| lr 6.42e-04 | 4174.49 ms | 32.3% bf16 MFU | 124832 tok/s step 6450/19560 | loss 3.519903 (+0.78z)| norm 0.2641 (+0.45z)| lr 6.42e-04 | 4207.98 ms | 32.1% bf16 MFU | 124820 tok/s step 6451/19560 | loss 3.438484 (-1.29z)| norm 0.2734 (+0.96z)| lr 6.42e-04 | 4210.70 ms | 32.1% bf16 MFU | 124805 tok/s step 6452/19560 | loss 3.500644 (+0.29z)| norm 0.2489 (-0.42z)| lr 6.41e-04 | 4199.70 ms | 32.1% bf16 MFU | 124806 tok/s step 6453/19560 | loss 3.463427 (-0.66z)| norm 0.2882 (+1.77z)| lr 6.41e-04 | 4173.87 ms | 32.3% bf16 MFU | 124847 tok/s step 6454/19560 | loss 3.432240 (-1.43z)| norm 0.2938 (+2.05z)| lr 6.41e-04 | 4208.53 ms | 32.1% bf16 MFU | 124833 tok/s step 6455/19560 | loss 3.454844 (-0.85z)| norm 0.2880 (+1.69z)| lr 6.41e-04 | 4190.08 ms | 32.2% bf16 MFU | 124848 tok/s step 6456/19560 | loss 3.419919 (-1.69z)| norm 0.2431 (-0.82z)| lr 6.41e-04 | 4176.55 ms | 32.3% bf16 MFU | 124882 tok/s step 6457/19560 | loss 3.472727 (-0.37z)| norm 0.2794 (+1.19z)| lr 6.41e-04 | 4170.24 ms | 32.4% bf16 MFU | 124924 tok/s step 6458/19560 | loss 3.492853 (+0.12z)| norm 0.2652 (+0.40z)| lr 6.41e-04 | 4173.81 ms | 32.3% bf16 MFU | 124958 tok/s step 6459/19560 | loss 3.485324 (-0.05z)| norm 0.2557 (-0.14z)| lr 6.41e-04 | 4176.33 ms | 32.3% bf16 MFU | 124987 tok/s step 6460/19560 | loss 3.458078 (-0.74z)| norm 0.2703 (+0.69z)| lr 6.41e-04 | 4184.54 ms | 32.3% bf16 MFU | 125003 tok/s step 6461/19560 | loss 3.453557 (-0.84z)| norm 0.2593 (+0.07z)| lr 6.41e-04 | 4170.24 ms | 32.4% bf16 MFU | 125039 tok/s step 6462/19560 | loss 3.489849 (+0.09z)| norm 0.2645 (+0.36z)| lr 6.41e-04 | 4172.80 ms | 32.4% bf16 MFU | 125069 tok/s step 6463/19560 | loss 3.484710 (-0.05z)| norm 0.2841 (+1.44z)| lr 6.41e-04 | 
4164.23 ms | 32.4% bf16 MFU | 125110 tok/s step 6464/19560 | loss 3.587882 (+2.52z)| norm 0.2983 (+2.19z)| lr 6.41e-04 | 4194.20 ms | 32.2% bf16 MFU | 125105 tok/s step 6465/19560 | loss 3.437885 (-1.22z)| norm 0.2445 (-0.79z)| lr 6.41e-04 | 4162.29 ms | 32.4% bf16 MFU | 125148 tok/s step 6466/19560 | loss 3.472110 (-0.38z)| norm 0.2644 (+0.31z)| lr 6.41e-04 | 4183.69 ms | 32.3% bf16 MFU | 125156 tok/s step 6467/19560 | loss 3.487510 (+0.02z)| norm 0.2624 (+0.20z)| lr 6.41e-04 | 4228.31 ms | 31.9% bf16 MFU | 125098 tok/s step 6468/19560 | loss 3.470727 (-0.40z)| norm 0.2673 (+0.46z)| lr 6.41e-04 | 4172.55 ms | 32.4% bf16 MFU | 125126 tok/s step 6469/19560 | loss 3.458515 (-0.70z)| norm 0.2686 (+0.52z)| lr 6.41e-04 | 4174.82 ms | 32.3% bf16 MFU | 125149 tok/s step 6470/19560 | loss 3.455281 (-0.77z)| norm 0.3021 (+2.33z)| lr 6.41e-04 | 4160.16 ms | 32.5% bf16 MFU | 125193 tok/s step 6471/19560 | loss 3.532509 (+1.17z)| norm 0.2982 (+2.10z)| lr 6.40e-04 | 4171.70 ms | 32.4% bf16 MFU | 125217 tok/s step 6472/19560 | loss 3.441787 (-1.10z)| norm 0.2614 (+0.11z)| lr 6.40e-04 | 4174.81 ms | 32.3% bf16 MFU | 125235 tok/s step 6473/19560 | loss 3.418252 (-1.67z)| norm 0.2658 (+0.36z)| lr 6.40e-04 | 4183.93 ms | 32.3% bf16 MFU | 125239 tok/s step 6474/19560 | loss 3.467588 (-0.43z)| norm 0.2523 (-0.37z)| lr 6.40e-04 | 4174.34 ms | 32.3% bf16 MFU | 125257 tok/s step 6475/19560 | loss 3.507188 (+0.55z)| norm 0.2337 (-1.37z)| lr 6.40e-04 | 4197.59 ms | 32.2% bf16 MFU | 125239 tok/s step 6476/19560 | loss 3.444100 (-1.00z)| norm 0.2586 (+0.00z)| lr 6.40e-04 | 4165.40 ms | 32.4% bf16 MFU | 125271 tok/s step 6477/19560 | loss 3.449312 (-0.86z)| norm 0.2641 (+0.31z)| lr 6.40e-04 | 4203.23 ms | 32.1% bf16 MFU | 125244 tok/s step 6478/19560 | loss 3.495686 (+0.29z)| norm 0.2124 (-2.48z)| lr 6.40e-04 | 4159.15 ms | 32.5% bf16 MFU | 125284 tok/s step 6479/19560 | loss 3.429557 (-1.33z)| norm 0.2595 (+0.08z)| lr 6.40e-04 | 4257.71 ms | 31.7% bf16 MFU | 125177 tok/s step 6480/19560 | 
loss 3.390266 (-2.25z)| norm 0.2408 (-0.92z)| lr 6.40e-04 | 4178.32 ms | 32.3% bf16 MFU | 125192 tok/s step 6481/19560 | loss 3.528385 (+1.12z)| norm 0.2461 (-0.62z)| lr 6.40e-04 | 4183.61 ms | 32.3% bf16 MFU | 125199 tok/s step 6482/19560 | loss 3.488878 (+0.15z)| norm 0.2558 (-0.09z)| lr 6.40e-04 | 4166.03 ms | 32.4% bf16 MFU | 125231 tok/s step 6483/19560 | loss 3.474367 (-0.18z)| norm 0.2549 (-0.14z)| lr 6.40e-04 | 4195.99 ms | 32.2% bf16 MFU | 125217 tok/s step 6484/19560 | loss 3.479699 (-0.05z)| norm 0.2605 (+0.18z)| lr 6.40e-04 | 4167.37 ms | 32.4% bf16 MFU | 125247 tok/s step 6485/19560 | loss 3.453489 (-0.71z)| norm 0.2433 (-0.76z)| lr 6.40e-04 | 4158.55 ms | 32.5% bf16 MFU | 125288 tok/s step 6486/19560 | loss 3.490932 (+0.24z)| norm 0.2745 (+0.96z)| lr 6.40e-04 | 4185.59 ms | 32.3% bf16 MFU | 125287 tok/s step 6487/19560 | loss 3.522637 (+1.05z)| norm 0.2902 (+1.80z)| lr 6.40e-04 | 4179.65 ms | 32.3% bf16 MFU | 125294 tok/s step 6488/19560 | loss 3.457258 (-0.61z)| norm 0.2642 (+0.37z)| lr 6.40e-04 | 4172.64 ms | 32.4% bf16 MFU | 125312 tok/s step 6489/19560 | loss 3.486676 (+0.17z)| norm 0.2660 (+0.46z)| lr 6.39e-04 | 4168.87 ms | 32.4% bf16 MFU | 125334 tok/s step 6490/19560 | loss 3.566965 (+2.23z)| norm 0.2719 (+0.79z)| lr 6.39e-04 | 4186.07 ms | 32.3% bf16 MFU | 125330 tok/s step 6491/19560 | loss 3.507345 (+0.68z)| norm 0.2499 (-0.42z)| lr 6.39e-04 | 4173.20 ms | 32.4% bf16 MFU | 125345 tok/s step 6492/19560 | loss 3.432608 (-1.26z)| norm 0.2455 (-0.66z)| lr 6.39e-04 | 4172.48 ms | 32.4% bf16 MFU | 125361 tok/s step 6493/19560 | loss 3.482848 (+0.06z)| norm 0.2182 (-2.10z)| lr 6.39e-04 | 4178.32 ms | 32.3% bf16 MFU | 125366 tok/s step 6494/19560 | loss 3.468525 (-0.32z)| norm 0.2573 (+0.02z)| lr 6.39e-04 | 4178.60 ms | 32.3% bf16 MFU | 125372 tok/s step 6495/19560 | loss 3.481889 (+0.05z)| norm 0.2476 (-0.50z)| lr 6.39e-04 | 4193.22 ms | 32.2% bf16 MFU | 125355 tok/s step 6496/19560 | loss 3.589429 (+2.93z)| norm 0.2694 (+0.69z)| lr 6.39e-04 | 
4165.34 ms | 32.4% bf16 MFU | 125380 tok/s step 6497/19560 | loss 3.397875 (-2.15z)| norm 0.2542 (-0.13z)| lr 6.39e-04 | 4183.34 ms | 32.3% bf16 MFU | 125378 tok/s step 6498/19560 | loss 3.431341 (-1.25z)| norm 0.2766 (+1.09z)| lr 6.39e-04 | 4161.71 ms | 32.4% bf16 MFU | 125408 tok/s step 6499/19560 | loss 3.455506 (-0.60z)| norm 0.2260 (-1.64z)| lr 6.39e-04 | 4162.64 ms | 32.4% bf16 MFU | 125435 tok/s step 6500/19560 | loss 3.531797 (+1.39z)| norm 0.2570 (+0.06z)| lr 6.39e-04 | 4200.79 ms | 32.1% bf16 MFU | 125404 tok/s val loss 3.479999
HellaSwag: 2844/10042 = 0.283211 step 6501/19560 | loss 3.465091 (-0.34z)| norm 0.2572 (+0.08z)| lr 6.39e-04 | 4178.10 ms | 32.3% bf16 MFU | 125408 tok/s step 6502/19560 | loss 3.509966 (+0.85z)| norm 0.2484 (-0.41z)| lr 6.39e-04 | 4186.81 ms | 32.2% bf16 MFU | 125398 tok/s step 6503/19560 | loss 3.450166 (-0.75z)| norm 0.2654 (+0.54z)| lr 6.39e-04 | 4169.88 ms | 32.4% bf16 MFU | 125415 tok/s step 6504/19560 | loss 3.454930 (-0.62z)| norm 0.2464 (-0.54z)| lr 6.39e-04 | 4246.39 ms | 31.8% bf16 MFU | 125318 tok/s step 6505/19560 | loss 3.491292 (+0.36z)| norm 0.2341 (-1.25z)| lr 6.39e-04 | 4197.85 ms | 32.2% bf16 MFU | 125296 tok/s step 6506/19560 | loss 3.461261 (-0.45z)| norm 0.2730 (+0.96z)| lr 6.39e-04 | 4243.37 ms | 31.8% bf16 MFU | 125209 tok/s step 6507/19560 | loss 3.493787 (+0.45z)| norm 0.2506 (-0.32z)| lr 6.38e-04 | 4159.84 ms | 32.5% bf16 MFU | 125251 tok/s step 6508/19560 | loss 3.419111 (-1.60z)| norm 0.2374 (-1.07z)| lr 6.38e-04 | 4176.87 ms | 32.3% bf16 MFU | 125264 tok/s step 6509/19560 | loss 3.457033 (-0.53z)| norm 0.2486 (-0.42z)| lr 6.38e-04 | 4164.28 ms | 32.4% bf16 MFU | 125296 tok/s step 6510/19560 | loss 3.482942 (+0.20z)| norm 0.2483 (-0.46z)| lr 6.38e-04 | 4178.69 ms | 32.3% bf16 MFU | 125305 tok/s step 6511/19560 | loss 3.449266 (-0.73z)| norm 0.2536 (-0.16z)| lr 6.38e-04 | 4215.60 ms | 32.0% bf16 MFU | 125258 tok/s step 6512/19560 | loss 3.433285 (-1.18z)| norm 0.2376 (-1.08z)| lr 6.38e-04 | 4226.28 ms | 31.9% bf16 MFU | 125198 tok/s step 6513/19560 | loss 3.517394
(+1.24z)| norm 0.2561 (-0.01z)| lr 6.38e-04 | 4171.98 ms | 32.4% bf16 MFU | 125221 tok/s step 6514/19560 | loss 3.456891 (-0.51z)| norm 0.2523 (-0.25z)| lr 6.38e-04 | 4184.17 ms | 32.3% bf16 MFU | 125225 tok/s step 6515/19560 | loss 3.419708 (-1.55z)| norm 0.2467 (-0.57z)| lr 6.38e-04 | 4243.57 ms | 31.8% bf16 MFU | 125141 tok/s step 6516/19560 | loss 3.424088 (-1.40z)| norm 0.2947 (+2.18z)| lr 6.38e-04 | 4194.68 ms | 32.2% bf16 MFU | 125134 tok/s step 6517/19560 | loss 3.535803 (+1.76z)| norm 0.2735 (+0.95z)| lr 6.38e-04 | 4291.12 ms | 31.5% bf16 MFU | 124986 tok/s step 6518/19560 | loss 3.394945 (-2.18z)| norm 0.2330 (-1.37z)| lr 6.38e-04 | 4172.19 ms | 32.4% bf16 MFU | 125020 tok/s step 6519/19560 | loss 3.495023 (+0.62z)| norm 0.2595 (+0.14z)| lr 6.38e-04 | 4181.95 ms | 32.3% bf16 MFU | 125037 tok/s step 6520/19560 | loss 3.440302 (-0.90z)| norm 0.2631 (+0.34z)| lr 6.38e-04 | 4172.11 ms | 32.4% bf16 MFU | 125069 tok/s step 6521/19560 | loss 3.515504 (+1.21z)| norm 0.2523 (-0.29z)| lr 6.38e-04 | 4165.42 ms | 32.4% bf16 MFU | 125109 tok/s step 6522/19560 | loss 3.441109 (-0.86z)| norm 0.2532 (-0.24z)| lr 6.38e-04 | 4172.66 ms | 32.4% bf16 MFU | 125136 tok/s step 6523/19560 | loss 3.381589 (-2.45z)| norm 0.2686 (+0.65z)| lr 6.38e-04 | 4175.03 ms | 32.3% bf16 MFU | 125158 tok/s step 6524/19560 | loss 3.471099 (+0.01z)| norm 0.2697 (+0.71z)| lr 6.38e-04 | 4171.16 ms | 32.4% bf16 MFU | 125184 tok/s step 6525/19560 | loss 3.504914 (+0.93z)| norm 0.2486 (-0.52z)| lr 6.38e-04 | 4179.08 ms | 32.3% bf16 MFU | 125198 tok/s step 6526/19560 | loss 3.531789 (+1.63z)| norm 0.3090 (+2.87z)| lr 6.37e-04 | 4170.87 ms | 32.4% bf16 MFU | 125223 tok/s step 6527/19560 | loss 3.525221 (+1.43z)| norm 0.2587 (+0.04z)| lr 6.37e-04 | 4169.73 ms | 32.4% bf16 MFU | 125249 tok/s step 6528/19560 | loss 3.468710 (-0.10z)| norm 0.2814 (+1.30z)| lr 6.37e-04 | 4169.32 ms | 32.4% bf16 MFU | 125274 tok/s step 6529/19560 | loss 3.462375 (-0.26z)| norm 0.2620 (+0.21z)| lr 6.37e-04 | 4165.46 ms | 
32.4% bf16 MFU | 125304 tok/s step 6530/19560 | loss 3.449628 (-0.60z)| norm 0.2552 (-0.19z)| lr 6.37e-04 | 4216.05 ms | 32.0% bf16 MFU | 125256 tok/s step 6531/19560 | loss 3.435129 (-0.98z)| norm 0.2690 (+0.58z)| lr 6.37e-04 | 4196.83 ms | 32.2% bf16 MFU | 125240 tok/s step 6532/19560 | loss 3.535027 (+1.68z)| norm 0.8869 (+10.75z)| lr 6.37e-04 | 4173.64 ms | 32.4% bf16 MFU | 125258 tok/s step 6533/19560 | loss 3.471133 (-0.04z)| norm 0.4224 (+2.64z)| lr 6.37e-04 | 4167.85 ms | 32.4% bf16 MFU | 125285 tok/s step 6534/19560 | loss 3.552897 (+2.10z)| norm 0.3765 (+1.83z)| lr 6.37e-04 | 4171.87 ms | 32.4% bf16 MFU | 125305 tok/s step 6535/19560 | loss 3.470356 (-0.08z)| norm 0.3507 (+1.38z)| lr 6.37e-04 | 4181.41 ms | 32.3% bf16 MFU | 125309 tok/s step 6536/19560 | loss 3.501698 (+0.77z)| norm 0.3237 (+0.93z)| lr 6.37e-04 | 4184.26 ms | 32.3% bf16 MFU | 125308 tok/s step 6537/19560 | loss 3.498328 (+0.67z)| norm 0.3179 (+0.82z)| lr 6.37e-04 | 4167.69 ms | 32.4% bf16 MFU | 125333 tok/s step 6538/19560 | loss 3.485299 (+0.32z)| norm 0.3291 (+0.99z)| lr 6.37e-04 | 4172.76 ms | 32.4% bf16 MFU | 125348 tok/s step 6539/19560 | loss 3.468605 (-0.13z)| norm 0.3190 (+0.83z)| lr 6.37e-04 | 4176.27 ms | 32.3% bf16 MFU | 125358 tok/s step 6540/19560 | loss 3.479007 (+0.15z)| norm 0.2730 (+0.08z)| lr 6.37e-04 | 4172.22 ms | 32.4% bf16 MFU | 125373 tok/s step 6541/19560 | loss 3.465697 (-0.20z)| norm 0.2985 (+0.48z)| lr 6.37e-04 | 4171.25 ms | 32.4% bf16 MFU | 125389 tok/s step 6542/19560 | loss 3.478167 (+0.14z)| norm 0.2574 (-0.19z)| lr 6.37e-04 | 4184.21 ms | 32.3% bf16 MFU | 125385 tok/s step 6543/19560 | loss 3.440684 (-0.86z)| norm 0.2985 (+0.48z)| lr 6.37e-04 | 4168.21 ms | 32.4% bf16 MFU | 125404 tok/s step 6544/19560 | loss 3.491375 (+0.51z)| norm 0.2443 (-0.40z)| lr 6.36e-04 | 4172.17 ms | 32.4% bf16 MFU | 125417 tok/s step 6545/19560 | loss 3.450697 (-0.60z)| norm 0.2689 (-0.01z)| lr 6.36e-04 | 4171.49 ms | 32.4% bf16 MFU | 125431 tok/s step 6546/19560 | loss 3.447343 
(-0.69z)| norm 0.2896 (+0.32z)| lr 6.36e-04 | 4170.49 ms | 32.4% bf16 MFU | 125445 tok/s step 6547/19560 | loss 3.488293 (+0.42z)| norm 0.2323 (-0.61z)| lr 6.36e-04 | 4174.02 ms | 32.3% bf16 MFU | 125453 tok/s step 6548/19560 | loss 3.521194 (+1.29z)| norm 0.2666 (-0.05z)| lr 6.36e-04 | 4194.01 ms | 32.2% bf16 MFU | 125431 tok/s step 6549/19560 | loss 3.436428 (-0.99z)| norm 0.2534 (-0.27z)| lr 6.36e-04 | 4193.00 ms | 32.2% bf16 MFU | 125411 tok/s step 6550/19560 | loss 3.461634 (-0.31z)| norm 0.2266 (-0.71z)| lr 6.36e-04 | 4174.55 ms | 32.3% bf16 MFU | 125420 tok/s step 6551/19560 | loss 3.440412 (-0.86z)| norm 0.2753 (+0.08z)| lr 6.36e-04 | 4187.39 ms | 32.2% bf16 MFU | 125410 tok/s step 6552/19560 | loss 3.528693 (+1.51z)| norm 0.2498 (-0.34z)| lr 6.36e-04 | 4169.20 ms | 32.4% bf16 MFU | 125427 tok/s step 6553/19560 | loss 3.558584 (+2.25z)| norm 0.2578 (-0.21z)| lr 6.36e-04 | 4163.24 ms | 32.4% bf16 MFU | 125452 tok/s step 6554/19560 | loss 3.500588 (+0.72z)| norm 0.2344 (-0.58z)| lr 6.36e-04 | 4170.55 ms | 32.4% bf16 MFU | 125465 tok/s step 6555/19560 | loss 3.531774 (+1.52z)| norm 0.2480 (-0.36z)| lr 6.36e-04 | 4183.06 ms | 32.3% bf16 MFU | 125459 tok/s step 6556/19560 | loss 3.441209 (-0.85z)| norm 0.3648 (+1.51z)| lr 6.36e-04 | 4167.53 ms | 32.4% bf16 MFU | 125476 tok/s step 6557/19560 | loss 3.439213 (-0.88z)| norm 0.2802 (+0.15z)| lr 6.36e-04 | 4168.38 ms | 32.4% bf16 MFU | 125491 tok/s step 6558/19560 | loss 3.492085 (+0.49z)| norm 0.2465 (-0.40z)| lr 6.36e-04 | 4184.16 ms | 32.3% bf16 MFU | 125481 tok/s step 6559/19560 | loss 3.523859 (+1.30z)| norm 0.2459 (-0.41z)| lr 6.36e-04 | 4170.80 ms | 32.4% bf16 MFU | 125493 tok/s step 6560/19560 | loss 3.442618 (-0.80z)| norm 0.2581 (-0.21z)| lr 6.36e-04 | 4172.74 ms | 32.4% bf16 MFU | 125500 tok/s step 6561/19560 | loss 3.441049 (-0.83z)| norm 0.2391 (-0.51z)| lr 6.36e-04 | 4171.70 ms | 32.4% bf16 MFU | 125509 tok/s step 6562/19560 | loss 3.385806 (-2.25z)| norm 0.2564 (-0.23z)| lr 6.35e-04 | 4173.47 ms | 
32.4% bf16 MFU | 125515 tok/s step 6563/19560 | loss 3.450127 (-0.59z)| norm 0.2577 (-0.21z)| lr 6.35e-04 | 4179.06 ms | 32.3% bf16 MFU | 125512 tok/s step 6564/19560 | loss 3.481888 (+0.22z)| norm 0.2729 (+0.04z)| lr 6.35e-04 | 4166.46 ms | 32.4% bf16 MFU | 125528 tok/s step 6565/19560 | loss 3.457392 (-0.41z)| norm 0.2311 (-0.64z)| lr 6.35e-04 | 4189.72 ms | 32.2% bf16 MFU | 125509 tok/s step 6566/19560 | loss 3.451554 (-0.55z)| norm 0.2901 (+0.31z)| lr 6.35e-04 | 4172.30 ms | 32.4% bf16 MFU | 125516 tok/s step 6567/19560 | loss 3.480355 (+0.20z)| norm 0.2628 (-0.13z)| lr 6.35e-04 | 4192.53 ms | 32.2% bf16 MFU | 125493 tok/s step 6568/19560 | loss 3.449689 (-0.59z)| norm 0.2760 (+0.08z)| lr 6.35e-04 | 4177.40 ms | 32.3% bf16 MFU | 125494 tok/s step 6569/19560 | loss 3.497431 (+0.64z)| norm 0.2610 (-0.16z)| lr 6.35e-04 | 4165.84 ms | 32.4% bf16 MFU | 125512 tok/s step 6570/19560 | loss 3.495036 (+0.57z)| norm 0.2612 (-0.16z)| lr 6.35e-04 | 4180.68 ms | 32.3% bf16 MFU | 125506 tok/s step 6571/19560 | loss 3.453957 (-0.49z)| norm 0.2716 (+0.00z)| lr 6.35e-04 | 4178.72 ms | 32.3% bf16 MFU | 125504 tok/s step 6572/19560 | loss 3.486432 (+0.34z)| norm 0.2378 (-0.54z)| lr 6.35e-04 | 4211.91 ms | 32.1% bf16 MFU | 125453 tok/s step 6573/19560 | loss 3.459860 (-0.34z)| norm 0.2760 (+0.08z)| lr 6.35e-04 | 4161.61 ms | 32.4% bf16 MFU | 125479 tok/s step 6574/19560 | loss 3.403052 (-1.77z)| norm 0.2586 (-0.20z)| lr 6.35e-04 | 4162.93 ms | 32.4% bf16 MFU | 125503 tok/s step 6575/19560 | loss 3.452642 (-0.50z)| norm 0.2320 (-0.63z)| lr 6.35e-04 | 4161.88 ms | 32.4% bf16 MFU | 125526 tok/s step 6576/19560 | loss 3.415602 (-1.44z)| norm 0.2729 (+0.03z)| lr 6.35e-04 | 4161.23 ms | 32.4% bf16 MFU | 125549 tok/s step 6577/19560 | loss 3.422678 (-1.24z)| norm 0.2543 (-0.27z)| lr 6.35e-04 | 4213.85 ms | 32.0% bf16 MFU | 125493 tok/s step 6578/19560 | loss 3.441297 (-0.76z)| norm 0.2478 (-0.37z)| lr 6.35e-04 | 4172.10 ms | 32.4% bf16 MFU | 125502 tok/s step 6579/19560 | loss 3.457953 
(-0.34z)| norm 0.2409 (-0.48z)| lr 6.35e-04 | 4168.13 ms | 32.4% bf16 MFU | 125516 tok/s step 6580/19560 | loss 3.526591 (+1.39z)| norm 0.2404 (-0.48z)| lr 6.35e-04 | 4167.48 ms | 32.4% bf16 MFU | 125530 tok/s step 6581/19560 | loss 3.429218 (-1.06z)| norm 0.2382 (-0.51z)| lr 6.34e-04 | 4172.08 ms | 32.4% bf16 MFU | 125537 tok/s step 6582/19560 | loss 3.575458 (+2.53z)| norm 0.2545 (-0.25z)| lr 6.34e-04 | 4909.40 ms | 27.5% bf16 MFU | 124600 tok/s step 6583/19560 | loss 3.439680 (-0.80z)| norm 0.2345 (-0.56z)| lr 6.34e-04 | 4607.35 ms | 29.3% bf16 MFU | 124060 tok/s step 6584/19560 | loss 3.435334 (-0.92z)| norm 0.2297 (-0.63z)| lr 6.34e-04 | 4172.04 ms | 32.4% bf16 MFU | 124140 tok/s step 6585/19560 | loss 3.487980 (+0.38z)| norm 0.2347 (-0.55z)| lr 6.34e-04 | 4166.96 ms | 32.4% bf16 MFU | 124224 tok/s step 6586/19560 | loss 3.429865 (-1.04z)| norm 0.2417 (-0.43z)| lr 6.34e-04 | 4163.57 ms | 32.4% bf16 MFU | 124309 tok/s step 6587/19560 | loss 3.471564 (-0.01z)| norm 0.2523 (-0.26z)| lr 6.34e-04 | 4168.47 ms | 32.4% bf16 MFU | 124382 tok/s step 6588/19560 | loss 3.465096 (-0.17z)| norm 0.2464 (-0.35z)| lr 6.34e-04 | 4175.35 ms | 32.3% bf16 MFU | 124441 tok/s step 6589/19560 | loss 3.381858 (-2.17z)| norm 0.2607 (-0.12z)| lr 6.34e-04 | 4360.24 ms | 31.0% bf16 MFU | 124231 tok/s step 6590/19560 | loss 3.460999 (-0.25z)| norm 0.2304 (-0.60z)| lr 6.34e-04 | 4717.06 ms | 28.6% bf16 MFU | 123577 tok/s step 6591/19560 | loss 3.503357 (+0.77z)| norm 0.2623 (-0.09z)| lr 6.34e-04 | 4642.68 ms | 29.1% bf16 MFU | 123045 tok/s step 6592/19560 | loss 3.476910 (+0.16z)| norm 0.2585 (-0.15z)| lr 6.34e-04 | 4550.76 ms | 29.7% bf16 MFU | 122653 tok/s step 6593/19560 | loss 3.481833 (+0.27z)| norm 0.2367 (-0.50z)| lr 6.34e-04 | 4409.28 ms | 30.6% bf16 MFU | 122466 tok/s step 6594/19560 | loss 3.487026 (+0.40z)| norm 0.2710 (+0.05z)| lr 6.34e-04 | 4439.55 ms | 30.4% bf16 MFU | 122247 tok/s step 6595/19560 | loss 3.451555 (-0.48z)| norm 0.2590 (-0.14z)| lr 6.34e-04 | 4557.47 ms | 
29.6% bf16 MFU | 121887 tok/s step 6596/19560 | loss 3.469794 (-0.02z)| norm 0.2769 (+0.15z)| lr 6.34e-04 | 6373.85 ms | 21.2% bf16 MFU | 119905 tok/s step 6597/19560 | loss 3.595609 (+2.98z)| norm 0.2565 (-0.18z)| lr 6.34e-04 | 4274.90 ms | 31.6% bf16 MFU | 120042 tok/s step 6598/19560 | loss 3.493431 (+0.51z)| norm 0.2578 (-0.15z)| lr 6.34e-04 | 4155.55 ms | 32.5% bf16 MFU | 120348 tok/s step 6599/19560 | loss 3.455235 (-0.39z)| norm 0.2472 (-0.32z)| lr 6.33e-04 | 4146.26 ms | 32.6% bf16 MFU | 120653 tok/s step 6600/19560 | loss 3.455741 (-0.39z)| norm 0.2490 (-0.29z)| lr 6.33e-04 | 4721.42 ms | 28.6% bf16 MFU | 120173 tok/s step 6601/19560 | loss 3.472845 (+0.02z)| norm 0.2465 (-0.32z)| lr 6.33e-04 | 4160.73 ms | 32.5% bf16 MFU | 120465 tok/s step 6602/19560 | loss 3.461442 (-0.26z)| norm 0.2624 (-0.07z)| lr 6.33e-04 | 4173.67 ms | 32.3% bf16 MFU | 120722 tok/s step 6603/19560 | loss 3.528028 (+1.36z)| norm 0.2403 (-0.43z)| lr 6.33e-04 | 4196.65 ms | 32.2% bf16 MFU | 120933 tok/s step 6604/19560 | loss 3.502410 (+0.72z)| norm 0.2373 (-0.47z)| lr 6.33e-04 | 4339.83 ms | 31.1% bf16 MFU | 120926 tok/s step 6605/19560 | loss 3.590710 (+2.76z)| norm 0.2916 (+0.40z)| lr 6.33e-04 | 4161.83 ms | 32.4% bf16 MFU | 121179 tok/s step 6606/19560 | loss 3.434814 (-0.91z)| norm 0.2558 (-0.18z)| lr 6.33e-04 | 4336.49 ms | 31.1% bf16 MFU | 121165 tok/s step 6607/19560 | loss 3.495012 (+0.50z)| norm 0.2624 (-0.08z)| lr 6.33e-04 | 4228.06 ms | 31.9% bf16 MFU | 121307 tok/s step 6608/19560 | loss 3.525113 (+1.20z)| norm 0.2223 (-0.72z)| lr 6.33e-04 | 4217.73 ms | 32.0% bf16 MFU | 121457 tok/s step 6609/19560 | loss 3.507892 (+0.80z)| norm 0.2601 (-0.11z)| lr 6.33e-04 | 4178.97 ms | 32.3% bf16 MFU | 121657 tok/s step 6610/19560 | loss 3.474226 (-0.01z)| norm 0.2629 (-0.07z)| lr 6.33e-04 | 4280.93 ms | 31.5% bf16 MFU | 121698 tok/s step 6611/19560 | loss 3.437071 (-0.89z)| norm 0.2467 (-0.33z)| lr 6.33e-04 | 4160.40 ms | 32.5% bf16 MFU | 121914 tok/s step 6612/19560 | loss 3.406251 
(-1.60z)| norm 0.2764 (+0.15z)| lr 6.33e-04 | 4171.59 ms | 32.4% bf16 MFU | 122102 tok/s step 6613/19560 | loss 3.506659 (+0.77z)| norm 0.2508 (-0.27z)| lr 6.33e-04 | 4203.39 ms | 32.1% bf16 MFU | 122233 tok/s step 6614/19560 | loss 3.507209 (+0.78z)| norm 0.2382 (-0.46z)| lr 6.33e-04 | 4202.54 ms | 32.1% bf16 MFU | 122359 tok/s step 6615/19560 | loss 3.473153 (-0.02z)| norm 0.2388 (-0.45z)| lr 6.33e-04 | 4176.31 ms | 32.3% bf16 MFU | 122518 tok/s step 6616/19560 | loss 3.503458 (+0.69z)| norm 0.2428 (-0.38z)| lr 6.33e-04 | 4170.93 ms | 32.4% bf16 MFU | 122677 tok/s step 6617/19560 | loss 3.493861 (+0.46z)| norm 0.2432 (-0.37z)| lr 6.32e-04 | 4186.79 ms | 32.2% bf16 MFU | 122805 tok/s step 6618/19560 | loss 3.450230 (-0.56z)| norm 0.2451 (-0.34z)| lr 6.32e-04 | 4215.46 ms | 32.0% bf16 MFU | 122883 tok/s step 6619/19560 | loss 3.532301 (+1.41z)| norm 0.2747 (+0.13z)| lr 6.32e-04 | 4196.36 ms | 32.2% bf16 MFU | 122986 tok/s step 6620/19560 | loss 3.479715 (+0.14z)| norm 0.2523 (-0.23z)| lr 6.32e-04 | 4191.02 ms | 32.2% bf16 MFU | 123092 tok/s step 6621/19560 | loss 3.477833 (+0.10z)| norm 0.2626 (-0.07z)| lr 6.32e-04 | 4234.47 ms | 31.9% bf16 MFU | 123128 tok/s step 6622/19560 | loss 3.469463 (-0.11z)| norm 0.2674 (+0.01z)| lr 6.32e-04 | 4187.18 ms | 32.2% bf16 MFU | 123232 tok/s step 6623/19560 | loss 3.432439 (-0.98z)| norm 0.2950 (+0.45z)| lr 6.32e-04 | 4189.70 ms | 32.2% bf16 MFU | 123327 tok/s step 6624/19560 | loss 3.504588 (+0.78z)| norm 0.3037 (+0.58z)| lr 6.32e-04 | 4184.71 ms | 32.3% bf16 MFU | 123425 tok/s step 6625/19560 | loss 3.440806 (-0.81z)| norm 0.2348 (-0.52z)| lr 6.32e-04 | 4217.06 ms | 32.0% bf16 MFU | 123470 tok/s step 6626/19560 | loss 3.509544 (+0.89z)| norm 0.2887 (+0.34z)| lr 6.32e-04 | 4177.40 ms | 32.3% bf16 MFU | 123572 tok/s step 6627/19560 | loss 3.544233 (+1.72z)| norm 0.2796 (+0.19z)| lr 6.32e-04 | 4209.22 ms | 32.1% bf16 MFU | 123621 tok/s step 6628/19560 | loss 3.491629 (+0.43z)| norm 0.2319 (-0.57z)| lr 6.32e-04 | 4356.60 ms | 
31.0% bf16 MFU | 123457 tok/s step 6629/19560 | loss 3.506476 (+0.79z)| norm 0.2616 (-0.10z)| lr 6.32e-04 | 4172.30 ms | 32.4% bf16 MFU | 123567 tok/s step 6630/19560 | loss 3.517059 (+1.05z)| norm 0.2324 (-0.56z)| lr 6.32e-04 | 4195.66 ms | 32.2% bf16 MFU | 123637 tok/s step 6631/19560 | loss 3.461546 (-0.32z)| norm 0.2676 (+0.00z)| lr 6.32e-04 | 4179.96 ms | 32.3% bf16 MFU | 123727 tok/s step 6632/19560 | loss 3.582539 (+2.58z)| norm 0.2562 (-0.18z)| lr 6.32e-04 | 4223.07 ms | 32.0% bf16 MFU | 123748 tok/s step 6633/19560 | loss 3.560027 (+2.00z)| norm 0.2696 (+0.03z)| lr 6.32e-04 | 4181.99 ms | 32.3% bf16 MFU | 123829 tok/s step 6634/19560 | loss 3.520500 (+1.04z)| norm 0.2531 (-0.23z)| lr 6.32e-04 | 4174.49 ms | 32.3% bf16 MFU | 123917 tok/s step 6635/19560 | loss 3.518726 (+0.99z)| norm 0.2496 (-0.29z)| lr 6.31e-04 | 4187.63 ms | 32.2% bf16 MFU | 123981 tok/s step 6636/19560 | loss 3.568887 (+2.13z)| norm 0.2963 (+0.45z)| lr 6.31e-04 | 4176.38 ms | 32.3% bf16 MFU | 124059 tok/s step 6637/19560 | loss 3.479611 (+0.03z)| norm 0.2602 (-0.13z)| lr 6.31e-04 | 4209.87 ms | 32.1% bf16 MFU | 124083 tok/s step 6638/19560 | loss 3.451921 (-0.61z)| norm 0.2629 (-0.09z)| lr 6.31e-04 | 4196.78 ms | 32.2% bf16 MFU | 124125 tok/s step 6639/19560 | loss 3.420595 (-1.33z)| norm 0.2548 (-0.22z)| lr 6.31e-04 | 4179.12 ms | 32.3% bf16 MFU | 124191 tok/s step 6640/19560 | loss 3.538596 (+1.39z)| norm 0.2569 (-0.19z)| lr 6.31e-04 | 4203.57 ms | 32.1% bf16 MFU | 124218 tok/s step 6641/19560 | loss 3.521850 (+1.00z)| norm 0.3026 (+0.54z)| lr 6.31e-04 | 4175.16 ms | 32.3% bf16 MFU | 124286 tok/s step 6642/19560 | loss 3.514752 (+0.83z)| norm 0.2739 (+0.08z)| lr 6.31e-04 | 4185.03 ms | 32.3% bf16 MFU | 124335 tok/s step 6643/19560 | loss 3.494106 (+0.34z)| norm 0.2529 (-0.26z)| lr 6.31e-04 | 4164.95 ms | 32.4% bf16 MFU | 124413 tok/s step 6644/19560 | loss 3.532020 (+1.20z)| norm 0.2684 (-0.01z)| lr 6.31e-04 | 4168.54 ms | 32.4% bf16 MFU | 124481 tok/s step 6645/19560 | loss 3.501875 
(+0.51z)| norm 0.2550 (-0.22z)| lr 6.31e-04 | 4175.33 ms | 32.3% bf16 MFU | 124535 tok/s step 6646/19560 | loss 3.437392 (-1.02z)| norm 0.2738 (+0.08z)| lr 6.31e-04 | 4184.21 ms | 32.3% bf16 MFU | 124573 tok/s step 6647/19560 | loss 3.621328 (+3.18z)| norm 0.2499 (-0.31z)| lr 6.31e-04 | 4180.18 ms | 32.3% bf16 MFU | 124616 tok/s step 6648/19560 | loss 3.513954 (+0.73z)| norm 0.2446 (-0.39z)| lr 6.31e-04 | 4174.09 ms | 32.3% bf16 MFU | 124665 tok/s step 6649/19560 | loss 3.516695 (+0.79z)| norm 0.2740 (+0.08z)| lr 6.31e-04 | 4195.47 ms | 32.2% bf16 MFU | 124680 tok/s step 6650/19560 | loss 3.486765 (+0.10z)| norm 0.2576 (-0.18z)| lr 6.31e-04 | 4180.54 ms | 32.3% bf16 MFU | 124717 tok/s step 6651/19560 | loss 3.505196 (+0.51z)| norm 0.2644 (-0.08z)| lr 6.31e-04 | 4183.73 ms | 32.3% bf16 MFU | 124747 tok/s step 6652/19560 | loss 3.510109 (+0.62z)| norm 0.2640 (-0.08z)| lr 6.31e-04 | 4190.62 ms | 32.2% bf16 MFU | 124765 tok/s step 6653/19560 | loss 3.472968 (-0.24z)| norm 0.2532 (-0.25z)| lr 6.30e-04 | 4182.30 ms | 32.3% bf16 MFU | 124795 tok/s step 6654/19560 | loss 3.524305 (+0.96z)| norm 0.3141 (+0.72z)| lr 6.30e-04 | 4180.49 ms | 32.3% bf16 MFU | 124826 tok/s step 6655/19560 | loss 3.477407 (-0.13z)| norm 0.2730 (+0.06z)| lr 6.30e-04 | 4190.41 ms | 32.2% bf16 MFU | 124840 tok/s step 6656/19560 | loss 3.455893 (-0.63z)| norm 0.3124 (+0.69z)| lr 6.30e-04 | 4189.23 ms | 32.2% bf16 MFU | 124856 tok/s step 6657/19560 | loss 3.448781 (-0.80z)| norm 0.2479 (-0.34z)| lr 6.30e-04 | 4192.40 ms | 32.2% bf16 MFU | 124866 tok/s step 6658/19560 | loss 3.509162 (+0.61z)| norm 0.2614 (-0.13z)| lr 6.30e-04 | 4178.07 ms | 32.3% bf16 MFU | 124897 tok/s step 6659/19560 | loss 3.471035 (-0.29z)| norm 0.2570 (-0.20z)| lr 6.30e-04 | 4199.89 ms | 32.1% bf16 MFU | 124894 tok/s step 6660/19560 | loss 3.530778 (+1.11z)| norm 0.2714 (+0.23z)| lr 6.30e-04 | 4182.43 ms | 32.3% bf16 MFU | 124917 tok/s step 6661/19560 | loss 3.495147 (+0.27z)| norm 0.2723 (+0.34z)| lr 6.30e-04 | 4203.21 ms | 
32.1% bf16 MFU | 124908 tok/s step 6662/19560 | loss 3.535155 (+1.22z)| norm 0.2499 (-0.51z)| lr 6.30e-04 | 4179.51 ms | 32.3% bf16 MFU | 124934 tok/s step 6663/19560 | loss 3.466307 (-0.41z)| norm 0.2678 (+0.26z)| lr 6.30e-04 | 4180.12 ms | 32.3% bf16 MFU | 124959 tok/s step 6664/19560 | loss 3.487172 (+0.09z)| norm 0.2342 (-1.18z)| lr 6.30e-04 | 4188.37 ms | 32.2% bf16 MFU | 124970 tok/s step 6665/19560 | loss 3.575331 (+2.12z)| norm 0.2596 (-0.04z)| lr 6.30e-04 | 4175.17 ms | 32.3% bf16 MFU | 125000 tok/s step 6666/19560 | loss 3.476408 (-0.18z)| norm 0.2900 (+1.39z)| lr 6.30e-04 | 4235.69 ms | 31.9% bf16 MFU | 124939 tok/s step 6667/19560 | loss 3.552089 (+1.55z)| norm 0.2649 (+0.25z)| lr 6.30e-04 | 4181.25 ms | 32.3% bf16 MFU | 124961 tok/s step 6668/19560 | loss 3.498757 (+0.32z)| norm 0.2509 (-0.42z)| lr 6.30e-04 | 4175.83 ms | 32.3% bf16 MFU | 124991 tok/s step 6669/19560 | loss 3.526522 (+0.95z)| norm 0.2496 (-0.47z)| lr 6.30e-04 | 4195.85 ms | 32.2% bf16 MFU | 124989 tok/s step 6670/19560 | loss 3.463934 (-0.49z)| norm 0.2640 (+0.23z)| lr 6.30e-04 | 4183.48 ms | 32.3% bf16 MFU | 125006 tok/s step 6671/19560 | loss 3.471077 (-0.33z)| norm 0.2721 (+0.64z)| lr 6.29e-04 | 4182.30 ms | 32.3% bf16 MFU | 125023 tok/s step 6672/19560 | loss 3.464789 (-0.47z)| norm 0.2685 (+0.46z)| lr 6.29e-04 | 4183.01 ms | 32.3% bf16 MFU | 125039 tok/s step 6673/19560 | loss 3.433052 (-1.19z)| norm 0.2593 (+0.01z)| lr 6.29e-04 | 4181.15 ms | 32.3% bf16 MFU | 125057 tok/s step 6674/19560 | loss 3.432256 (-1.20z)| norm 0.2690 (+0.50z)| lr 6.29e-04 | 4181.21 ms | 32.3% bf16 MFU | 125074 tok/s step 6675/19560 | loss 3.497479 (+0.29z)| norm 0.2422 (-0.85z)| lr 6.29e-04 | 4176.77 ms | 32.3% bf16 MFU | 125096 tok/s step 6676/19560 | loss 3.465916 (-0.43z)| norm 0.2697 (+0.53z)| lr 6.29e-04 | 4181.29 ms | 32.3% bf16 MFU | 125111 tok/s step 6677/19560 | loss 3.522801 (+0.86z)| norm 0.2503 (-0.44z)| lr 6.29e-04 | 4184.35 ms | 32.3% bf16 MFU | 125120 tok/s step 6678/19560 | loss 3.518888 
(+0.76z)| norm 0.2619 (+0.13z)| lr 6.29e-04 | 4175.76 ms | 32.3% bf16 MFU | 125142 tok/s step 6679/19560 | loss 3.457593 (-0.65z)| norm 0.2606 (+0.07z)| lr 6.29e-04 | 4193.77 ms | 32.2% bf16 MFU | 125136 tok/s step 6680/19560 | loss 3.445765 (-0.91z)| norm 0.2709 (+0.58z)| lr 6.29e-04 | 4182.35 ms | 32.3% bf16 MFU | 125147 tok/s step 6681/19560 | loss 3.422789 (-1.41z)| norm 0.2644 (+0.25z)| lr 6.29e-04 | 4189.18 ms | 32.2% bf16 MFU | 125147 tok/s step 6682/19560 | loss 3.454277 (-0.68z)| norm 0.2752 (+0.78z)| lr 6.29e-04 | 4178.40 ms | 32.3% bf16 MFU | 125163 tok/s step 6683/19560 | loss 3.503883 (+0.47z)| norm 0.2431 (-0.85z)| lr 6.29e-04 | 4178.70 ms | 32.3% bf16 MFU | 125179 tok/s step 6684/19560 | loss 3.478886 (-0.11z)| norm 0.2643 (+0.31z)| lr 6.29e-04 | 4180.49 ms | 32.3% bf16 MFU | 125190 tok/s step 6685/19560 | loss 3.477250 (-0.16z)| norm 0.2494 (-0.54z)| lr 6.29e-04 | 4189.27 ms | 32.2% bf16 MFU | 125188 tok/s step 6686/19560 | loss 3.534110 (+1.15z)| norm 0.2549 (-0.23z)| lr 6.29e-04 | 4164.29 ms | 32.4% bf16 MFU | 125224 tok/s step 6687/19560 | loss 3.536467 (+1.20z)| norm 0.2320 (-1.54z)| lr 6.29e-04 | 4186.90 ms | 32.2% bf16 MFU | 125224 tok/s step 6688/19560 | loss 3.538516 (+1.23z)| norm 0.2446 (-0.81z)| lr 6.29e-04 | 4173.71 ms | 32.3% bf16 MFU | 125243 tok/s step 6689/19560 | loss 3.501432 (+0.36z)| norm 0.2268 (-1.81z)| lr 6.28e-04 | 4189.62 ms | 32.2% bf16 MFU | 125238 tok/s step 6690/19560 | loss 3.496080 (+0.22z)| norm 0.2554 (-0.17z)| lr 6.28e-04 | 4181.51 ms | 32.3% bf16 MFU | 125245 tok/s step 6691/19560 | loss 3.460541 (-0.62z)| norm 0.2394 (-1.08z)| lr 6.28e-04 | 4166.40 ms | 32.4% bf16 MFU | 125275 tok/s step 6692/19560 | loss 3.504018 (+0.41z)| norm 0.2278 (-1.70z)| lr 6.28e-04 | 4189.57 ms | 32.2% bf16 MFU | 125268 tok/s step 6693/19560 | loss 3.500164 (+0.31z)| norm 0.2488 (-0.53z)| lr 6.28e-04 | 4180.45 ms | 32.3% bf16 MFU | 125276 tok/s step 6694/19560 | loss 3.457036 (-0.72z)| norm 0.2604 (+0.15z)| lr 6.28e-04 | 4171.06 ms | 
32.4% bf16 MFU | 125297 tok/s step 6695/19560 | loss 3.465499 (-0.51z)| norm 0.2477 (-0.58z)| lr 6.28e-04 | 4179.21 ms | 32.3% bf16 MFU | 125304 tok/s step 6696/19560 | loss 3.491981 (+0.11z)| norm 0.2722 (+0.83z)| lr 6.28e-04 | 4178.87 ms | 32.3% bf16 MFU | 125312 tok/s step 6697/19560 | loss 3.521483 (+0.80z)| norm 0.2512 (-0.37z)| lr 6.28e-04 | 4169.84 ms | 32.4% bf16 MFU | 125333 tok/s step 6698/19560 | loss 3.628721 (+3.19z)| norm 0.2621 (+0.25z)| lr 6.28e-04 | 4197.32 ms | 32.2% bf16 MFU | 125312 tok/s step 6699/19560 | loss 3.487824 (-0.03z)| norm 0.2405 (-0.97z)| lr 6.28e-04 | 4219.06 ms | 32.0% bf16 MFU | 125260 tok/s step 6700/19560 | loss 3.468506 (-0.46z)| norm 0.2769 (+1.10z)| lr 6.28e-04 | 4179.10 ms | 32.3% bf16 MFU | 125270 tok/s step 6701/19560 | loss 3.540528 (+1.16z)| norm 0.2436 (-0.80z)| lr 6.28e-04 | 4170.37 ms | 32.4% bf16 MFU | 125292 tok/s step 6702/19560 | loss 3.503169 (+0.30z)| norm 0.2552 (-0.13z)| lr 6.28e-04 | 4171.78 ms | 32.4% bf16 MFU | 125311 tok/s step 6703/19560 | loss 3.477082 (-0.31z)| norm 0.2610 (+0.19z)| lr 6.28e-04 | 4184.06 ms | 32.3% bf16 MFU | 125311 tok/s step 6704/19560 | loss 3.431242 (-1.38z)| norm 0.2328 (-1.42z)| lr 6.28e-04 | 4177.21 ms | 32.3% bf16 MFU | 125321 tok/s step 6705/19560 | loss 3.468349 (-0.53z)| norm 0.2411 (-0.93z)| lr 6.28e-04 | 4176.81 ms | 32.3% bf16 MFU | 125331 tok/s step 6706/19560 | loss 3.525587 (+0.80z)| norm 0.2216 (-2.01z)| lr 6.28e-04 | 4180.12 ms | 32.3% bf16 MFU | 125336 tok/s step 6707/19560 | loss 3.451311 (-0.94z)| norm 0.2397 (-0.98z)| lr 6.27e-04 | 4187.76 ms | 32.2% bf16 MFU | 125329 tok/s step 6708/19560 | loss 3.449084 (-0.98z)| norm 0.2529 (-0.24z)| lr 6.27e-04 | 4190.52 ms | 32.2% bf16 MFU | 125318 tok/s step 6709/19560 | loss 3.479774 (-0.27z)| norm 0.2505 (-0.38z)| lr 6.27e-04 | 4208.56 ms | 32.1% bf16 MFU | 125281 tok/s step 6710/19560 | loss 3.488110 (-0.06z)| norm 0.2193 (-2.11z)| lr 6.27e-04 | 4200.05 ms | 32.1% bf16 MFU | 125258 tok/s step 6711/19560 | loss 3.484681 
(-0.15z)| norm 0.2732 (+0.90z)| lr 6.27e-04 | 4188.20 ms | 32.2% bf16 MFU | 125254 tok/s step 6712/19560 | loss 3.457555 (-0.82z)| norm 0.2458 (-0.66z)| lr 6.27e-04 | 4171.82 ms | 32.4% bf16 MFU | 125275 tok/s step 6713/19560 | loss 3.467102 (-0.58z)| norm 0.2506 (-0.40z)| lr 6.27e-04 | 4177.38 ms | 32.3% bf16 MFU | 125287 tok/s step 6714/19560 | loss 3.463299 (-0.68z)| norm 0.2408 (-0.95z)| lr 6.27e-04 | 4180.92 ms | 32.3% bf16 MFU | 125293 tok/s step 6715/19560 | loss 3.511019 (+0.48z)| norm 0.2649 (+0.41z)| lr 6.27e-04 | 4171.68 ms | 32.4% bf16 MFU | 125312 tok/s step 6716/19560 | loss 3.502058 (+0.25z)| norm 0.2439 (-0.78z)| lr 6.27e-04 | 4177.87 ms | 32.3% bf16 MFU | 125321 tok/s step 6717/19560 | loss 3.435005 (-1.43z)| norm 0.2495 (-0.46z)| lr 6.27e-04 | 4176.35 ms | 32.3% bf16 MFU | 125332 tok/s step 6718/19560 | loss 3.444307 (-1.19z)| norm 0.2599 (+0.13z)| lr 6.27e-04 | 4201.81 ms | 32.1% bf16 MFU | 125304 tok/s step 6719/19560 | loss 3.471139 (-0.52z)| norm 0.2474 (-0.59z)| lr 6.27e-04 | 4190.52 ms | 32.2% bf16 MFU | 125294 tok/s step 6720/19560 | loss 3.416463 (-1.84z)| norm 0.2636 (+0.34z)| lr 6.27e-04 | 4174.08 ms | 32.3% bf16 MFU | 125310 tok/s step 6721/19560 | loss 3.496103 (+0.11z)| norm 0.2635 (+0.32z)| lr 6.27e-04 | 4195.29 ms | 32.2% bf16 MFU | 125293 tok/s step 6722/19560 | loss 3.481537 (-0.24z)| norm 0.2782 (+1.16z)| lr 6.27e-04 | 4180.15 ms | 32.3% bf16 MFU | 125299 tok/s step 6723/19560 | loss 3.457112 (-0.85z)| norm 0.2746 (+0.94z)| lr 6.27e-04 | 4161.57 ms | 32.4% bf16 MFU | 125334 tok/s step 6724/19560 | loss 3.513905 (+0.54z)| norm 0.2576 (-0.02z)| lr 6.27e-04 | 4177.72 ms | 32.3% bf16 MFU | 125342 tok/s step 6725/19560 | loss 3.457561 (-0.84z)| norm 0.2714 (+0.77z)| lr 6.26e-04 | 4197.29 ms | 32.2% bf16 MFU | 125320 tok/s step 6726/19560 | loss 3.517372 (+0.66z)| norm 0.2895 (+1.77z)| lr 6.26e-04 | 4182.26 ms | 32.3% bf16 MFU | 125322 tok/s step 6727/19560 | loss 3.501628 (+0.26z)| norm 0.2777 (+1.08z)| lr 6.26e-04 | 4174.02 ms | 
32.3% bf16 MFU | 125336 tok/s step 6728/19560 | loss 3.441070 (-1.26z)| norm 0.2482 (-0.58z)| lr 6.26e-04 | 4169.90 ms | 32.4% bf16 MFU | 125356 tok/s step 6729/19560 | loss 3.479614 (-0.29z)| norm 0.2635 (+0.27z)| lr 6.26e-04 | 4169.66 ms | 32.4% bf16 MFU | 125375 tok/s step 6730/19560 | loss 3.487843 (-0.09z)| norm 0.2481 (-0.59z)| lr 6.26e-04 | 4174.59 ms | 32.3% bf16 MFU | 125386 tok/s step 6731/19560 | loss 3.516961 (+0.65z)| norm 0.2680 (+0.52z)| lr 6.26e-04 | 4163.19 ms | 32.4% bf16 MFU | 125414 tok/s step 6732/19560 | loss 3.486746 (-0.11z)| norm 0.2407 (-1.02z)| lr 6.26e-04 | 4188.34 ms | 32.2% bf16 MFU | 125402 tok/s step 6733/19560 | loss 3.516361 (+0.66z)| norm 0.2539 (-0.26z)| lr 6.26e-04 | 4176.01 ms | 32.3% bf16 MFU | 125409 tok/s step 6734/19560 | loss 3.534408 (+1.11z)| norm 0.2532 (-0.30z)| lr 6.26e-04 | 4185.60 ms | 32.3% bf16 MFU | 125402 tok/s step 6735/19560 | loss 3.507453 (+0.41z)| norm 0.2459 (-0.71z)| lr 6.26e-04 | 4181.87 ms | 32.3% bf16 MFU | 125400 tok/s step 6736/19560 | loss 3.488824 (-0.06z)| norm 0.2898 (+1.78z)| lr 6.26e-04 | 4182.25 ms | 32.3% bf16 MFU | 125398 tok/s step 6737/19560 | loss 3.473382 (-0.46z)| norm 0.2969 (+2.13z)| lr 6.26e-04 | 4168.40 ms | 32.4% bf16 MFU | 125417 tok/s step 6738/19560 | loss 3.482825 (-0.21z)| norm 0.2472 (-0.67z)| lr 6.26e-04 | 4168.83 ms | 32.4% bf16 MFU | 125434 tok/s step 6739/19560 | loss 3.484867 (-0.17z)| norm 0.2381 (-1.17z)| lr 6.26e-04 | 4161.36 ms | 32.4% bf16 MFU | 125462 tok/s step 6740/19560 | loss 3.463095 (-0.77z)| norm 0.2380 (-1.16z)| lr 6.26e-04 | 4186.78 ms | 32.2% bf16 MFU | 125450 tok/s step 6741/19560 | loss 3.590240 (+2.54z)| norm 0.2696 (+0.60z)| lr 6.26e-04 | 4240.90 ms | 31.8% bf16 MFU | 125359 tok/s step 6742/19560 | loss 3.506381 (+0.36z)| norm 0.2954 (+2.01z)| lr 6.26e-04 | 4180.74 ms | 32.3% bf16 MFU | 125361 tok/s step 6743/19560 | loss 3.462869 (-0.77z)| norm 0.2592 (-0.01z)| lr 6.25e-04 | 4167.19 ms | 32.4% bf16 MFU | 125384 tok/s step 6744/19560 | loss 3.468162 
(-0.62z)| norm 0.2443 (-0.85z)| lr 6.25e-04 | 4180.38 ms | 32.3% bf16 MFU | 125386 tok/s step 6745/19560 | loss 3.485630 (-0.17z)| norm 0.2771 (+0.97z)| lr 6.25e-04 | 4169.86 ms | 32.4% bf16 MFU | 125403 tok/s step 6746/19560 | loss 3.474900 (-0.45z)| norm 0.2564 (-0.19z)| lr 6.25e-04 | 4399.66 ms | 30.7% bf16 MFU | 125091 tok/s step 6747/19560 | loss 3.511701 (+0.51z)| norm 0.2629 (+0.18z)| lr 6.25e-04 | 4166.65 ms | 32.4% bf16 MFU | 125128 tok/s step 6748/19560 | loss 3.466282 (-0.67z)| norm 0.2472 (-0.70z)| lr 6.25e-04 | 4169.52 ms | 32.4% bf16 MFU | 125159 tok/s step 6749/19560 | loss 3.464959 (-0.70z)| norm 0.2771 (+0.97z)| lr 6.25e-04 | 4203.88 ms | 32.1% bf16 MFU | 125137 tok/s step 6750/19560 | loss 3.497038 (+0.13z)| norm 0.2609 (+0.07z)| lr 6.25e-04 | 4165.44 ms | 32.4% bf16 MFU | 125173 tok/s val loss 3.467884 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 
evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating 
HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2823/10042 = 0.281119 step 6751/19560 | loss 3.528957 (+0.95z)| norm 0.2555 (-0.22z)| lr 6.25e-04 | 4252.88 ms | 31.7% bf16 MFU | 125078 tok/s step 6752/19560 | loss 3.464902 (-0.73z)| norm 0.2465 (-0.73z)| lr 6.25e-04 | 5083.62 ms | 26.6% bf16 MFU | 123981 tok/s step 6753/19560 | loss 3.479517 (-0.35z)| norm 0.2833 (+1.40z)| lr 6.25e-04 | 4299.36 ms | 31.4% bf16 MFU | 123879 tok/s step 6754/19560 | loss 3.469874 (-0.60z)| norm 0.2412 (-1.04z)| lr 6.25e-04 | 4371.57 ms | 30.9% bf16 MFU | 123682 tok/s step 6755/19560 | loss 3.479860 (-0.32z)| norm 0.2488 (-0.58z)| lr 6.25e-04 | 4279.40 ms | 31.6% bf16 MFU | 123623 tok/s step 6756/19560 | loss 3.465564 (-0.70z)| norm 0.2427 (-0.95z)| lr 6.25e-04 | 4304.07 ms | 31.4% bf16 MFU | 123533 tok/s step 6757/19560 | loss 3.449389 (-1.11z)| norm 0.2532 (-0.33z)| lr 6.25e-04 | 4277.18 ms | 31.6% bf16 MFU | 123485 tok/s step 6758/19560 | loss 3.528562 (+0.97z)| norm 0.2570 (-0.12z)| lr 6.25e-04 | 4764.60 ms | 28.3% bf16 MFU | 122813 tok/s step 6759/19560 | loss 3.478643 (-0.34z)| norm 0.2461 (-0.75z)| lr 6.25e-04 | 4196.58 ms | 32.2% bf16 MFU | 122919 tok/s step 6760/19560 | loss 3.465358 (-0.68z)| norm 0.2657 (+0.41z)| lr 6.25e-04 | 4266.42 ms | 31.6% bf16 MFU | 122917 tok/s 
step 6761/19560 | loss 3.437083 (-1.43z)| norm 0.2538 (-0.29z)| lr 6.24e-04 | 4159.98 ms | 32.5% bf16 MFU | 123073 tok/s step 6762/19560 | loss 3.490062 (+0.01z)| norm 0.2501 (-0.51z)| lr 6.24e-04 | 4170.49 ms | 32.4% bf16 MFU | 123205 tok/s step 6763/19560 | loss 3.437561 (-1.39z)| norm 0.2789 (+1.19z)| lr 6.24e-04 | 4184.89 ms | 32.3% bf16 MFU | 123309 tok/s step 6764/19560 | loss 3.399135 (-2.38z)| norm 0.2651 (+0.39z)| lr 6.24e-04 | 4164.49 ms | 32.4% bf16 MFU | 123438 tok/s step 6765/19560 | loss 3.479336 (-0.22z)| norm 0.2677 (+0.54z)| lr 6.24e-04 | 4178.81 ms | 32.3% bf16 MFU | 123539 tok/s step 6766/19560 | loss 3.478652 (-0.25z)| norm 0.2559 (-0.17z)| lr 6.24e-04 | 4202.21 ms | 32.1% bf16 MFU | 123601 tok/s step 6767/19560 | loss 3.519208 (+0.84z)| norm 0.2639 (+0.31z)| lr 6.24e-04 | 4178.92 ms | 32.3% bf16 MFU | 123694 tok/s step 6768/19560 | loss 3.506874 (+0.51z)| norm 0.2711 (+0.74z)| lr 6.24e-04 | 4161.72 ms | 32.4% bf16 MFU | 123808 tok/s step 6769/19560 | loss 3.486137 (-0.05z)| norm 0.2385 (-1.23z)| lr 6.24e-04 | 4170.20 ms | 32.4% bf16 MFU | 123904 tok/s step 6770/19560 | loss 3.487362 (-0.01z)| norm 0.2511 (-0.44z)| lr 6.24e-04 | 4256.14 ms | 31.7% bf16 MFU | 123868 tok/s step 6771/19560 | loss 3.489001 (+0.03z)| norm 0.2719 (+0.84z)| lr 6.24e-04 | 4161.91 ms | 32.4% bf16 MFU | 123973 tok/s step 6772/19560 | loss 3.455036 (-0.89z)| norm 0.2585 (+0.02z)| lr 6.24e-04 | 4174.75 ms | 32.3% bf16 MFU | 124053 tok/s step 6773/19560 | loss 3.458964 (-0.77z)| norm 0.2614 (+0.19z)| lr 6.24e-04 | 4182.68 ms | 32.3% bf16 MFU | 124118 tok/s step 6774/19560 | loss 3.473854 (-0.37z)| norm 0.2444 (-0.84z)| lr 6.24e-04 | 4187.57 ms | 32.2% bf16 MFU | 124172 tok/s step 6775/19560 | loss 3.473564 (-0.37z)| norm 0.2666 (+0.52z)| lr 6.24e-04 | 4182.58 ms | 32.3% bf16 MFU | 124231 tok/s step 6776/19560 | loss 3.515696 (+0.88z)| norm 0.2426 (-0.96z)| lr 6.24e-04 | 4170.83 ms | 32.4% bf16 MFU | 124305 tok/s step 6777/19560 | loss 3.458265 (-0.80z)| norm 0.2327 (-1.55z)| 
lr 6.24e-04 | 4172.12 ms | 32.4% bf16 MFU | 124373 tok/s step 6778/19560 | loss 3.451470 (-0.99z)| norm 0.2377 (-1.22z)| lr 6.24e-04 | 4180.24 ms | 32.3% bf16 MFU | 124425 tok/s step 6779/19560 | loss 3.493557 (+0.25z)| norm 0.2209 (-2.19z)| lr 6.23e-04 | 4189.12 ms | 32.2% bf16 MFU | 124462 tok/s step 6780/19560 | loss 3.512725 (+0.81z)| norm 0.2542 (-0.19z)| lr 6.23e-04 | 4475.47 ms | 30.2% bf16 MFU | 124096 tok/s step 6781/19560 | loss 3.481196 (-0.12z)| norm 0.2559 (-0.08z)| lr 6.23e-04 | 4686.41 ms | 28.8% bf16 MFU | 123485 tok/s step 6782/19560 | loss 3.513145 (+0.83z)| norm 0.2507 (-0.39z)| lr 6.23e-04 | 4551.25 ms | 29.7% bf16 MFU | 123070 tok/s step 6783/19560 | loss 3.505817 (+0.60z)| norm 0.2596 (+0.19z)| lr 6.23e-04 | 4226.36 ms | 31.9% bf16 MFU | 123120 tok/s step 6784/19560 | loss 3.501847 (+0.48z)| norm 0.2811 (+1.63z)| lr 6.23e-04 | 4383.50 ms | 30.8% bf16 MFU | 122944 tok/s step 6785/19560 | loss 3.495082 (+0.27z)| norm 0.2326 (-1.55z)| lr 6.23e-04 | 4244.41 ms | 31.8% bf16 MFU | 122973 tok/s step 6786/19560 | loss 3.451876 (-1.00z)| norm 0.2389 (-1.13z)| lr 6.23e-04 | 4308.47 ms | 31.3% bf16 MFU | 122909 tok/s step 6787/19560 | loss 3.465515 (-0.59z)| norm 0.2524 (-0.24z)| lr 6.23e-04 | 4323.19 ms | 31.2% bf16 MFU | 122827 tok/s step 6788/19560 | loss 3.551629 (+1.93z)| norm 0.2618 (+0.38z)| lr 6.23e-04 | 4175.19 ms | 32.3% bf16 MFU | 122964 tok/s step 6789/19560 | loss 3.478119 (-0.22z)| norm 0.2208 (-2.24z)| lr 6.23e-04 | 4240.01 ms | 31.8% bf16 MFU | 122998 tok/s step 6790/19560 | loss 3.500223 (+0.44z)| norm 0.2575 (+0.11z)| lr 6.23e-04 | 4218.85 ms | 32.0% bf16 MFU | 123062 tok/s step 6791/19560 | loss 3.518542 (+0.97z)| norm 0.2566 (+0.06z)| lr 6.23e-04 | 4272.11 ms | 31.6% bf16 MFU | 123045 tok/s step 6792/19560 | loss 3.494439 (+0.25z)| norm 0.2500 (-0.37z)| lr 6.23e-04 | 4177.14 ms | 32.3% bf16 MFU | 123169 tok/s step 6793/19560 | loss 3.475418 (-0.29z)| norm 0.2614 (+0.37z)| lr 6.23e-04 | 4208.42 ms | 32.1% bf16 MFU | 123239 tok/s step 
6794/19560 | loss 3.441592 (-1.30z)| norm 0.2417 (-0.90z)| lr 6.23e-04 | 4173.21 ms | 32.4% bf16 MFU | 123359 tok/s step 6795/19560 | loss 3.406090 (-2.32z)| norm 0.2552 (-0.01z)| lr 6.23e-04 | 4239.70 ms | 31.8% bf16 MFU | 123374 tok/s step 6796/19560 | loss 3.481718 (-0.05z)| norm 0.2591 (+0.25z)| lr 6.23e-04 | 4252.01 ms | 31.8% bf16 MFU | 123371 tok/s step 6797/19560 | loss 3.475786 (-0.22z)| norm 0.2746 (+1.25z)| lr 6.22e-04 | 4262.92 ms | 31.7% bf16 MFU | 123351 tok/s step 6798/19560 | loss 3.454395 (-0.86z)| norm 0.2425 (-0.84z)| lr 6.22e-04 | 4193.57 ms | 32.2% bf16 MFU | 123435 tok/s step 6799/19560 | loss 3.480740 (-0.07z)| norm 0.2756 (+1.32z)| lr 6.22e-04 | 4164.66 ms | 32.4% bf16 MFU | 123558 tok/s step 6800/19560 | loss 3.638572 (+4.30z)| norm 0.2591 (+0.24z)| lr 6.22e-04 | 4170.70 ms | 32.4% bf16 MFU | 123665 tok/s step 6801/19560 | loss 3.433567 (-1.42z)| norm 0.2434 (-0.78z)| lr 6.22e-04 | 4166.71 ms | 32.4% bf16 MFU | 123773 tok/s step 6802/19560 | loss 3.501410 (+0.46z)| norm 0.2708 (+1.02z)| lr 6.22e-04 | 4167.39 ms | 32.4% bf16 MFU | 123875 tok/s step 6803/19560 | loss 3.617699 (+3.52z)| norm 0.2575 (+0.14z)| lr 6.22e-04 | 4167.53 ms | 32.4% bf16 MFU | 123971 tok/s step 6804/19560 | loss 3.437603 (-1.28z)| norm 0.2487 (-0.43z)| lr 6.22e-04 | 4164.19 ms | 32.4% bf16 MFU | 124068 tok/s step 6805/19560 | loss 3.503763 (+0.48z)| norm 0.2534 (-0.12z)| lr 6.22e-04 | 4169.14 ms | 32.4% bf16 MFU | 124152 tok/s step 6806/19560 | loss 3.536690 (+1.35z)| norm 0.2649 (+0.63z)| lr 6.22e-04 | 4184.06 ms | 32.3% bf16 MFU | 124210 tok/s step 6807/19560 | loss 3.498476 (+0.33z)| norm 0.2648 (+0.63z)| lr 6.22e-04 | 4167.27 ms | 32.4% bf16 MFU | 124290 tok/s step 6808/19560 | loss 3.462350 (-0.64z)| norm 0.2443 (-0.71z)| lr 6.22e-04 | 4171.90 ms | 32.4% bf16 MFU | 124359 tok/s step 6809/19560 | loss 3.481205 (-0.15z)| norm 0.2435 (-0.75z)| lr 6.22e-04 | 4168.12 ms | 32.4% bf16 MFU | 124430 tok/s step 6810/19560 | loss 3.543394 (+1.50z)| norm 0.2523 (-0.16z)| lr 
6.22e-04 | 4180.84 ms | 32.3% bf16 MFU | 124479 tok/s step 6811/19560 | loss 3.486437 (-0.02z)| norm 0.2256 (-1.90z)| lr 6.22e-04 | 4165.58 ms | 32.4% bf16 MFU | 124548 tok/s step 6812/19560 | loss 3.501292 (+0.37z)| norm 0.2726 (+1.17z)| lr 6.22e-04 | 4177.26 ms | 32.3% bf16 MFU | 124596 tok/s step 6813/19560 | loss 3.457992 (-0.78z)| norm 0.2519 (-0.18z)| lr 6.22e-04 | 4166.66 ms | 32.4% bf16 MFU | 124658 tok/s step 6814/19560 | loss 3.559621 (+1.92z)| norm 0.2380 (-1.07z)| lr 6.21e-04 | 4193.83 ms | 32.2% bf16 MFU | 124676 tok/s step 6815/19560 | loss 3.481304 (-0.15z)| norm 0.2564 (+0.11z)| lr 6.21e-04 | 4184.19 ms | 32.3% bf16 MFU | 124707 tok/s step 6816/19560 | loss 3.493880 (+0.19z)| norm 0.3005 (+2.88z)| lr 6.21e-04 | 4167.57 ms | 32.4% bf16 MFU | 124762 tok/s step 6817/19560 | loss 3.525849 (+1.05z)| norm 0.2694 (+0.89z)| lr 6.21e-04 | 4165.63 ms | 32.4% bf16 MFU | 124817 tok/s step 6818/19560 | loss 3.499655 (+0.34z)| norm 0.2321 (-1.47z)| lr 6.21e-04 | 4211.79 ms | 32.1% bf16 MFU | 124800 tok/s step 6819/19560 | loss 3.507508 (+0.54z)| norm 0.2655 (+0.63z)| lr 6.21e-04 | 4181.03 ms | 32.3% bf16 MFU | 124830 tok/s step 6820/19560 | loss 3.505140 (+0.48z)| norm 0.2559 (+0.01z)| lr 6.21e-04 | 4162.97 ms | 32.4% bf16 MFU | 124885 tok/s step 6821/19560 | loss 3.439947 (-1.25z)| norm 0.2393 (-1.05z)| lr 6.21e-04 | 4168.25 ms | 32.4% bf16 MFU | 124930 tok/s step 6822/19560 | loss 3.547091 (+1.58z)| norm 0.2334 (-1.41z)| lr 6.21e-04 | 4162.87 ms | 32.4% bf16 MFU | 124981 tok/s step 6823/19560 | loss 3.527444 (+1.04z)| norm 0.2460 (-0.60z)| lr 6.21e-04 | 4179.61 ms | 32.3% bf16 MFU | 125004 tok/s step 6824/19560 | loss 3.573311 (+2.19z)| norm 0.2640 (+0.55z)| lr 6.21e-04 | 4165.35 ms | 32.4% bf16 MFU | 125047 tok/s step 6825/19560 | loss 3.551604 (+1.61z)| norm 0.2837 (+1.77z)| lr 6.21e-04 | 4170.92 ms | 32.4% bf16 MFU | 125080 tok/s step 6826/19560 | loss 3.483182 (-0.12z)| norm 0.2847 (+1.80z)| lr 6.21e-04 | 4168.63 ms | 32.4% bf16 MFU | 125114 tok/s step 
6827/19560 | loss 3.466382 (-0.57z)| norm 0.2483 (-0.48z)| lr 6.21e-04 | 4169.25 ms | 32.4% bf16 MFU | 125146 tok/s step 6828/19560 | loss 3.416378 (-1.89z)| norm 0.2683 (+0.78z)| lr 6.21e-04 | 4177.67 ms | 32.3% bf16 MFU | 125164 tok/s step 6829/19560 | loss 3.511722 (+0.67z)| norm 0.2410 (-0.93z)| lr 6.21e-04 | 4170.00 ms | 32.4% bf16 MFU | 125192 tok/s step 6830/19560 | loss 3.468138 (-0.50z)| norm 0.2425 (-0.83z)| lr 6.21e-04 | 4172.51 ms | 32.4% bf16 MFU | 125215 tok/s step 6831/19560 | loss 3.519480 (+0.87z)| norm 0.2628 (+0.44z)| lr 6.21e-04 | 4167.47 ms | 32.4% bf16 MFU | 125244 tok/s step 6832/19560 | loss 3.508631 (+0.57z)| norm 0.2561 (+0.01z)| lr 6.20e-04 | 4174.72 ms | 32.3% bf16 MFU | 125261 tok/s step 6833/19560 | loss 3.468760 (-0.51z)| norm 0.2592 (+0.20z)| lr 6.20e-04 | 4190.28 ms | 32.2% bf16 MFU | 125254 tok/s step 6834/19560 | loss 3.473668 (-0.37z)| norm 0.2582 (+0.12z)| lr 6.20e-04 | 4160.58 ms | 32.5% bf16 MFU | 125292 tok/s step 6835/19560 | loss 3.473329 (-0.38z)| norm 0.2583 (+0.12z)| lr 6.20e-04 | 4170.73 ms | 32.4% bf16 MFU | 125313 tok/s step 6836/19560 | loss 3.506150 (+0.50z)| norm 0.2350 (-1.37z)| lr 6.20e-04 | 4171.22 ms | 32.4% bf16 MFU | 125332 tok/s step 6837/19560 | loss 3.505918 (+0.49z)| norm 0.2592 (+0.18z)| lr 6.20e-04 | 4165.24 ms | 32.4% bf16 MFU | 125359 tok/s step 6838/19560 | loss 3.524788 (+0.99z)| norm 0.2451 (-0.75z)| lr 6.20e-04 | 4164.38 ms | 32.4% bf16 MFU | 125386 tok/s step 6839/19560 | loss 3.490362 (+0.06z)| norm 0.2332 (-1.51z)| lr 6.20e-04 | 4185.13 ms | 32.3% bf16 MFU | 125380 tok/s step 6840/19560 | loss 3.499366 (+0.29z)| norm 0.2603 (+0.26z)| lr 6.20e-04 | 4165.50 ms | 32.4% bf16 MFU | 125405 tok/s step 6841/19560 | loss 3.537683 (+1.31z)| norm 0.2652 (+0.57z)| lr 6.20e-04 | 4170.61 ms | 32.4% bf16 MFU | 125420 tok/s step 6842/19560 | loss 3.499177 (+0.26z)| norm 0.2528 (-0.25z)| lr 6.20e-04 | 4173.19 ms | 32.4% bf16 MFU | 125430 tok/s step 6843/19560 | loss 3.536165 (+1.26z)| norm 0.2421 (-0.94z)| lr 
6.20e-04 | 4173.68 ms | 32.3% bf16 MFU | 125440 tok/s step 6844/19560 | loss 3.602991 (+2.94z)| norm 0.2553 (-0.08z)| lr 6.20e-04 | 4172.69 ms | 32.4% bf16 MFU | 125450 tok/s step 6845/19560 | loss 3.524393 (+0.87z)| norm 0.2403 (-1.05z)| lr 6.20e-04 | 4170.26 ms | 32.4% bf16 MFU | 125464 tok/s step 6846/19560 | loss 3.465746 (-0.67z)| norm 0.2356 (-1.34z)| lr 6.20e-04 | 4167.36 ms | 32.4% bf16 MFU | 125481 tok/s step 6847/19560 | loss 3.569851 (+2.02z)| norm 0.2358 (-1.32z)| lr 6.20e-04 | 4177.86 ms | 32.3% bf16 MFU | 125481 tok/s step 6848/19560 | loss 3.486487 (-0.16z)| norm 0.2686 (+0.80z)| lr 6.20e-04 | 4169.42 ms | 32.4% bf16 MFU | 125495 tok/s step 6849/19560 | loss 3.430994 (-1.59z)| norm 0.2494 (-0.43z)| lr 6.20e-04 | 4192.85 ms | 32.2% bf16 MFU | 125472 tok/s step 6850/19560 | loss 3.475238 (-0.44z)| norm 0.2313 (-1.57z)| lr 6.19e-04 | 4182.84 ms | 32.3% bf16 MFU | 125466 tok/s step 6851/19560 | loss 3.497307 (+0.13z)| norm 0.2412 (-0.92z)| lr 6.19e-04 | 4164.92 ms | 32.4% bf16 MFU | 125486 tok/s step 6852/19560 | loss 3.454092 (-0.99z)| norm 0.2670 (+0.74z)| lr 6.19e-04 | 4172.38 ms | 32.4% bf16 MFU | 125495 tok/s step 6853/19560 | loss 3.450017 (-1.09z)| norm 0.2419 (-0.87z)| lr 6.19e-04 | 4172.29 ms | 32.4% bf16 MFU | 125503 tok/s step 6854/19560 | loss 3.475346 (-0.42z)| norm 0.2727 (+1.15z)| lr 6.19e-04 | 4161.84 ms | 32.4% bf16 MFU | 125527 tok/s step 6855/19560 | loss 3.468290 (-0.60z)| norm 0.2771 (+1.44z)| lr 6.19e-04 | 4163.05 ms | 32.4% bf16 MFU | 125547 tok/s step 6856/19560 | loss 3.539190 (+1.23z)| norm 0.2706 (+1.00z)| lr 6.19e-04 | 4179.21 ms | 32.3% bf16 MFU | 125543 tok/s step 6857/19560 | loss 3.478883 (-0.34z)| norm 0.2796 (+1.56z)| lr 6.19e-04 | 4173.86 ms | 32.3% bf16 MFU | 125546 tok/s step 6858/19560 | loss 3.479400 (-0.33z)| norm 0.2652 (+0.62z)| lr 6.19e-04 | 4172.06 ms | 32.4% bf16 MFU | 125552 tok/s step 6859/19560 | loss 3.486920 (-0.13z)| norm 0.2853 (+1.89z)| lr 6.19e-04 | 4175.72 ms | 32.3% bf16 MFU | 125552 tok/s step 
6860/19560 | loss 3.503546 (+0.30z)| norm 0.2485 (-0.47z)| lr 6.19e-04 | 4166.84 ms | 32.4% bf16 MFU | 125566 tok/s step 6861/19560 | loss 3.471909 (-0.51z)| norm 0.3005 (+2.76z)| lr 6.19e-04 | 4165.06 ms | 32.4% bf16 MFU | 125581 tok/s step 6862/19560 | loss 3.500157 (+0.23z)| norm 0.2666 (+0.64z)| lr 6.19e-04 | 4167.92 ms | 32.4% bf16 MFU | 125592 tok/s step 6863/19560 | loss 3.489088 (-0.05z)| norm 0.2624 (+0.38z)| lr 6.19e-04 | 4169.17 ms | 32.4% bf16 MFU | 125600 tok/s step 6864/19560 | loss 3.454031 (-0.96z)| norm 0.2656 (+0.60z)| lr 6.19e-04 | 4175.90 ms | 32.3% bf16 MFU | 125598 tok/s step 6865/19560 | loss 3.560389 (+1.78z)| norm 0.2563 (+0.02z)| lr 6.19e-04 | 4165.89 ms | 32.4% bf16 MFU | 125610 tok/s step 6866/19560 | loss 3.494216 (+0.07z)| norm 0.2784 (+1.43z)| lr 6.19e-04 | 4172.18 ms | 32.4% bf16 MFU | 125613 tok/s step 6867/19560 | loss 3.477388 (-0.37z)| norm 0.2993 (+2.68z)| lr 6.18e-04 | 4221.98 ms | 32.0% bf16 MFU | 125541 tok/s step 6868/19560 | loss 3.502670 (+0.28z)| norm 0.5660 (+9.75z)| lr 6.18e-04 | 4165.06 ms | 32.4% bf16 MFU | 125558 tok/s step 6869/19560 | loss 3.507718 (+0.44z)| norm 0.3030 (+1.37z)| lr 6.18e-04 | 4159.49 ms | 32.5% bf16 MFU | 125583 tok/s step 6870/19560 | loss 3.496868 (+0.15z)| norm 0.3083 (+1.53z)| lr 6.18e-04 | 4164.33 ms | 32.4% bf16 MFU | 125598 tok/s step 6871/19560 | loss 3.486295 (-0.13z)| norm 0.2573 (-0.07z)| lr 6.18e-04 | 4159.16 ms | 32.5% bf16 MFU | 125621 tok/s step 6872/19560 | loss 3.436434 (-1.44z)| norm 0.2659 (+0.20z)| lr 6.18e-04 | 4163.58 ms | 32.4% bf16 MFU | 125636 tok/s step 6873/19560 | loss 3.463286 (-0.73z)| norm 0.2759 (+0.51z)| lr 6.18e-04 | 4173.71 ms | 32.3% bf16 MFU | 125635 tok/s step 6874/19560 | loss 3.476863 (-0.37z)| norm 0.2648 (+0.16z)| lr 6.18e-04 | 4162.95 ms | 32.4% bf16 MFU | 125651 tok/s step 6875/19560 | loss 3.482493 (-0.22z)| norm 0.2644 (+0.15z)| lr 6.18e-04 | 4170.21 ms | 32.4% bf16 MFU | 125654 tok/s step 6876/19560 | loss 3.468299 (-0.59z)| norm 0.2479 (-0.37z)| lr 
6.18e-04 | 4164.83 ms | 32.4% bf16 MFU | 125666 tok/s step 6877/19560 | loss 3.475115 (-0.41z)| norm 0.2673 (+0.24z)| lr 6.18e-04 | 4204.87 ms | 32.1% bf16 MFU | 125617 tok/s step 6878/19560 | loss 3.486336 (-0.12z)| norm 0.2751 (+0.48z)| lr 6.18e-04 | 4171.50 ms | 32.4% bf16 MFU | 125620 tok/s step 6879/19560 | loss 3.504579 (+0.37z)| norm 0.2457 (-0.44z)| lr 6.18e-04 | 4166.05 ms | 32.4% bf16 MFU | 125631 tok/s step 6880/19560 | loss 3.518534 (+0.73z)| norm 0.2702 (+0.32z)| lr 6.18e-04 | 4165.05 ms | 32.4% bf16 MFU | 125644 tok/s step 6881/19560 | loss 3.496683 (+0.15z)| norm 0.2705 (+0.34z)| lr 6.18e-04 | 4161.22 ms | 32.4% bf16 MFU | 125661 tok/s step 6882/19560 | loss 3.501645 (+0.27z)| norm 0.2707 (+0.33z)| lr 6.18e-04 | 4178.26 ms | 32.3% bf16 MFU | 125652 tok/s step 6883/19560 | loss 3.521437 (+0.79z)| norm 0.3117 (+1.60z)| lr 6.18e-04 | 4177.61 ms | 32.3% bf16 MFU | 125645 tok/s step 6884/19560 | loss 3.527401 (+0.93z)| norm 0.2553 (-0.17z)| lr 6.18e-04 | 4188.21 ms | 32.2% bf16 MFU | 125621 tok/s step 6885/19560 | loss 3.459916 (-0.86z)| norm 0.2592 (-0.05z)| lr 6.17e-04 | 4172.31 ms | 32.4% bf16 MFU | 125623 tok/s step 6886/19560 | loss 3.451653 (-1.06z)| norm 0.2713 (+0.33z)| lr 6.17e-04 | 4174.99 ms | 32.3% bf16 MFU | 125621 tok/s step 6887/19560 | loss 3.468143 (-0.62z)| norm 0.2748 (+0.43z)| lr 6.17e-04 | 4163.97 ms | 32.4% bf16 MFU | 125636 tok/s step 6888/19560 | loss 3.478714 (-0.34z)| norm 0.2934 (+1.00z)| lr 6.17e-04 | 4178.01 ms | 32.3% bf16 MFU | 125628 tok/s step 6889/19560 | loss 3.477303 (-0.39z)| norm 0.2498 (-0.36z)| lr 6.17e-04 | 4175.65 ms | 32.3% bf16 MFU | 125625 tok/s step 6890/19560 | loss 3.481128 (-0.29z)| norm 0.2685 (+0.22z)| lr 6.17e-04 | 4173.16 ms | 32.4% bf16 MFU | 125625 tok/s step 6891/19560 | loss 3.434757 (-1.52z)| norm 0.2516 (-0.30z)| lr 6.17e-04 | 4166.58 ms | 32.4% bf16 MFU | 125635 tok/s step 6892/19560 | loss 3.514027 (+0.58z)| norm 0.2844 (+0.72z)| lr 6.17e-04 | 4165.09 ms | 32.4% bf16 MFU | 125647 tok/s step 
6893/19560 | loss 3.479013 (-0.37z)| norm 0.2172 (-1.35z)| lr 6.17e-04 | 4176.88 ms | 32.3% bf16 MFU | 125641 tok/s step 6894/19560 | loss 3.454694 (-1.03z)| norm 0.2695 (+0.26z)| lr 6.17e-04 | 4162.95 ms | 32.4% bf16 MFU | 125656 tok/s step 6895/19560 | loss 3.477082 (-0.41z)| norm 0.2846 (+0.72z)| lr 6.17e-04 | 4180.55 ms | 32.3% bf16 MFU | 125644 tok/s step 6896/19560 | loss 3.511652 (+0.53z)| norm 0.2798 (+0.57z)| lr 6.17e-04 | 4176.61 ms | 32.3% bf16 MFU | 125638 tok/s step 6897/19560 | loss 3.534050 (+1.12z)| norm 0.2413 (-0.61z)| lr 6.17e-04 | 4171.11 ms | 32.4% bf16 MFU | 125641 tok/s step 6898/19560 | loss 3.481517 (-0.30z)| norm 0.2903 (+0.88z)| lr 6.17e-04 | 4169.11 ms | 32.4% bf16 MFU | 125647 tok/s step 6899/19560 | loss 3.501215 (+0.23z)| norm 0.2600 (-0.04z)| lr 6.17e-04 | 4171.26 ms | 32.4% bf16 MFU | 125649 tok/s step 6900/19560 | loss 3.514006 (+0.57z)| norm 0.2466 (-0.45z)| lr 6.17e-04 | 4174.77 ms | 32.3% bf16 MFU | 125646 tok/s step 6901/19560 | loss 3.511139 (+0.48z)| norm 0.2666 (+0.16z)| lr 6.17e-04 | 4176.58 ms | 32.3% bf16 MFU | 125640 tok/s step 6902/19560 | loss 3.494025 (+0.01z)| norm 0.2362 (-0.77z)| lr 6.17e-04 | 4174.52 ms | 32.3% bf16 MFU | 125638 tok/s step 6903/19560 | loss 3.493754 (-0.00z)| norm 0.2651 (+0.12z)| lr 6.16e-04 | 4169.43 ms | 32.4% bf16 MFU | 125643 tok/s step 6904/19560 | loss 3.531991 (+1.04z)| norm 0.2600 (-0.04z)| lr 6.16e-04 | 4177.67 ms | 32.3% bf16 MFU | 125636 tok/s step 6905/19560 | loss 3.505896 (+0.32z)| norm 0.2210 (-1.24z)| lr 6.16e-04 | 4169.43 ms | 32.4% bf16 MFU | 125641 tok/s step 6906/19560 | loss 3.539964 (+1.23z)| norm 0.2427 (-0.57z)| lr 6.16e-04 | 4172.13 ms | 32.4% bf16 MFU | 125642 tok/s step 6907/19560 | loss 3.490416 (-0.13z)| norm 0.2302 (-0.96z)| lr 6.16e-04 | 4175.09 ms | 32.3% bf16 MFU | 125639 tok/s step 6908/19560 | loss 3.442177 (-1.42z)| norm 0.2596 (-0.06z)| lr 6.16e-04 | 4163.00 ms | 32.4% bf16 MFU | 125654 tok/s step 6909/19560 | loss 3.516819 (+0.60z)| norm 0.2532 (-0.26z)| lr 
6.16e-04 | 4168.58 ms | 32.4% bf16 MFU | 125660 tok/s step 6910/19560 | loss 3.449906 (-1.20z)| norm 0.2252 (-1.10z)| lr 6.16e-04 | 4171.65 ms | 32.4% bf16 MFU | 125661 tok/s step 6911/19560 | loss 3.477035 (-0.46z)| norm 0.2344 (-0.81z)| lr 6.16e-04 | 4175.29 ms | 32.3% bf16 MFU | 125656 tok/s step 6912/19560 | loss 3.416679 (-2.03z)| norm 0.2466 (-0.44z)| lr 6.16e-04 | 4169.71 ms | 32.4% bf16 MFU | 125660 tok/s step 6913/19560 | loss 3.487708 (-0.15z)| norm 0.2434 (-0.54z)| lr 6.16e-04 | 4173.02 ms | 32.4% bf16 MFU | 125659 tok/s step 6914/19560 | loss 3.458277 (-0.93z)| norm 0.2305 (-0.93z)| lr 6.16e-04 | 4171.17 ms | 32.4% bf16 MFU | 125661 tok/s step 6915/19560 | loss 3.564096 (+1.84z)| norm 0.2525 (-0.25z)| lr 6.16e-04 | 4180.46 ms | 32.3% bf16 MFU | 125649 tok/s step 6916/19560 | loss 3.446014 (-1.24z)| norm 0.2331 (-0.84z)| lr 6.16e-04 | 4170.85 ms | 32.4% bf16 MFU | 125651 tok/s step 6917/19560 | loss 3.495028 (+0.04z)| norm 0.2477 (-0.40z)| lr 6.16e-04 | 4214.41 ms | 32.0% bf16 MFU | 125589 tok/s step 6918/19560 | loss 3.452168 (-1.07z)| norm 0.2356 (-0.76z)| lr 6.16e-04 | 4169.70 ms | 32.4% bf16 MFU | 125596 tok/s step 6919/19560 | loss 3.460008 (-0.85z)| norm 0.2473 (-0.41z)| lr 6.16e-04 | 4169.78 ms | 32.4% bf16 MFU | 125603 tok/s step 6920/19560 | loss 3.493697 (+0.03z)| norm 0.2402 (-0.62z)| lr 6.15e-04 | 4169.42 ms | 32.4% bf16 MFU | 125610 tok/s step 6921/19560 | loss 3.465023 (-0.72z)| norm 0.2384 (-0.67z)| lr 6.15e-04 | 4168.13 ms | 32.4% bf16 MFU | 125619 tok/s step 6922/19560 | loss 3.451193 (-1.09z)| norm 0.2532 (-0.22z)| lr 6.15e-04 | 4176.24 ms | 32.3% bf16 MFU | 125615 tok/s step 6923/19560 | loss 3.647884 (+3.86z)| norm 1.7864 (+10.95z)| lr 6.15e-04 | 4172.26 ms | 32.4% bf16 MFU | 125617 tok/s step 6924/19560 | loss 3.471317 (-0.58z)| norm 0.4005 (+0.92z)| lr 6.15e-04 | 4165.99 ms | 32.4% bf16 MFU | 125629 tok/s step 6925/19560 | loss 3.483064 (-0.29z)| norm 0.3618 (+0.63z)| lr 6.15e-04 | 4162.69 ms | 32.4% bf16 MFU | 125645 tok/s step 
6926/19560 | loss 3.486517 (-0.21z)| norm 0.3090 (+0.25z)| lr 6.15e-04 | 4170.62 ms | 32.4% bf16 MFU | 125648 tok/s step 6927/19560 | loss 3.493567 (-0.03z)| norm 0.3013 (+0.19z)| lr 6.15e-04 | 4196.19 ms | 32.2% bf16 MFU | 125613 tok/s step 6928/19560 | loss 3.547214 (+1.41z)| norm 0.2844 (+0.07z)| lr 6.15e-04 | 4177.68 ms | 32.3% bf16 MFU | 125607 tok/s step 6929/19560 | loss 3.470021 (-0.65z)| norm 0.2662 (-0.07z)| lr 6.15e-04 | 4168.31 ms | 32.4% bf16 MFU | 125616 tok/s step 6930/19560 | loss 3.484688 (-0.26z)| norm 0.2789 (+0.03z)| lr 6.15e-04 | 4263.43 ms | 31.7% bf16 MFU | 125484 tok/s step 6931/19560 | loss 3.492970 (-0.01z)| norm 0.2791 (+0.03z)| lr 6.15e-04 | 4162.60 ms | 32.4% bf16 MFU | 125507 tok/s step 6932/19560 | loss 3.623408 (+3.45z)| norm 0.2684 (-0.05z)| lr 6.15e-04 | 4174.54 ms | 32.3% bf16 MFU | 125511 tok/s step 6933/19560 | loss 3.482295 (-0.33z)| norm 0.2937 (+0.13z)| lr 6.15e-04 | 4200.07 ms | 32.1% bf16 MFU | 125477 tok/s step 6934/19560 | loss 3.465353 (-0.77z)| norm 0.2512 (-0.18z)| lr 6.15e-04 | 4216.03 ms | 32.0% bf16 MFU | 125421 tok/s step 6935/19560 | loss 3.487626 (-0.17z)| norm 0.2699 (-0.04z)| lr 6.15e-04 | 4172.89 ms | 32.4% bf16 MFU | 125432 tok/s step 6936/19560 | loss 3.561831 (+1.79z)| norm 0.2580 (-0.13z)| lr 6.15e-04 | 4172.80 ms | 32.4% bf16 MFU | 125443 tok/s step 6937/19560 | loss 3.446613 (-1.27z)| norm 0.2754 (-0.01z)| lr 6.15e-04 | 4169.03 ms | 32.4% bf16 MFU | 125459 tok/s step 6938/19560 | loss 3.518694 (+0.65z)| norm 0.2739 (-0.02z)| lr 6.14e-04 | 4162.96 ms | 32.4% bf16 MFU | 125483 tok/s step 6939/19560 | loss 3.480507 (-0.36z)| norm 0.2468 (-0.21z)| lr 6.14e-04 | 4162.82 ms | 32.4% bf16 MFU | 125506 tok/s step 6940/19560 | loss 3.459740 (-0.90z)| norm 0.2587 (-0.13z)| lr 6.14e-04 | 4221.22 ms | 32.0% bf16 MFU | 125441 tok/s step 6941/19560 | loss 3.508504 (+0.38z)| norm 0.2490 (-0.20z)| lr 6.14e-04 | 4213.25 ms | 32.0% bf16 MFU | 125391 tok/s step 6942/19560 | loss 3.492833 (-0.02z)| norm 0.2445 (-0.23z)| lr 
6.14e-04 | 4167.89 ms | 32.4% bf16 MFU | 125411 tok/s step 6943/19560 | loss 3.457628 (-0.96z)| norm 0.2728 (-0.03z)| lr 6.14e-04 | 4160.93 ms | 32.4% bf16 MFU | 125440 tok/s step 6944/19560 | loss 3.476520 (-0.45z)| norm 0.2507 (-0.18z)| lr 6.14e-04 | 4186.17 ms | 32.3% bf16 MFU | 125430 tok/s step 6945/19560 | loss 3.607794 (+2.95z)| norm 0.2437 (-0.23z)| lr 6.14e-04 | 4167.42 ms | 32.4% bf16 MFU | 125449 tok/s step 6946/19560 | loss 3.493779 (-0.01z)| norm 0.2604 (-0.11z)| lr 6.14e-04 | 4164.85 ms | 32.4% bf16 MFU | 125471 tok/s step 6947/19560 | loss 3.484280 (-0.25z)| norm 0.2416 (-0.25z)| lr 6.14e-04 | 4166.89 ms | 32.4% bf16 MFU | 125488 tok/s step 6948/19560 | loss 3.461586 (-0.83z)| norm 0.2476 (-0.20z)| lr 6.14e-04 | 4166.42 ms | 32.4% bf16 MFU | 125506 tok/s step 6949/19560 | loss 3.499817 (+0.15z)| norm 0.2699 (-0.05z)| lr 6.14e-04 | 4164.81 ms | 32.4% bf16 MFU | 125525 tok/s step 6950/19560 | loss 3.411338 (-2.11z)| norm 0.2323 (-0.32z)| lr 6.14e-04 | 4162.15 ms | 32.4% bf16 MFU | 125547 tok/s step 6951/19560 | loss 3.468128 (-0.63z)| norm 0.2606 (-0.11z)| lr 6.14e-04 | 4176.59 ms | 32.3% bf16 MFU | 125546 tok/s step 6952/19560 | loss 3.477231 (-0.38z)| norm 0.2492 (-0.19z)| lr 6.14e-04 | 4532.56 ms | 29.8% bf16 MFU | 125052 tok/s step 6953/19560 | loss 3.498045 (+0.18z)| norm 0.2576 (-0.13z)| lr 6.14e-04 | 6463.98 ms | 20.9% bf16 MFU | 122855 tok/s step 6954/19560 | loss 3.467868 (-0.62z)| norm 0.2319 (-0.32z)| lr 6.14e-04 | 4159.25 ms | 32.5% bf16 MFU | 123015 tok/s step 6955/19560 | loss 3.539190 (+1.25z)| norm 0.2442 (-0.23z)| lr 6.13e-04 | 4171.67 ms | 32.4% bf16 MFU | 123148 tok/s step 6956/19560 | loss 3.479433 (-0.34z)| norm 0.2475 (-0.20z)| lr 6.13e-04 | 4159.97 ms | 32.5% bf16 MFU | 123292 tok/s step 6957/19560 | loss 3.473180 (-0.50z)| norm 0.2509 (-0.18z)| lr 6.13e-04 | 4160.44 ms | 32.5% bf16 MFU | 123429 tok/s step 6958/19560 | loss 3.473819 (-0.49z)| norm 0.2552 (-0.15z)| lr 6.13e-04 | 4160.60 ms | 32.5% bf16 MFU | 123558 tok/s step 
6959/19560 | loss 3.452636 (-1.04z)| norm 0.2378 (-0.27z)| lr 6.13e-04 | 4149.59 ms | 32.5% bf16 MFU | 123697 tok/s step 6960/19560 | loss 3.538390 (+1.25z)| norm 0.2233 (-0.37z)| lr 6.13e-04 | 4157.57 ms | 32.5% bf16 MFU | 123818 tok/s step 6961/19560 | loss 3.472227 (-0.52z)| norm 0.2427 (-0.23z)| lr 6.13e-04 | 4159.92 ms | 32.5% bf16 MFU | 123928 tok/s step 6962/19560 | loss 3.514820 (+0.61z)| norm 0.2471 (-0.20z)| lr 6.13e-04 | 4184.53 ms | 32.3% bf16 MFU | 123997 tok/s step 6963/19560 | loss 3.519939 (+0.73z)| norm 0.2513 (-0.17z)| lr 6.13e-04 | 4160.36 ms | 32.5% bf16 MFU | 124098 tok/s step 6964/19560 | loss 3.510643 (+0.49z)| norm 0.2440 (-0.22z)| lr 6.13e-04 | 4165.90 ms | 32.4% bf16 MFU | 124186 tok/s step 6965/19560 | loss 3.476692 (-0.41z)| norm 0.2349 (-0.29z)| lr 6.13e-04 | 4165.83 ms | 32.4% bf16 MFU | 124269 tok/s step 6966/19560 | loss 3.571696 (+2.08z)| norm 0.2859 (+0.08z)| lr 6.13e-04 | 4159.86 ms | 32.5% bf16 MFU | 124357 tok/s step 6967/19560 | loss 3.466808 (-0.67z)| norm 0.3074 (+0.23z)| lr 6.13e-04 | 4161.70 ms | 32.4% bf16 MFU | 124438 tok/s step 6968/19560 | loss 3.474083 (-0.47z)| norm 0.2505 (-0.18z)| lr 6.13e-04 | 4223.47 ms | 32.0% bf16 MFU | 124423 tok/s step 6969/19560 | loss 3.525549 (+0.88z)| norm 0.2570 (-0.13z)| lr 6.13e-04 | 4182.99 ms | 32.3% bf16 MFU | 124469 tok/s step 6970/19560 | loss 3.429821 (-1.60z)| norm 0.2870 (+0.08z)| lr 6.13e-04 | 4160.10 ms | 32.5% bf16 MFU | 124547 tok/s step 6971/19560 | loss 3.483166 (-0.21z)| norm 0.2555 (-0.15z)| lr 6.13e-04 | 5544.96 ms | 24.3% bf16 MFU | 123047 tok/s step 6972/19560 | loss 3.451100 (-1.04z)| norm 0.2632 (-0.09z)| lr 6.13e-04 | 4637.67 ms | 29.1% bf16 MFU | 122547 tok/s step 6973/19560 | loss 3.493517 (+0.10z)| norm 0.2451 (-0.22z)| lr 6.12e-04 | 4497.36 ms | 30.0% bf16 MFU | 122249 tok/s step 6974/19560 | loss 3.530734 (+1.09z)| norm 0.2629 (-0.10z)| lr 6.12e-04 | 4427.36 ms | 30.5% bf16 MFU | 122057 tok/s step 6975/19560 | loss 3.508214 (+0.51z)| norm 0.2964 (+0.14z)| lr 
6.12e-04 | 4371.90 ms | 30.9% bf16 MFU | 121951 tok/s step 6976/19560 | loss 3.460965 (-0.78z)| norm 0.2791 (+0.02z)| lr 6.12e-04 | 4222.65 ms | 32.0% bf16 MFU | 122061 tok/s step 6977/19560 | loss 3.497387 (+0.20z)| norm 0.2472 (-0.21z)| lr 6.12e-04 | 4307.50 ms | 31.3% bf16 MFU | 122044 tok/s step 6978/19560 | loss 3.454241 (-0.98z)| norm 0.2673 (-0.07z)| lr 6.12e-04 | 4226.40 ms | 31.9% bf16 MFU | 122144 tok/s step 6979/19560 | loss 3.447025 (-1.16z)| norm 0.2846 (+0.05z)| lr 6.12e-04 | 4173.08 ms | 32.4% bf16 MFU | 122319 tok/s step 6980/19560 | loss 3.506249 (+0.45z)| norm 0.2455 (-0.23z)| lr 6.12e-04 | 4282.32 ms | 31.5% bf16 MFU | 122324 tok/s step 6981/19560 | loss 3.451698 (-1.05z)| norm 0.2589 (-0.13z)| lr 6.12e-04 | 4169.33 ms | 32.4% bf16 MFU | 122496 tok/s step 6982/19560 | loss 3.516805 (+0.73z)| norm 0.2952 (+0.13z)| lr 6.12e-04 | 4174.76 ms | 32.3% bf16 MFU | 122650 tok/s step 6983/19560 | loss 3.480590 (-0.27z)| norm 0.2865 (+0.06z)| lr 6.12e-04 | 4363.14 ms | 30.9% bf16 MFU | 122526 tok/s step 6984/19560 | loss 3.505660 (+0.43z)| norm 0.2707 (-0.05z)| lr 6.12e-04 | 4174.63 ms | 32.3% bf16 MFU | 122679 tok/s step 6985/19560 | loss 3.526089 (+0.98z)| norm 0.2792 (+0.01z)| lr 6.12e-04 | 4213.71 ms | 32.0% bf16 MFU | 122766 tok/s step 6986/19560 | loss 3.529955 (+1.07z)| norm 0.2908 (+0.09z)| lr 6.12e-04 | 4164.85 ms | 32.4% bf16 MFU | 122922 tok/s step 6987/19560 | loss 3.479543 (-0.31z)| norm 0.2574 (-0.15z)| lr 6.12e-04 | 4194.22 ms | 32.2% bf16 MFU | 123026 tok/s step 6988/19560 | loss 3.535446 (+1.21z)| norm 0.2506 (-0.19z)| lr 6.12e-04 | 4197.03 ms | 32.2% bf16 MFU | 123121 tok/s step 6989/19560 | loss 3.490835 (-0.01z)| norm 0.2697 (-0.06z)| lr 6.12e-04 | 4167.12 ms | 32.4% bf16 MFU | 123255 tok/s step 6990/19560 | loss 3.454261 (-0.99z)| norm 0.2469 (-0.22z)| lr 6.11e-04 | 4220.07 ms | 32.0% bf16 MFU | 123304 tok/s step 6991/19560 | loss 3.443422 (-1.27z)| norm 0.2415 (-0.26z)| lr 6.11e-04 | 4172.73 ms | 32.4% bf16 MFU | 123422 tok/s step 
6992/19560 | loss 3.470545 (-0.54z)| norm 0.2591 (-0.13z)| lr 6.11e-04 | 4170.94 ms | 32.4% bf16 MFU | 123536 tok/s step 6993/19560 | loss 3.485317 (-0.13z)| norm 0.2378 (-0.28z)| lr 6.11e-04 | 4174.65 ms | 32.3% bf16 MFU | 123638 tok/s step 6994/19560 | loss 3.512200 (+0.61z)| norm 0.2276 (-0.35z)| lr 6.11e-04 | 4173.92 ms | 32.3% bf16 MFU | 123737 tok/s step 6995/19560 | loss 3.436133 (-1.46z)| norm 0.2538 (-0.16z)| lr 6.11e-04 | 4265.11 ms | 31.7% bf16 MFU | 123696 tok/s step 6996/19560 | loss 3.432334 (-1.54z)| norm 0.2553 (-0.13z)| lr 6.11e-04 | 4160.05 ms | 32.5% bf16 MFU | 123813 tok/s step 6997/19560 | loss 3.409531 (-2.09z)| norm 0.2330 (-0.29z)| lr 6.11e-04 | 4242.50 ms | 31.8% bf16 MFU | 123801 tok/s step 6998/19560 | loss 3.479357 (-0.24z)| norm 0.2526 (-0.15z)| lr 6.11e-04 | 4227.06 ms | 31.9% bf16 MFU | 123813 tok/s step 6999/19560 | loss 3.433707 (-1.43z)| norm 0.2507 (-0.16z)| lr 6.11e-04 | 4173.66 ms | 32.3% bf16 MFU | 123903 tok/s step 7000/19560 | loss 3.503062 (+0.39z)| norm 0.2364 (-0.26z)| lr 6.11e-04 | 4213.65 ms | 32.0% bf16 MFU | 123929 tok/s val loss 3.464718 
HellaSwag: 2895/10042 = 0.288289 step 7001/19560 | loss 3.468283 (-0.54z)| norm 0.2443 (-0.20z)| lr 6.11e-04 | 4509.43 ms | 29.9% bf16 MFU | 123546 tok/s step 7002/19560 | loss 3.511147 (+0.59z)| norm 0.2526 (-0.14z)| lr 6.11e-04 | 4363.44 ms | 30.9% bf16 MFU | 123376 tok/s step 7003/19560 | loss 3.550863 (+1.61z)| norm 0.2492 (-0.17z)| lr 6.11e-04 | 4171.12 ms | 32.4% bf16 MFU | 123492 tok/s step 7004/19560 | loss 3.435125 (-1.40z)| norm 0.2605 (-0.08z)| lr 6.11e-04 | 4288.04 ms | 31.5% bf16 MFU | 123431 tok/s step 7005/19560 | loss 3.443075 (-1.18z)| norm 0.2687 (-0.02z)| lr 6.11e-04 | 4318.90 ms | 31.3% bf16 MFU | 123329 tok/s step 7006/19560 | loss 3.520824 (+0.82z)| norm 0.3181 (+0.33z)| lr 6.11e-04 | 4242.00 ms | 31.8% bf16 MFU | 123342 tok/s step 7007/19560 | loss 3.493196 (+0.11z)| norm 0.2728 (+0.00z)| lr 6.10e-04 | 4159.09 ms | 32.5% bf16 MFU | 123478 tok/s step 7008/19560 | loss 3.564909 (+1.93z)| norm 0.2575 (-0.11z)| lr 6.10e-04 | 
4338.93 ms | 31.1% bf16 MFU | 123346 tok/s step 7009/19560 | loss 3.461419 (-0.70z)| norm 0.2537 (-0.14z)| lr 6.10e-04 | 4372.03 ms | 30.9% bf16 MFU | 123175 tok/s step 7010/19560 | loss 3.721382 (+5.21z)| norm 0.2631 (-0.07z)| lr 6.10e-04 | 4220.99 ms | 32.0% bf16 MFU | 123226 tok/s step 7011/19560 | loss 3.485368 (-0.12z)| norm 0.2789 (+0.05z)| lr 6.10e-04 | 4166.08 ms | 32.4% bf16 MFU | 123357 tok/s step 7012/19560 | loss 3.483546 (-0.15z)| norm 0.2650 (-0.05z)| lr 6.10e-04 | 4349.97 ms | 31.0% bf16 MFU | 123216 tok/s step 7013/19560 | loss 3.495283 (+0.11z)| norm 0.2603 (-0.09z)| lr 6.10e-04 | 4218.66 ms | 32.0% bf16 MFU | 123269 tok/s step 7014/19560 | loss 3.515275 (+0.56z)| norm 0.2806 (+0.06z)| lr 6.10e-04 | 4178.81 ms | 32.3% bf16 MFU | 123379 tok/s step 7015/19560 | loss 3.467860 (-0.53z)| norm 0.2561 (-0.12z)| lr 6.10e-04 | 4153.48 ms | 32.5% bf16 MFU | 123521 tok/s step 7016/19560 | loss 3.511614 (+0.47z)| norm 0.2602 (-0.08z)| lr 6.10e-04 | 4182.15 ms | 32.3% bf16 MFU | 123613 tok/s step 7017/19560 | loss 3.570240 (+1.76z)| norm 0.2482 (-0.17z)| lr 6.10e-04 | 4181.04 ms | 32.3% bf16 MFU | 123702 tok/s step 7018/19560 | loss 3.534938 (+0.96z)| norm 0.2858 (+0.10z)| lr 6.10e-04 | 4224.41 ms | 32.0% bf16 MFU | 123723 tok/s step 7019/19560 | loss 3.514761 (+0.49z)| norm 0.2590 (-0.10z)| lr 6.10e-04 | 4174.22 ms | 32.3% bf16 MFU | 123817 tok/s step 7020/19560 | loss 3.487267 (-0.12z)| norm 0.2632 (-0.06z)| lr 6.10e-04 | 4182.13 ms | 32.3% bf16 MFU | 123894 tok/s step 7021/19560 | loss 3.477473 (-0.34z)| norm 0.2551 (-0.12z)| lr 6.10e-04 | 4171.35 ms | 32.4% bf16 MFU | 123984 tok/s step 7022/19560 | loss 3.458079 (-0.78z)| norm 0.2615 (-0.08z)| lr 6.10e-04 | 4202.53 ms | 32.1% bf16 MFU | 124022 tok/s step 7023/19560 | loss 3.427532 (-1.45z)| norm 0.2451 (-0.20z)| lr 6.10e-04 | 4182.54 ms | 32.3% bf16 MFU | 124089 tok/s step 7024/19560 | loss 3.473555 (-0.41z)| norm 0.2353 (-0.26z)| lr 6.10e-04 | 4197.72 ms | 32.2% bf16 MFU | 124129 tok/s step 7025/19560 | 
loss 3.506284 (+0.32z)| norm 0.2912 (+0.14z)| lr 6.09e-04 | 4172.85 ms | 32.4% bf16 MFU | 124205 tok/s step 7026/19560 | loss 3.554659 (+1.39z)| norm 0.3171 (+0.33z)| lr 6.09e-04 | 4178.81 ms | 32.3% bf16 MFU | 124268 tok/s step 7027/19560 | loss 3.525145 (+0.72z)| norm 0.2638 (-0.06z)| lr 6.09e-04 | 4180.20 ms | 32.3% bf16 MFU | 124326 tok/s step 7028/19560 | loss 3.488569 (-0.09z)| norm 0.2336 (-0.28z)| lr 6.09e-04 | 4179.01 ms | 32.3% bf16 MFU | 124382 tok/s step 7029/19560 | loss 3.468201 (-0.53z)| norm 0.2344 (-0.27z)| lr 6.09e-04 | 4211.60 ms | 32.1% bf16 MFU | 124387 tok/s step 7030/19560 | loss 3.477235 (-0.33z)| norm 0.2626 (-0.07z)| lr 6.09e-04 | 4160.29 ms | 32.5% bf16 MFU | 124469 tok/s step 7031/19560 | loss 3.498843 (+0.15z)| norm 0.2461 (-0.19z)| lr 6.09e-04 | 4172.83 ms | 32.4% bf16 MFU | 124528 tok/s step 7032/19560 | loss 3.537679 (+1.02z)| norm 0.5749 (+2.16z)| lr 6.09e-04 | 4173.47 ms | 32.4% bf16 MFU | 124583 tok/s step 7033/19560 | loss 3.485267 (-0.15z)| norm 0.2909 (+0.12z)| lr 6.09e-04 | 4174.76 ms | 32.3% bf16 MFU | 124633 tok/s step 7034/19560 | loss 3.513924 (+0.50z)| norm 0.2782 (+0.02z)| lr 6.09e-04 | 4165.18 ms | 32.4% bf16 MFU | 124695 tok/s step 7035/19560 | loss 3.467372 (-0.54z)| norm 0.2648 (-0.08z)| lr 6.09e-04 | 4171.34 ms | 32.4% bf16 MFU | 124744 tok/s step 7036/19560 | loss 3.474571 (-0.38z)| norm 0.2801 (+0.03z)| lr 6.09e-04 | 4179.55 ms | 32.3% bf16 MFU | 124779 tok/s step 7037/19560 | loss 3.484687 (-0.15z)| norm 0.2813 (+0.04z)| lr 6.09e-04 | 4165.26 ms | 32.4% bf16 MFU | 124834 tok/s step 7038/19560 | loss 3.439269 (-1.17z)| norm 0.2574 (-0.13z)| lr 6.09e-04 | 4171.26 ms | 32.4% bf16 MFU | 124877 tok/s step 7039/19560 | loss 3.477942 (-0.30z)| norm 0.2540 (-0.16z)| lr 6.09e-04 | 4181.96 ms | 32.3% bf16 MFU | 124901 tok/s step 7040/19560 | loss 3.500286 (+0.19z)| norm 0.2576 (-0.13z)| lr 6.09e-04 | 4167.73 ms | 32.4% bf16 MFU | 124946 tok/s step 7041/19560 | loss 3.667631 (+3.72z)| norm 0.2733 (-0.02z)| lr 6.09e-04 | 
4171.65 ms | 32.4% bf16 MFU | 124983 tok/s step 7042/19560 | loss 3.461936 (-0.67z)| norm 0.2788 (+0.01z)| lr 6.08e-04 | 4162.98 ms | 32.4% bf16 MFU | 125031 tok/s step 7043/19560 | loss 3.453344 (-0.84z)| norm 0.2546 (-0.16z)| lr 6.08e-04 | 4175.47 ms | 32.3% bf16 MFU | 125057 tok/s step 7044/19560 | loss 3.481267 (-0.25z)| norm 0.2755 (-0.01z)| lr 6.08e-04 | 4189.60 ms | 32.2% bf16 MFU | 125061 tok/s step 7045/19560 | loss 3.485795 (-0.15z)| norm 0.2410 (-0.26z)| lr 6.08e-04 | 4323.36 ms | 31.2% bf16 MFU | 124872 tok/s step 7046/19560 | loss 3.452260 (-0.88z)| norm 0.2532 (-0.17z)| lr 6.08e-04 | 4179.72 ms | 32.3% bf16 MFU | 124900 tok/s step 7047/19560 | loss 3.519352 (+0.56z)| norm 0.2687 (-0.06z)| lr 6.08e-04 | 4169.46 ms | 32.4% bf16 MFU | 124942 tok/s step 7048/19560 | loss 3.479270 (-0.30z)| norm 0.2500 (-0.20z)| lr 6.08e-04 | 4168.60 ms | 32.4% bf16 MFU | 124984 tok/s step 7049/19560 | loss 3.475496 (-0.38z)| norm 0.2574 (-0.15z)| lr 6.08e-04 | 4173.99 ms | 32.3% bf16 MFU | 125015 tok/s step 7050/19560 | loss 3.476477 (-0.37z)| norm 0.2437 (-0.24z)| lr 6.08e-04 | 4168.59 ms | 32.4% bf16 MFU | 125053 tok/s step 7051/19560 | loss 3.442316 (-1.12z)| norm 0.2550 (-0.29z)| lr 6.08e-04 | 4195.29 ms | 32.2% bf16 MFU | 125049 tok/s step 7052/19560 | loss 3.485661 (-0.14z)| norm 0.2620 (-0.07z)| lr 6.08e-04 | 4184.07 ms | 32.3% bf16 MFU | 125061 tok/s step 7053/19560 | loss 3.427082 (-1.44z)| norm 0.3093 (+1.35z)| lr 6.08e-04 | 4252.38 ms | 31.8% bf16 MFU | 124973 tok/s step 7054/19560 | loss 3.447057 (-0.99z)| norm 0.2486 (-0.45z)| lr 6.08e-04 | 4163.32 ms | 32.4% bf16 MFU | 125021 tok/s step 7055/19560 | loss 3.524748 (+0.74z)| norm 0.2652 (+0.06z)| lr 6.08e-04 | 4168.54 ms | 32.4% bf16 MFU | 125058 tok/s step 7056/19560 | loss 3.443595 (-1.05z)| norm 0.2949 (+0.95z)| lr 6.08e-04 | 4163.92 ms | 32.4% bf16 MFU | 125101 tok/s step 7057/19560 | loss 3.481104 (-0.22z)| norm 0.2479 (-0.46z)| lr 6.08e-04 | 4175.41 ms | 32.3% bf16 MFU | 125124 tok/s step 7058/19560 | 
loss 3.517105 (+0.58z)| norm 0.2443 (-0.56z)| lr 6.08e-04 | 4155.83 ms | 32.5% bf16 MFU | 125176 tok/s step 7059/19560 | loss 3.482597 (-0.19z)| norm 0.3012 (+1.14z)| lr 6.07e-04 | 4175.68 ms | 32.3% bf16 MFU | 125195 tok/s step 7060/19560 | loss 3.423663 (-1.50z)| norm 0.2341 (-0.86z)| lr 6.07e-04 | 4189.62 ms | 32.2% bf16 MFU | 125192 tok/s step 7061/19560 | loss 3.451037 (-0.87z)| norm 0.2854 (+0.67z)| lr 6.07e-04 | 4208.28 ms | 32.1% bf16 MFU | 125162 tok/s step 7062/19560 | loss 3.506802 (+0.40z)| norm 0.2568 (-0.18z)| lr 6.07e-04 | 4186.60 ms | 32.2% bf16 MFU | 125165 tok/s step 7063/19560 | loss 3.491186 (+0.04z)| norm 0.2511 (-0.35z)| lr 6.07e-04 | 4174.10 ms | 32.3% bf16 MFU | 125187 tok/s step 7064/19560 | loss 3.543504 (+1.24z)| norm 0.2819 (+0.57z)| lr 6.07e-04 | 4177.69 ms | 32.3% bf16 MFU | 125203 tok/s step 7065/19560 | loss 3.467085 (-0.52z)| norm 0.2540 (-0.26z)| lr 6.07e-04 | 4182.60 ms | 32.3% bf16 MFU | 125210 tok/s step 7066/19560 | loss 3.620764 (+2.91z)| norm 0.2560 (-0.20z)| lr 6.07e-04 | 4176.24 ms | 32.3% bf16 MFU | 125227 tok/s step 7067/19560 | loss 3.451091 (-0.87z)| norm 0.2754 (+0.38z)| lr 6.07e-04 | 4170.95 ms | 32.4% bf16 MFU | 125250 tok/s step 7068/19560 | loss 3.457919 (-0.71z)| norm 0.2597 (-0.09z)| lr 6.07e-04 | 4162.54 ms | 32.4% bf16 MFU | 125286 tok/s step 7069/19560 | loss 3.489701 (-0.00z)| norm 0.2289 (-1.01z)| lr 6.07e-04 | 4170.75 ms | 32.4% bf16 MFU | 125307 tok/s step 7070/19560 | loss 3.487788 (-0.05z)| norm 0.2813 (+0.55z)| lr 6.07e-04 | 4175.74 ms | 32.3% bf16 MFU | 125319 tok/s step 7071/19560 | loss 3.534467 (+0.98z)| norm 0.2536 (-0.27z)| lr 6.07e-04 | 4176.59 ms | 32.3% bf16 MFU | 125330 tok/s step 7072/19560 | loss 3.537851 (+1.04z)| norm 0.2487 (-0.42z)| lr 6.07e-04 | 4176.52 ms | 32.3% bf16 MFU | 125340 tok/s step 7073/19560 | loss 3.480803 (-0.21z)| norm 0.2602 (-0.08z)| lr 6.07e-04 | 4167.34 ms | 32.4% bf16 MFU | 125363 tok/s step 7074/19560 | loss 3.500279 (+0.23z)| norm 0.2181 (-1.32z)| lr 6.07e-04 | 
4166.60 ms | 32.4% bf16 MFU | 125387 tok/s step 7075/19560 | loss 3.494410 (+0.10z)| norm 0.2540 (-0.26z)| lr 6.07e-04 | 4161.02 ms | 32.4% bf16 MFU | 125417 tok/s step 7076/19560 | loss 3.481007 (-0.21z)| norm 0.2421 (-0.61z)| lr 6.07e-04 | 4164.59 ms | 32.4% bf16 MFU | 125441 tok/s step 7077/19560 | loss 3.569604 (+1.77z)| norm 0.2525 (-0.30z)| lr 6.06e-04 | 4183.27 ms | 32.3% bf16 MFU | 125435 tok/s step 7078/19560 | loss 3.472024 (-0.44z)| norm 0.2431 (-0.58z)| lr 6.06e-04 | 4181.11 ms | 32.3% bf16 MFU | 125433 tok/s step 7079/19560 | loss 3.595060 (+2.29z)| norm 0.2523 (-0.30z)| lr 6.06e-04 | 4164.97 ms | 32.4% bf16 MFU | 125456 tok/s step 7080/19560 | loss 3.434521 (-1.27z)| norm 0.2837 (+0.62z)| lr 6.06e-04 | 4185.76 ms | 32.3% bf16 MFU | 125446 tok/s step 7081/19560 | loss 3.427435 (-1.40z)| norm 0.2735 (+0.31z)| lr 6.06e-04 | 4169.62 ms | 32.4% bf16 MFU | 125460 tok/s step 7082/19560 | loss 3.496183 (+0.10z)| norm 0.2719 (+0.26z)| lr 6.06e-04 | 4176.32 ms | 32.3% bf16 MFU | 125464 tok/s step 7083/19560 | loss 3.503464 (+0.27z)| norm 0.2983 (+1.03z)| lr 6.06e-04 | 4160.32 ms | 32.5% bf16 MFU | 125492 tok/s step 7084/19560 | loss 3.456788 (-0.76z)| norm 0.3119 (+1.41z)| lr 6.06e-04 | 4176.91 ms | 32.3% bf16 MFU | 125494 tok/s step 7085/19560 | loss 3.467735 (-0.51z)| norm 0.2949 (+0.89z)| lr 6.06e-04 | 4171.57 ms | 32.4% bf16 MFU | 125503 tok/s step 7086/19560 | loss 3.532009 (+0.89z)| norm 0.2894 (+0.72z)| lr 6.06e-04 | 4185.36 ms | 32.3% bf16 MFU | 125491 tok/s step 7087/19560 | loss 3.421417 (-1.53z)| norm 0.2731 (+0.24z)| lr 6.06e-04 | 4166.82 ms | 32.4% bf16 MFU | 125508 tok/s step 7088/19560 | loss 3.482014 (-0.19z)| norm 0.2776 (+0.36z)| lr 6.06e-04 | 4181.28 ms | 32.3% bf16 MFU | 125502 tok/s step 7089/19560 | loss 3.428490 (-1.35z)| norm 0.2546 (-0.32z)| lr 6.06e-04 | 4181.37 ms | 32.3% bf16 MFU | 125496 tok/s step 7090/19560 | loss 3.454688 (-0.77z)| norm 0.2465 (-0.56z)| lr 6.06e-04 | 4169.42 ms | 32.4% bf16 MFU | 125509 tok/s step 7091/19560 | 
loss 3.509399 (+0.42z)| norm 0.2684 (+0.08z)| lr 6.06e-04 | 4167.58 ms | 32.4% bf16 MFU | 125523 tok/s step 7092/19560 | loss 3.482203 (-0.16z)| norm 0.2607 (-0.15z)| lr 6.06e-04 | 4172.60 ms | 32.4% bf16 MFU | 125530 tok/s step 7093/19560 | loss 3.502265 (+0.27z)| norm 0.2426 (-0.69z)| lr 6.06e-04 | 4182.45 ms | 32.3% bf16 MFU | 125521 tok/s step 7094/19560 | loss 3.467384 (-0.48z)| norm 0.2555 (-0.30z)| lr 6.05e-04 | 4172.92 ms | 32.4% bf16 MFU | 125527 tok/s step 7095/19560 | loss 3.450395 (-0.85z)| norm 0.2502 (-0.45z)| lr 6.05e-04 | 4177.06 ms | 32.3% bf16 MFU | 125526 tok/s step 7096/19560 | loss 3.483062 (-0.13z)| norm 0.2350 (-0.90z)| lr 6.05e-04 | 4215.43 ms | 32.0% bf16 MFU | 125469 tok/s step 7097/19560 | loss 3.455864 (-0.72z)| norm 0.2576 (-0.22z)| lr 6.05e-04 | 4168.77 ms | 32.4% bf16 MFU | 125483 tok/s step 7098/19560 | loss 3.560592 (+1.56z)| norm 0.2301 (-1.03z)| lr 6.05e-04 | 4167.70 ms | 32.4% bf16 MFU | 125499 tok/s step 7099/19560 | loss 3.486553 (-0.07z)| norm 0.2694 (+0.14z)| lr 6.05e-04 | 4249.87 ms | 31.8% bf16 MFU | 125393 tok/s step 7100/19560 | loss 3.476530 (-0.29z)| norm 0.2780 (+0.39z)| lr 6.05e-04 | 4180.19 ms | 32.3% bf16 MFU | 125394 tok/s step 7101/19560 | loss 3.480075 (-0.21z)| norm 0.2869 (+0.64z)| lr 6.05e-04 | 4195.66 ms | 32.2% bf16 MFU | 125372 tok/s step 7102/19560 | loss 3.438593 (-1.11z)| norm 0.4067 (+3.91z)| lr 6.05e-04 | 4203.39 ms | 32.1% bf16 MFU | 125340 tok/s step 7103/19560 | loss 3.521346 (+0.71z)| norm 0.2882 (+0.61z)| lr 6.05e-04 | 4212.68 ms | 32.1% bf16 MFU | 125296 tok/s step 7104/19560 | loss 3.477650 (-0.25z)| norm 0.2739 (+0.22z)| lr 6.05e-04 | 4163.92 ms | 32.4% bf16 MFU | 125327 tok/s step 7105/19560 | loss 3.491845 (+0.06z)| norm 0.2508 (-0.43z)| lr 6.05e-04 | 4170.63 ms | 32.4% bf16 MFU | 125346 tok/s step 7106/19560 | loss 3.516638 (+0.60z)| norm 0.2388 (-0.76z)| lr 6.05e-04 | 4365.39 ms | 30.9% bf16 MFU | 125084 tok/s step 7107/19560 | loss 3.531459 (+0.91z)| norm 0.2854 (+0.54z)| lr 6.05e-04 | 
4175.80 ms | 32.3% bf16 MFU | 125107 tok/s step 7108/19560 | loss 3.540405 (+1.09z)| norm 0.2704 (+0.12z)| lr 6.05e-04 | 4206.43 ms | 32.1% bf16 MFU | 125084 tok/s step 7109/19560 | loss 3.581712 (+1.95z)| norm 0.2880 (+0.60z)| lr 6.05e-04 | 4160.76 ms | 32.5% bf16 MFU | 125130 tok/s step 7110/19560 | loss 3.475859 (-0.33z)| norm 0.2653 (-0.02z)| lr 6.05e-04 | 4179.87 ms | 32.3% bf16 MFU | 125145 tok/s step 7111/19560 | loss 3.548122 (+1.21z)| norm 0.2319 (-0.94z)| lr 6.04e-04 | 4171.30 ms | 32.4% bf16 MFU | 125172 tok/s step 7112/19560 | loss 3.469813 (-0.47z)| norm 0.2493 (-0.46z)| lr 6.04e-04 | 4174.82 ms | 32.3% bf16 MFU | 125193 tok/s step 7113/19560 | loss 3.509183 (+0.38z)| norm 0.2612 (-0.12z)| lr 6.04e-04 | 4166.51 ms | 32.4% bf16 MFU | 125225 tok/s step 7114/19560 | loss 3.474622 (-0.35z)| norm 0.2536 (-0.32z)| lr 6.04e-04 | 4172.94 ms | 32.4% bf16 MFU | 125246 tok/s step 7115/19560 | loss 3.468486 (-0.48z)| norm 0.2784 (+0.36z)| lr 6.04e-04 | 4170.43 ms | 32.4% bf16 MFU | 125269 tok/s step 7116/19560 | loss 3.477324 (-0.28z)| norm 0.2558 (-0.27z)| lr 6.04e-04 | 4163.21 ms | 32.4% bf16 MFU | 125302 tok/s step 7117/19560 | loss 3.513346 (+0.49z)| norm 0.2352 (-0.83z)| lr 6.04e-04 | 4167.56 ms | 32.4% bf16 MFU | 125327 tok/s step 7118/19560 | loss 3.448102 (-0.92z)| norm 0.2331 (-0.89z)| lr 6.04e-04 | 4181.32 ms | 32.3% bf16 MFU | 125330 tok/s step 7119/19560 | loss 3.470913 (-0.43z)| norm 0.2378 (-0.75z)| lr 6.04e-04 | 4178.80 ms | 32.3% bf16 MFU | 125337 tok/s step 7120/19560 | loss 3.492800 (+0.04z)| norm 0.2390 (-0.71z)| lr 6.04e-04 | 4161.79 ms | 32.4% bf16 MFU | 125369 tok/s step 7121/19560 | loss 3.469599 (-0.46z)| norm 0.2488 (-0.45z)| lr 6.04e-04 | 4162.97 ms | 32.4% bf16 MFU | 125398 tok/s step 7122/19560 | loss 3.459600 (-0.67z)| norm 0.2589 (-0.17z)| lr 6.04e-04 | 4264.02 ms | 31.7% bf16 MFU | 125276 tok/s step 7123/19560 | loss 3.417012 (-1.58z)| norm 0.2531 (-0.34z)| lr 6.04e-04 | 4161.40 ms | 32.4% bf16 MFU | 125311 tok/s step 7124/19560 | 
loss 3.493786 (+0.07z)| norm 0.2122 (-1.45z)| lr 6.04e-04 | 4166.66 ms | 32.4% bf16 MFU | 125337 tok/s step 7125/19560 | loss 3.478306 (-0.28z)| norm 0.2461 (-0.52z)| lr 6.04e-04 | 4259.88 ms | 31.7% bf16 MFU | 125224 tok/s step 7126/19560 | loss 3.458946 (-0.70z)| norm 0.2544 (-0.29z)| lr 6.04e-04 | 4170.27 ms | 32.4% bf16 MFU | 125249 tok/s step 7127/19560 | loss 3.464583 (-0.59z)| norm 0.2286 (-1.00z)| lr 6.04e-04 | 4181.30 ms | 32.3% bf16 MFU | 125256 tok/s step 7128/19560 | loss 3.504528 (+0.29z)| norm 0.2404 (-0.67z)| lr 6.03e-04 | 4185.23 ms | 32.3% bf16 MFU | 125257 tok/s step 7129/19560 | loss 3.585652 (+2.03z)| norm 0.2597 (-0.14z)| lr 6.03e-04 | 4164.40 ms | 32.4% bf16 MFU | 125289 tok/s step 7130/19560 | loss 3.458202 (-0.73z)| norm 0.2812 (+0.44z)| lr 6.03e-04 | 4199.13 ms | 32.2% bf16 MFU | 125267 tok/s step 7131/19560 | loss 3.464527 (-0.58z)| norm 0.2627 (-0.07z)| lr 6.03e-04 | 4175.32 ms | 32.3% bf16 MFU | 125282 tok/s step 7132/19560 | loss 3.534887 (+0.94z)| norm 0.2696 (+0.12z)| lr 6.03e-04 | 4160.94 ms | 32.4% bf16 MFU | 125318 tok/s step 7133/19560 | loss 3.567478 (+1.62z)| norm 0.2639 (-0.04z)| lr 6.03e-04 | 4164.41 ms | 32.4% bf16 MFU | 125347 tok/s step 7134/19560 | loss 3.526723 (+0.73z)| norm 0.2627 (-0.06z)| lr 6.03e-04 | 4247.90 ms | 31.8% bf16 MFU | 125251 tok/s step 7135/19560 | loss 3.483912 (-0.19z)| norm 0.2572 (-0.21z)| lr 6.03e-04 | 4190.48 ms | 32.2% bf16 MFU | 125244 tok/s step 7136/19560 | loss 3.477062 (-0.33z)| norm 0.2685 (+0.10z)| lr 6.03e-04 | 4173.68 ms | 32.3% bf16 MFU | 125263 tok/s step 7137/19560 | loss 3.534886 (+0.92z)| norm 0.2750 (+0.28z)| lr 6.03e-04 | 4195.82 ms | 32.2% bf16 MFU | 125247 tok/s step 7138/19560 | loss 3.486628 (-0.10z)| norm 0.2779 (+0.36z)| lr 6.03e-04 | 4166.23 ms | 32.4% bf16 MFU | 125277 tok/s step 7139/19560 | loss 3.496556 (+0.14z)| norm 0.2701 (+0.14z)| lr 6.03e-04 | 4181.03 ms | 32.3% bf16 MFU | 125283 tok/s step 7140/19560 | loss 3.576510 (+2.04z)| norm 0.2751 (+0.28z)| lr 6.03e-04 | 
4177.60 ms | 32.3% bf16 MFU | 125294 tok/s step 7141/19560 | loss 3.465280 (-0.63z)| norm 0.2619 (-0.09z)| lr 6.03e-04 | 4167.55 ms | 32.4% bf16 MFU | 125319 tok/s step 7142/19560 | loss 3.497261 (+0.14z)| norm 0.2696 (+0.13z)| lr 6.03e-04 | 4169.21 ms | 32.4% bf16 MFU | 125341 tok/s step 7143/19560 | loss 3.490685 (-0.02z)| norm 0.3057 (+1.12z)| lr 6.03e-04 | 4189.07 ms | 32.2% bf16 MFU | 125332 tok/s step 7144/19560 | loss 3.562365 (+1.68z)| norm 0.2554 (-0.28z)| lr 6.03e-04 | 4176.98 ms | 32.3% bf16 MFU | 125341 tok/s step 7145/19560 | loss 3.443650 (-1.13z)| norm 0.2707 (+0.14z)| lr 6.02e-04 | 4183.83 ms | 32.3% bf16 MFU | 125340 tok/s step 7146/19560 | loss 3.460868 (-0.71z)| norm 0.2718 (+0.18z)| lr 6.02e-04 | 4178.75 ms | 32.3% bf16 MFU | 125346 tok/s step 7147/19560 | loss 3.549899 (+1.42z)| norm 0.2410 (-0.67z)| lr 6.02e-04 | 4173.79 ms | 32.3% bf16 MFU | 125359 tok/s step 7148/19560 | loss 3.450360 (-0.95z)| norm 0.2806 (+0.42z)| lr 6.02e-04 | 4197.54 ms | 32.2% bf16 MFU | 125337 tok/s step 7149/19560 | loss 3.454576 (-0.85z)| norm 0.2588 (-0.18z)| lr 6.02e-04 | 4158.12 ms | 32.5% bf16 MFU | 125374 tok/s step 7150/19560 | loss 3.568814 (+1.83z)| norm 0.2367 (-0.79z)| lr 6.02e-04 | 4183.36 ms | 32.3% bf16 MFU | 125372 tok/s step 7151/19560 | loss 3.503969 (+0.29z)| norm 0.2481 (-0.48z)| lr 6.02e-04 | 4177.79 ms | 32.3% bf16 MFU | 125378 tok/s step 7152/19560 | loss 3.434023 (-1.35z)| norm 0.2290 (-1.00z)| lr 6.02e-04 | 4190.18 ms | 32.2% bf16 MFU | 125365 tok/s step 7153/19560 | loss 3.508198 (+0.40z)| norm 0.2671 (+0.06z)| lr 6.02e-04 | 4179.85 ms | 32.3% bf16 MFU | 125368 tok/s step 7154/19560 | loss 3.544729 (+1.26z)| norm 0.2614 (-0.09z)| lr 6.02e-04 | 4235.58 ms | 31.9% bf16 MFU | 125289 tok/s step 7155/19560 | loss 3.526504 (+0.83z)| norm 0.2733 (+0.24z)| lr 6.02e-04 | 4245.38 ms | 31.8% bf16 MFU | 125199 tok/s step 7156/19560 | loss 3.493144 (+0.04z)| norm 0.2492 (-0.44z)| lr 6.02e-04 | 4197.90 ms | 32.2% bf16 MFU | 125184 tok/s step 7157/19560 | 
loss 3.507540 (+0.38z)| norm 0.2514 (-0.38z)| lr 6.02e-04 | 4162.76 ms | 32.4% bf16 MFU | 125222 tok/s step 7158/19560 | loss 3.528979 (+0.87z)| norm 0.2686 (+0.10z)| lr 6.02e-04 | 4163.86 ms | 32.4% bf16 MFU | 125257 tok/s step 7159/19560 | loss 3.539092 (+1.10z)| norm 0.2450 (-0.56z)| lr 6.02e-04 | 4160.27 ms | 32.5% bf16 MFU | 125295 tok/s step 7160/19560 | loss 3.545916 (+1.25z)| norm 0.2712 (+0.38z)| lr 6.02e-04 | 4174.11 ms | 32.3% bf16 MFU | 125311 tok/s step 7161/19560 | loss 3.444853 (-1.10z)| norm 0.2638 (+0.06z)| lr 6.02e-04 | 4294.46 ms | 31.4% bf16 MFU | 125149 tok/s step 7162/19560 | loss 3.422155 (-1.60z)| norm 0.2698 (+0.33z)| lr 6.02e-04 | 4867.08 ms | 27.7% bf16 MFU | 124278 tok/s step 7163/19560 | loss 3.478360 (-0.30z)| norm 0.2689 (+0.29z)| lr 6.01e-04 | 4474.24 ms | 30.2% bf16 MFU | 123923 tok/s step 7164/19560 | loss 3.474214 (-0.40z)| norm 0.2751 (+0.57z)| lr 6.01e-04 | 4490.31 ms | 30.1% bf16 MFU | 123565 tok/s step 7165/19560 | loss 3.437328 (-1.23z)| norm 0.2647 (+0.11z)| lr 6.01e-04 | 4326.75 ms | 31.2% bf16 MFU | 123445 tok/s step 7166/19560 | loss 3.445711 (-1.04z)| norm 0.2779 (+0.69z)| lr 6.01e-04 | 4217.80 ms | 32.0% bf16 MFU | 123488 tok/s step 7167/19560 | loss 3.495763 (+0.10z)| norm 0.2694 (+0.31z)| lr 6.01e-04 | 4371.53 ms | 30.9% bf16 MFU | 123310 tok/s step 7168/19560 | loss 3.429200 (-1.40z)| norm 0.2411 (-0.94z)| lr 6.01e-04 | 4224.67 ms | 32.0% bf16 MFU | 123350 tok/s step 7169/19560 | loss 3.483366 (-0.14z)| norm 0.2639 (+0.07z)| lr 6.01e-04 | 4212.42 ms | 32.1% bf16 MFU | 123406 tok/s step 7170/19560 | loss 3.405450 (-2.01z)| norm 0.2566 (-0.25z)| lr 6.01e-04 | 4276.65 ms | 31.6% bf16 MFU | 123365 tok/s step 7171/19560 | loss 3.424250 (-1.54z)| norm 0.2565 (-0.25z)| lr 6.01e-04 | 4239.44 ms | 31.8% bf16 MFU | 123380 tok/s step 7172/19560 | loss 3.440256 (-1.14z)| norm 0.2421 (-0.88z)| lr 6.01e-04 | 4170.07 ms | 32.4% bf16 MFU | 123497 tok/s step 7173/19560 | loss 3.416127 (-1.69z)| norm 0.2443 (-0.78z)| lr 6.01e-04 | 
4176.25 ms | 32.3% bf16 MFU | 123600 tok/s step 7174/19560 | loss 3.442102 (-1.07z)| norm 0.2532 (-0.38z)| lr 6.01e-04 | 4209.59 ms | 32.1% bf16 MFU | 123647 tok/s step 7175/19560 | loss 3.493982 (+0.15z)| norm 0.2460 (-0.70z)| lr 6.01e-04 | 4189.31 ms | 32.2% bf16 MFU | 123722 tok/s step 7176/19560 | loss 3.499499 (+0.28z)| norm 0.2563 (-0.24z)| lr 6.01e-04 | 4400.03 ms | 30.7% bf16 MFU | 123494 tok/s step 7177/19560 | loss 3.469099 (-0.43z)| norm 0.2679 (+0.27z)| lr 6.01e-04 | 4180.53 ms | 32.3% bf16 MFU | 123590 tok/s step 7178/19560 | loss 3.462873 (-0.58z)| norm 0.2597 (-0.10z)| lr 6.01e-04 | 4202.60 ms | 32.1% bf16 MFU | 123648 tok/s step 7179/19560 | loss 3.411651 (-1.76z)| norm 0.2452 (-0.74z)| lr 6.01e-04 | 4174.81 ms | 32.3% bf16 MFU | 123745 tok/s step 7180/19560 | loss 3.532568 (+1.04z)| norm 0.2416 (-0.89z)| lr 6.00e-04 | 4176.29 ms | 32.3% bf16 MFU | 123834 tok/s step 7181/19560 | loss 3.463025 (-0.58z)| norm 0.2340 (-1.22z)| lr 6.00e-04 | 4178.77 ms | 32.3% bf16 MFU | 123916 tok/s step 7182/19560 | loss 3.507138 (+0.44z)| norm 0.2412 (-0.89z)| lr 6.00e-04 | 4180.59 ms | 32.3% bf16 MFU | 123991 tok/s step 7183/19560 | loss 3.441868 (-1.07z)| norm 0.2471 (-0.62z)| lr 6.00e-04 | 4169.88 ms | 32.4% bf16 MFU | 124078 tok/s step 7184/19560 | loss 3.421228 (-1.54z)| norm 0.2442 (-0.74z)| lr 6.00e-04 | 4192.54 ms | 32.2% bf16 MFU | 124126 tok/s step 7185/19560 | loss 3.462591 (-0.58z)| norm 0.2508 (-0.44z)| lr 6.00e-04 | 4169.29 ms | 32.4% bf16 MFU | 124208 tok/s step 7186/19560 | loss 3.513184 (+0.60z)| norm 0.2804 (+0.88z)| lr 6.00e-04 | 4204.63 ms | 32.1% bf16 MFU | 124232 tok/s step 7187/19560 | loss 3.481997 (-0.12z)| norm 0.2547 (-0.26z)| lr 6.00e-04 | 4243.66 ms | 31.8% bf16 MFU | 124198 tok/s step 7188/19560 | loss 3.399094 (-2.03z)| norm 0.2647 (+0.18z)| lr 6.00e-04 | 4181.75 ms | 32.3% bf16 MFU | 124256 tok/s step 7189/19560 | loss 3.477949 (-0.22z)| norm 0.2316 (-1.32z)| lr 6.00e-04 | 4181.81 ms | 32.3% bf16 MFU | 124312 tok/s step 7190/19560 | 
loss 3.525127 (+0.87z)| norm 0.2444 (-0.72z)| lr 6.00e-04 | 4203.22 ms | 32.1% bf16 MFU | 124333 tok/s step 7191/19560 | loss 3.453182 (-0.78z)| norm 0.2485 (-0.53z)| lr 6.00e-04 | 4181.04 ms | 32.3% bf16 MFU | 124387 tok/s step 7192/19560 | loss 3.433196 (-1.23z)| norm 0.2217 (-1.73z)| lr 6.00e-04 | 4175.30 ms | 32.3% bf16 MFU | 124446 tok/s step 7193/19560 | loss 3.442388 (-1.01z)| norm 0.2487 (-0.50z)| lr 6.00e-04 | 4178.16 ms | 32.3% bf16 MFU | 124498 tok/s step 7194/19560 | loss 3.394344 (-2.11z)| norm 0.2546 (-0.23z)| lr 6.00e-04 | 4201.45 ms | 32.1% bf16 MFU | 124512 tok/s step 7195/19560 | loss 3.453853 (-0.72z)| norm 0.2417 (-0.80z)| lr 6.00e-04 | 4249.80 ms | 31.8% bf16 MFU | 124455 tok/s step 7196/19560 | loss 3.479632 (-0.12z)| norm 0.2222 (-1.65z)| lr 6.00e-04 | 4180.78 ms | 32.3% bf16 MFU | 124502 tok/s step 7197/19560 | loss 3.461412 (-0.54z)| norm 0.2962 (+1.63z)| lr 5.99e-04 | 4240.35 ms | 31.8% bf16 MFU | 124459 tok/s step 7198/19560 | loss 3.432064 (-1.21z)| norm 0.2882 (+1.27z)| lr 5.99e-04 | 4180.63 ms | 32.3% bf16 MFU | 124507 tok/s step 7199/19560 | loss 3.525787 (+0.98z)| norm 0.3024 (+1.86z)| lr 5.99e-04 | 4166.50 ms | 32.4% bf16 MFU | 124573 tok/s step 7200/19560 | loss 3.461082 (-0.52z)| norm 0.2751 (+0.65z)| lr 5.99e-04 | 4217.28 ms | 32.0% bf16 MFU | 124560 tok/s step 7201/19560 | loss 3.481353 (-0.04z)| norm 0.2790 (+0.81z)| lr 5.99e-04 | 4175.39 ms | 32.3% bf16 MFU | 124611 tok/s step 7202/19560 | loss 3.427650 (-1.29z)| norm 0.2788 (+0.80z)| lr 5.99e-04 | 4175.97 ms | 32.3% bf16 MFU | 124658 tok/s step 7203/19560 | loss 3.463438 (-0.44z)| norm 0.2610 (+0.00z)| lr 5.99e-04 | 4178.48 ms | 32.3% bf16 MFU | 124698 tok/s step 7204/19560 | loss 3.448405 (-0.79z)| norm 0.2343 (-1.17z)| lr 5.99e-04 | 4173.59 ms | 32.4% bf16 MFU | 124744 tok/s step 7205/19560 | loss 3.453272 (-0.66z)| norm 0.2635 (+0.11z)| lr 5.99e-04 | 4186.13 ms | 32.3% bf16 MFU | 124769 tok/s step 7206/19560 | loss 3.404495 (-1.78z)| norm 0.2398 (-0.94z)| lr 5.99e-04 | 
4166.59 ms | 32.4% bf16 MFU | 124823 tok/s step 7207/19560 | loss 3.455817 (-0.57z)| norm 0.2775 (+0.72z)| lr 5.99e-04 | 4174.87 ms | 32.3% bf16 MFU | 124861 tok/s step 7208/19560 | loss 3.394855 (-2.01z)| norm 0.2473 (-0.60z)| lr 5.99e-04 | 4181.72 ms | 32.3% bf16 MFU | 124886 tok/s step 7209/19560 | loss 3.584143 (+2.42z)| norm 0.2727 (+0.52z)| lr 5.99e-04 | 4177.36 ms | 32.3% bf16 MFU | 124917 tok/s step 7210/19560 | loss 3.402362 (-1.79z)| norm 0.2507 (-0.44z)| lr 5.99e-04 | 4185.05 ms | 32.3% bf16 MFU | 124935 tok/s step 7211/19560 | loss 3.539374 (+1.36z)| norm 0.2967 (+1.59z)| lr 5.99e-04 | 4174.41 ms | 32.3% bf16 MFU | 124968 tok/s step 7212/19560 | loss 3.435255 (-1.02z)| norm 0.2692 (+0.40z)| lr 5.99e-04 | 4231.30 ms | 31.9% bf16 MFU | 124915 tok/s step 7213/19560 | loss 3.445447 (-0.78z)| norm 0.2520 (-0.36z)| lr 5.99e-04 | 4179.43 ms | 32.3% bf16 MFU | 124942 tok/s step 7214/19560 | loss 3.471568 (-0.18z)| norm 0.2674 (+0.35z)| lr 5.98e-04 | 4175.66 ms | 32.3% bf16 MFU | 124973 tok/s step 7215/19560 | loss 3.510340 (+0.70z)| norm 0.2627 (+0.14z)| lr 5.98e-04 | 4192.93 ms | 32.2% bf16 MFU | 124976 tok/s step 7216/19560 | loss 3.396185 (-1.89z)| norm 0.2560 (-0.16z)| lr 5.98e-04 | 4189.93 ms | 32.2% bf16 MFU | 124984 tok/s step 7217/19560 | loss 3.483835 (+0.09z)| norm 0.2640 (+0.20z)| lr 5.98e-04 | 4183.19 ms | 32.3% bf16 MFU | 125001 tok/s step 7218/19560 | loss 3.465591 (-0.33z)| norm 0.2599 (+0.01z)| lr 5.98e-04 | 4181.77 ms | 32.3% bf16 MFU | 125020 tok/s step 7219/19560 | loss 3.457925 (-0.49z)| norm 0.2404 (-0.88z)| lr 5.98e-04 | 4168.34 ms | 32.4% bf16 MFU | 125058 tok/s step 7220/19560 | loss 3.414392 (-1.46z)| norm 0.2549 (-0.21z)| lr 5.98e-04 | 4181.43 ms | 32.3% bf16 MFU | 125074 tok/s step 7221/19560 | loss 3.501807 (+0.52z)| norm 0.2200 (-1.79z)| lr 5.98e-04 | 4179.08 ms | 32.3% bf16 MFU | 125093 tok/s step 7222/19560 | loss 3.464343 (-0.33z)| norm 0.2579 (-0.06z)| lr 5.98e-04 | 4181.63 ms | 32.3% bf16 MFU | 125107 tok/s step 7223/19560 | 
loss 3.578326 (+2.19z)| norm 0.2998 (+1.80z)| lr 5.98e-04 | 4185.69 ms | 32.3% bf16 MFU | 125115 tok/s step 7224/19560 | loss 3.561700 (+1.79z)| norm 0.2915 (+1.40z)| lr 5.98e-04 | 4174.81 ms | 32.3% bf16 MFU | 125138 tok/s step 7225/19560 | loss 3.509339 (+0.62z)| norm 0.2954 (+1.55z)| lr 5.98e-04 | 4167.28 ms | 32.4% bf16 MFU | 125172 tok/s step 7226/19560 | loss 3.500695 (+0.45z)| norm 0.2607 (+0.00z)| lr 5.98e-04 | 4177.05 ms | 32.3% bf16 MFU | 125189 tok/s step 7227/19560 | loss 3.506050 (+0.57z)| norm 0.2900 (+1.30z)| lr 5.98e-04 | 4246.42 ms | 31.8% bf16 MFU | 125103 tok/s step 7228/19560 | loss 3.439421 (-0.91z)| norm 0.2843 (+1.04z)| lr 5.98e-04 | 4172.05 ms | 32.4% bf16 MFU | 125131 tok/s step 7229/19560 | loss 3.496261 (+0.35z)| norm 0.2444 (-0.72z)| lr 5.98e-04 | 4174.21 ms | 32.3% bf16 MFU | 125155 tok/s step 7230/19560 | loss 3.547556 (+1.46z)| norm 0.3030 (+2.31z)| lr 5.98e-04 | 4198.55 ms | 32.2% bf16 MFU | 125141 tok/s step 7231/19560 | loss 3.469524 (-0.25z)| norm 0.2797 (+1.07z)| lr 5.97e-04 | 4195.56 ms | 32.2% bf16 MFU | 125132 tok/s step 7232/19560 | loss 3.470186 (-0.23z)| norm 0.2821 (+1.19z)| lr 5.97e-04 | 4176.59 ms | 32.3% bf16 MFU | 125152 tok/s step 7233/19560 | loss 3.511068 (+0.66z)| norm 0.2746 (+0.78z)| lr 5.97e-04 | 4175.33 ms | 32.3% bf16 MFU | 125172 tok/s step 7234/19560 | loss 3.389498 (-1.97z)| norm 0.2620 (+0.10z)| lr 5.97e-04 | 4180.52 ms | 32.3% bf16 MFU | 125184 tok/s step 7235/19560 | loss 3.536056 (+1.22z)| norm 0.2574 (-0.13z)| lr 5.97e-04 | 4192.72 ms | 32.2% bf16 MFU | 125178 tok/s step 7236/19560 | loss 3.651211 (+3.54z)| norm 0.2714 (+0.62z)| lr 5.97e-04 | 4176.13 ms | 32.3% bf16 MFU | 125196 tok/s step 7237/19560 | loss 3.484979 (+0.10z)| norm 0.2636 (+0.21z)| lr 5.97e-04 | 4176.55 ms | 32.3% bf16 MFU | 125213 tok/s step 7238/19560 | loss 3.424149 (-1.17z)| norm 0.2766 (+0.91z)| lr 5.97e-04 | 4174.47 ms | 32.3% bf16 MFU | 125232 tok/s step 7239/19560 | loss 3.452276 (-0.57z)| norm 0.2598 (-0.01z)| lr 5.97e-04 | 
4184.10 ms | 32.3% bf16 MFU | 125235 tok/s step 7240/19560 | loss 3.549445 (+1.47z)| norm 0.2744 (+0.78z)| lr 5.97e-04 | 4179.05 ms | 32.3% bf16 MFU | 125246 tok/s step 7241/19560 | loss 3.563168 (+1.73z)| norm 0.2524 (-0.42z)| lr 5.97e-04 | 4167.08 ms | 32.4% bf16 MFU | 125275 tok/s step 7242/19560 | loss 3.401801 (-1.60z)| norm 0.2681 (+0.43z)| lr 5.97e-04 | 4169.40 ms | 32.4% bf16 MFU | 125299 tok/s step 7243/19560 | loss 3.472412 (-0.15z)| norm 0.2636 (+0.19z)| lr 5.97e-04 | 4181.45 ms | 32.3% bf16 MFU | 125303 tok/s step 7244/19560 | loss 3.495096 (+0.32z)| norm 0.2697 (+0.52z)| lr 5.97e-04 | 4182.66 ms | 32.3% bf16 MFU | 125305 tok/s step 7245/19560 | loss 3.526844 (+0.97z)| norm 0.2391 (-1.16z)| lr 5.97e-04 | 4172.81 ms | 32.4% bf16 MFU | 125322 tok/s step 7246/19560 | loss 3.485055 (+0.10z)| norm 0.2378 (-1.24z)| lr 5.97e-04 | 4172.08 ms | 32.4% bf16 MFU | 125339 tok/s step 7247/19560 | loss 3.410963 (-1.40z)| norm 0.2353 (-1.38z)| lr 5.97e-04 | 4179.03 ms | 32.3% bf16 MFU | 125345 tok/s step 7248/19560 | loss 3.419054 (-1.22z)| norm 0.2346 (-1.41z)| lr 5.96e-04 | 4171.46 ms | 32.4% bf16 MFU | 125362 tok/s step 7249/19560 | loss 3.454961 (-0.48z)| norm 0.2390 (-1.16z)| lr 5.96e-04 | 4189.90 ms | 32.2% bf16 MFU | 125351 tok/s step 7250/19560 | loss 3.409615 (-1.39z)| norm 0.2575 (-0.15z)| lr 5.96e-04 | 4189.20 ms | 32.2% bf16 MFU | 125341 tok/s val loss 3.456047 HellaSwag: 2840/10042 = 0.282812 step 7251/19560 | loss 3.461209 (-0.36z)| norm 0.2223 (-2.03z)| lr 5.96e-04 | 4415.27 ms | 30.6% bf16 MFU | 125011 tok/s step 7252/19560 | loss 3.491330 (+0.25z)| norm 0.2448 (-0.85z)| lr 5.96e-04 | 4894.48 ms | 27.6% bf16 MFU | 124116 tok/s step 7253/19560 | loss 3.380468 (-1.95z)| norm 0.2610 (+0.04z)| lr 5.96e-04 | 4320.68 ms | 31.2% bf16 MFU | 123978 tok/s step 7254/19560 | loss 3.398843 (-1.56z)| norm 0.2619 (+0.08z)| lr 5.96e-04 | 4198.42 ms | 32.2% bf16 MFU | 124023 tok/s step 7255/19560 | loss 3.419698 (-1.13z)| norm 0.2479 (-0.70z)| lr 5.96e-04 | 4230.54 ms | 31.9% bf16 MFU | 124018 tok/s step 7256/19560 | loss 3.448963 
(-0.55z)| norm 0.2661 (+0.30z)| lr 5.96e-04 | 4356.74 ms | 31.0% bf16 MFU | 123834 tok/s step 7257/19560 | loss 3.476740 (+0.02z)| norm 0.2580 (-0.15z)| lr 5.96e-04 | 4188.07 ms | 32.2% bf16 MFU | 123902 tok/s step 7258/19560 | loss 3.434758 (-0.82z)| norm 0.2614 (+0.05z)| lr 5.96e-04 | 4301.58 ms | 31.4% bf16 MFU | 123801 tok/s step 7259/19560 | loss 3.518676 (+0.85z)| norm 0.2586 (-0.11z)| lr 5.96e-04 | 4182.91 ms | 32.3% bf16 MFU | 123878 tok/s step 7260/19560 | loss 3.461155 (-0.29z)| norm 0.2726 (+0.68z)| lr 5.96e-04 | 4204.58 ms | 32.1% bf16 MFU | 123918 tok/s step 7261/19560 | loss 3.435373 (-0.79z)| norm 0.2715 (+0.61z)| lr 5.96e-04 | 4206.31 ms | 32.1% bf16 MFU | 123955 tok/s step 7262/19560 | loss 3.442385 (-0.64z)| norm 0.2349 (-1.43z)| lr 5.96e-04 | 4247.52 ms | 31.8% bf16 MFU | 123929 tok/s step 7263/19560 | loss 3.448859 (-0.50z)| norm 0.2638 (+0.19z)| lr 5.96e-04 | 4196.84 ms | 32.2% bf16 MFU | 123978 tok/s step 7264/19560 | loss 3.473600 (+0.00z)| norm 0.2673 (+0.38z)| lr 5.96e-04 | 4352.56 ms | 31.0% bf16 MFU | 123802 tok/s step 7265/19560 | loss 3.440551 (-0.66z)| norm 0.2624 (+0.12z)| lr 5.95e-04 | 4193.43 ms | 32.2% bf16 MFU | 123863 tok/s step 7266/19560 | loss 3.470879 (-0.04z)| norm 0.2911 (+1.70z)| lr 5.95e-04 | 4170.24 ms | 32.4% bf16 MFU | 123956 tok/s step 7267/19560 | loss 3.435095 (-0.76z)| norm 0.2721 (+0.65z)| lr 5.95e-04 | 4176.80 ms | 32.3% bf16 MFU | 124035 tok/s step 7268/19560 | loss 3.418993 (-1.07z)| norm 0.2570 (-0.18z)| lr 5.95e-04 | 4180.77 ms | 32.3% bf16 MFU | 124103 tok/s step 7269/19560 | loss 3.453818 (-0.35z)| norm 0.2861 (+1.41z)| lr 5.95e-04 | 4182.54 ms | 32.3% bf16 MFU | 124166 tok/s step 7270/19560 | loss 3.511013 (+0.82z)| norm 0.2650 (+0.25z)| lr 5.95e-04 | 4188.33 ms | 32.2% bf16 MFU | 124216 tok/s step 7271/19560 | loss 3.449648 (-0.43z)| norm 0.2832 (+1.29z)| lr 5.95e-04 | 4208.71 ms | 32.1% bf16 MFU | 124234 tok/s step 7272/19560 | loss 3.407503 (-1.29z)| norm 0.2709 (+0.59z)| lr 5.95e-04 | 4174.10 ms | 
32.3% bf16 MFU | 124303 tok/s step 7273/19560 | loss 3.471513 (+0.04z)| norm 0.2803 (+1.11z)| lr 5.95e-04 | 4179.44 ms | 32.3% bf16 MFU | 124360 tok/s step 7274/19560 | loss 3.452500 (-0.36z)| norm 0.2622 (+0.10z)| lr 5.95e-04 | 4182.14 ms | 32.3% bf16 MFU | 124410 tok/s step 7275/19560 | loss 3.433427 (-0.74z)| norm 0.3022 (+2.29z)| lr 5.95e-04 | 4174.01 ms | 32.3% bf16 MFU | 124470 tok/s step 7276/19560 | loss 3.509335 (+0.84z)| norm 0.2777 (+0.93z)| lr 5.95e-04 | 4220.69 ms | 32.0% bf16 MFU | 124457 tok/s step 7277/19560 | loss 3.393164 (-1.57z)| norm 0.2472 (-0.75z)| lr 5.95e-04 | 4193.30 ms | 32.2% bf16 MFU | 124486 tok/s step 7278/19560 | loss 3.509279 (+0.86z)| norm 0.2823 (+1.17z)| lr 5.95e-04 | 4247.03 ms | 31.8% bf16 MFU | 124434 tok/s step 7279/19560 | loss 3.485468 (+0.37z)| norm 0.2758 (+0.80z)| lr 5.95e-04 | 4199.38 ms | 32.2% bf16 MFU | 124455 tok/s step 7280/19560 | loss 3.447242 (-0.44z)| norm 0.2327 (-1.59z)| lr 5.95e-04 | 4221.54 ms | 32.0% bf16 MFU | 124442 tok/s step 7281/19560 | loss 3.441306 (-0.56z)| norm 0.2651 (+0.21z)| lr 5.95e-04 | 4183.49 ms | 32.3% bf16 MFU | 124486 tok/s step 7282/19560 | loss 3.421680 (-0.96z)| norm 0.2645 (+0.17z)| lr 5.94e-04 | 4203.10 ms | 32.1% bf16 MFU | 124498 tok/s step 7283/19560 | loss 3.452575 (-0.29z)| norm 0.2509 (-0.57z)| lr 5.94e-04 | 4189.49 ms | 32.2% bf16 MFU | 124531 tok/s step 7284/19560 | loss 3.436788 (-0.62z)| norm 0.2759 (+0.80z)| lr 5.94e-04 | 4192.07 ms | 32.2% bf16 MFU | 124557 tok/s step 7285/19560 | loss 3.414445 (-1.08z)| norm 0.2404 (-1.16z)| lr 5.94e-04 | 4184.14 ms | 32.3% bf16 MFU | 124595 tok/s step 7286/19560 | loss 3.479083 (+0.31z)| norm 0.2874 (+1.42z)| lr 5.94e-04 | 4189.85 ms | 32.2% bf16 MFU | 124622 tok/s step 7287/19560 | loss 3.477430 (+0.29z)| norm 0.2833 (+1.18z)| lr 5.94e-04 | 4169.32 ms | 32.4% bf16 MFU | 124678 tok/s step 7288/19560 | loss 3.457621 (-0.13z)| norm 0.2654 (+0.20z)| lr 5.94e-04 | 4184.27 ms | 32.3% bf16 MFU | 124709 tok/s step 7289/19560 | loss 3.464838 
(+0.03z)| norm 0.2798 (+0.98z)| lr 5.94e-04 | 4170.53 ms | 32.4% bf16 MFU | 124759 tok/s step 7290/19560 | loss 3.406039 (-1.26z)| norm 0.2521 (-0.52z)| lr 5.94e-04 | 4192.84 ms | 32.2% bf16 MFU | 124773 tok/s step 7291/19560 | loss 3.465147 (+0.04z)| norm 0.2829 (+1.14z)| lr 5.94e-04 | 4164.48 ms | 32.4% bf16 MFU | 124830 tok/s step 7292/19560 | loss 3.451991 (-0.25z)| norm 0.2901 (+1.52z)| lr 5.94e-04 | 4170.36 ms | 32.4% bf16 MFU | 124874 tok/s step 7293/19560 | loss 3.471005 (+0.17z)| norm 0.3089 (+2.45z)| lr 5.94e-04 | 4186.26 ms | 32.3% bf16 MFU | 124892 tok/s step 7294/19560 | loss 3.447191 (-0.36z)| norm 0.2591 (-0.16z)| lr 5.94e-04 | 4174.78 ms | 32.3% bf16 MFU | 124927 tok/s step 7295/19560 | loss 3.411677 (-1.12z)| norm 0.2760 (+0.73z)| lr 5.94e-04 | 4172.99 ms | 32.4% bf16 MFU | 124962 tok/s step 7296/19560 | loss 3.485992 (+0.50z)| norm 0.2454 (-0.89z)| lr 5.94e-04 | 4182.54 ms | 32.3% bf16 MFU | 124982 tok/s step 7297/19560 | loss 3.446269 (-0.37z)| norm 0.2584 (-0.20z)| lr 5.94e-04 | 4173.38 ms | 32.4% bf16 MFU | 125014 tok/s step 7298/19560 | loss 3.463375 (-0.00z)| norm 0.2464 (-0.83z)| lr 5.93e-04 | 4181.81 ms | 32.3% bf16 MFU | 125032 tok/s step 7299/19560 | loss 3.424289 (-0.86z)| norm 0.2671 (+0.26z)| lr 5.93e-04 | 4166.71 ms | 32.4% bf16 MFU | 125072 tok/s step 7300/19560 | loss 3.433220 (-0.67z)| norm 0.2516 (-0.56z)| lr 5.93e-04 | 4186.46 ms | 32.3% bf16 MFU | 125080 tok/s step 7301/19560 | loss 3.554920 (+1.98z)| norm 0.2562 (-0.32z)| lr 5.93e-04 | 4218.10 ms | 32.0% bf16 MFU | 125041 tok/s step 7302/19560 | loss 3.452627 (-0.26z)| norm 0.2693 (+0.36z)| lr 5.93e-04 | 4187.16 ms | 32.2% bf16 MFU | 125049 tok/s step 7303/19560 | loss 3.448287 (-0.35z)| norm 0.2495 (-0.69z)| lr 5.93e-04 | 4174.58 ms | 32.3% bf16 MFU | 125076 tok/s step 7304/19560 | loss 3.472521 (+0.19z)| norm 0.2559 (-0.35z)| lr 5.93e-04 | 4210.64 ms | 32.1% bf16 MFU | 125048 tok/s step 7305/19560 | loss 3.469871 (+0.13z)| norm 0.2584 (-0.22z)| lr 5.93e-04 | 4173.78 ms | 
32.3% bf16 MFU | 125077 tok/s step 7306/19560 | loss 3.434485 (-0.64z)| norm 0.2642 (+0.09z)| lr 5.93e-04 | 4192.09 ms | 32.2% bf16 MFU | 125076 tok/s step 7307/19560 | loss 3.479155 (+0.33z)| norm 0.2741 (+0.61z)| lr 5.93e-04 | 4202.80 ms | 32.1% bf16 MFU | 125060 tok/s step 7308/19560 | loss 3.470518 (+0.15z)| norm 0.2487 (-0.75z)| lr 5.93e-04 | 4189.07 ms | 32.2% bf16 MFU | 125065 tok/s step 7309/19560 | loss 3.477564 (+0.30z)| norm 0.2401 (-1.22z)| lr 5.93e-04 | 4178.76 ms | 32.3% bf16 MFU | 125085 tok/s step 7310/19560 | loss 3.436976 (-0.59z)| norm 0.2371 (-1.37z)| lr 5.93e-04 | 4174.86 ms | 32.3% bf16 MFU | 125109 tok/s step 7311/19560 | loss 3.424239 (-0.86z)| norm 0.2367 (-1.38z)| lr 5.93e-04 | 4379.01 ms | 30.8% bf16 MFU | 124840 tok/s step 7312/19560 | loss 3.483940 (+0.45z)| norm 0.2475 (-0.81z)| lr 5.93e-04 | 4223.02 ms | 32.0% bf16 MFU | 124806 tok/s step 7313/19560 | loss 3.436601 (-0.60z)| norm 0.2643 (+0.08z)| lr 5.93e-04 | 4173.05 ms | 32.4% bf16 MFU | 124847 tok/s step 7314/19560 | loss 3.441139 (-0.49z)| norm 0.2659 (+0.17z)| lr 5.93e-04 | 4180.29 ms | 32.3% bf16 MFU | 124876 tok/s step 7315/19560 | loss 3.397904 (-1.42z)| norm 0.2490 (-0.73z)| lr 5.92e-04 | 4182.19 ms | 32.3% bf16 MFU | 124900 tok/s step 7316/19560 | loss 3.462063 (-0.02z)| norm 0.2596 (-0.16z)| lr 5.92e-04 | 4177.65 ms | 32.3% bf16 MFU | 124930 tok/s step 7317/19560 | loss 3.504957 (+0.93z)| norm 0.2502 (-0.68z)| lr 5.92e-04 | 4313.03 ms | 31.3% bf16 MFU | 124762 tok/s step 7318/19560 | loss 3.433400 (-0.65z)| norm 0.2584 (-0.24z)| lr 5.92e-04 | 4170.63 ms | 32.4% bf16 MFU | 124809 tok/s step 7319/19560 | loss 3.401587 (-1.34z)| norm 0.2551 (-0.42z)| lr 5.92e-04 | 4184.64 ms | 32.3% bf16 MFU | 124833 tok/s step 7320/19560 | loss 3.502794 (+0.90z)| norm 0.2644 (+0.06z)| lr 5.92e-04 | 4193.87 ms | 32.2% bf16 MFU | 124842 tok/s step 7321/19560 | loss 3.406137 (-1.24z)| norm 0.2870 (+1.29z)| lr 5.92e-04 | 4179.37 ms | 32.3% bf16 MFU | 124872 tok/s step 7322/19560 | loss 3.485757 
(+0.51z)| norm 0.2734 (+0.53z)| lr 5.92e-04 | 4180.59 ms | 32.3% bf16 MFU | 124899 tok/s step 7323/19560 | loss 3.482337 (+0.43z)| norm 0.2458 (-0.99z)| lr 5.92e-04 | 4177.88 ms | 32.3% bf16 MFU | 124929 tok/s step 7324/19560 | loss 3.458858 (-0.09z)| norm 0.2898 (+1.42z)| lr 5.92e-04 | 4183.44 ms | 32.3% bf16 MFU | 124948 tok/s step 7325/19560 | loss 3.386461 (-1.67z)| norm 0.2458 (-1.02z)| lr 5.92e-04 | 4191.10 ms | 32.2% bf16 MFU | 124956 tok/s step 7326/19560 | loss 3.424476 (-0.83z)| norm 0.2613 (-0.13z)| lr 5.92e-04 | 4173.03 ms | 32.4% bf16 MFU | 124990 tok/s step 7327/19560 | loss 3.420355 (-0.91z)| norm 0.2647 (+0.08z)| lr 5.92e-04 | 4170.74 ms | 32.4% bf16 MFU | 125026 tok/s step 7328/19560 | loss 3.493351 (+0.70z)| norm 0.2431 (-1.16z)| lr 5.92e-04 | 4185.21 ms | 32.3% bf16 MFU | 125038 tok/s step 7329/19560 | loss 3.441208 (-0.44z)| norm 0.2496 (-0.77z)| lr 5.92e-04 | 4192.85 ms | 32.2% bf16 MFU | 125038 tok/s step 7330/19560 | loss 3.395086 (-1.45z)| norm 0.2725 (+0.56z)| lr 5.92e-04 | 4183.15 ms | 32.3% bf16 MFU | 125053 tok/s step 7331/19560 | loss 3.542158 (+1.74z)| norm 0.2632 (+0.02z)| lr 5.92e-04 | 4177.40 ms | 32.3% bf16 MFU | 125076 tok/s step 7332/19560 | loss 3.433548 (-0.61z)| norm 0.2647 (+0.09z)| lr 5.91e-04 | 4196.49 ms | 32.2% bf16 MFU | 125069 tok/s step 7333/19560 | loss 3.447310 (-0.31z)| norm 0.2523 (-0.62z)| lr 5.91e-04 | 4179.91 ms | 32.3% bf16 MFU | 125087 tok/s step 7334/19560 | loss 3.486902 (+0.54z)| norm 0.2348 (-1.64z)| lr 5.91e-04 | 4234.52 ms | 31.9% bf16 MFU | 125023 tok/s step 7335/19560 | loss 3.445144 (-0.37z)| norm 0.2458 (-0.99z)| lr 5.91e-04 | 4252.03 ms | 31.8% bf16 MFU | 124937 tok/s step 7336/19560 | loss 3.451350 (-0.25z)| norm 0.2339 (-1.65z)| lr 5.91e-04 | 4176.08 ms | 32.3% bf16 MFU | 124967 tok/s step 7337/19560 | loss 3.412841 (-1.08z)| norm 0.2362 (-1.50z)| lr 5.91e-04 | 4163.89 ms | 32.4% bf16 MFU | 125015 tok/s step 7338/19560 | loss 3.473211 (+0.26z)| norm 0.2606 (-0.11z)| lr 5.91e-04 | 4179.98 ms | 
32.3% bf16 MFU | 125035 tok/s step 7339/19560 | loss 3.387323 (-1.66z)| norm 0.2377 (-1.40z)| lr 5.91e-04 | 4174.32 ms | 32.3% bf16 MFU | 125064 tok/s step 7340/19560 | loss 3.394539 (-1.47z)| norm 0.2346 (-1.55z)| lr 5.91e-04 | 4171.35 ms | 32.4% bf16 MFU | 125095 tok/s step 7341/19560 | loss 3.427146 (-0.74z)| norm 0.2381 (-1.34z)| lr 5.91e-04 | 4219.49 ms | 32.0% bf16 MFU | 125053 tok/s step 7342/19560 | loss 3.443319 (-0.37z)| norm 0.2212 (-2.23z)| lr 5.91e-04 | 4222.57 ms | 32.0% bf16 MFU | 125008 tok/s step 7343/19560 | loss 3.409473 (-1.11z)| norm 0.2620 (+0.04z)| lr 5.91e-04 | 4204.92 ms | 32.1% bf16 MFU | 124992 tok/s step 7344/19560 | loss 3.450752 (-0.20z)| norm 0.2238 (-2.04z)| lr 5.91e-04 | 4184.21 ms | 32.3% bf16 MFU | 125008 tok/s step 7345/19560 | loss 3.480404 (+0.47z)| norm 0.2580 (-0.16z)| lr 5.91e-04 | 4171.18 ms | 32.4% bf16 MFU | 125042 tok/s step 7346/19560 | loss 3.450857 (-0.19z)| norm 0.2302 (-1.65z)| lr 5.91e-04 | 4172.31 ms | 32.4% bf16 MFU | 125073 tok/s step 7347/19560 | loss 3.392820 (-1.48z)| norm 0.2636 (+0.15z)| lr 5.91e-04 | 4257.28 ms | 31.7% bf16 MFU | 124977 tok/s step 7348/19560 | loss 3.546348 (+1.91z)| norm 0.2415 (-1.05z)| lr 5.91e-04 | 4163.91 ms | 32.4% bf16 MFU | 125023 tok/s step 7349/19560 | loss 3.415072 (-0.98z)| norm 0.2367 (-1.33z)| lr 5.90e-04 | 4182.87 ms | 32.3% bf16 MFU | 125039 tok/s step 7350/19560 | loss 3.458981 (-0.01z)| norm 0.2717 (+0.59z)| lr 5.90e-04 | 4190.75 ms | 32.2% bf16 MFU | 125043 tok/s step 7351/19560 | loss 3.421682 (-0.82z)| norm 0.2441 (-0.92z)| lr 5.90e-04 | 4177.92 ms | 32.3% bf16 MFU | 125065 tok/s step 7352/19560 | loss 3.465149 (+0.18z)| norm 0.2507 (-0.54z)| lr 5.90e-04 | 4736.98 ms | 28.5% bf16 MFU | 124346 tok/s step 7353/19560 | loss 3.453396 (-0.08z)| norm 0.2345 (-1.44z)| lr 5.90e-04 | 4706.25 ms | 28.7% bf16 MFU | 123699 tok/s step 7354/19560 | loss 3.479723 (+0.54z)| norm 0.2949 (+1.95z)| lr 5.90e-04 | 4325.23 ms | 31.2% bf16 MFU | 123574 tok/s step 7355/19560 | loss 3.512005 
(+1.29z)| norm 0.2799 (+1.12z)| lr 5.90e-04 | 4435.43 ms | 30.4% bf16 MFU | 123306 tok/s step 7356/19560 | loss 3.561867 (+2.38z)| norm 0.2931 (+1.85z)| lr 5.90e-04 | 4483.49 ms | 30.1% bf16 MFU | 122988 tok/s step 7357/19560 | loss 3.437682 (-0.45z)| norm 0.2570 (-0.18z)| lr 5.90e-04 | 4613.47 ms | 29.3% bf16 MFU | 122520 tok/s step 7358/19560 | loss 3.440370 (-0.37z)| norm 0.2929 (+1.86z)| lr 5.90e-04 | 4269.32 ms | 31.6% bf16 MFU | 122534 tok/s step 7359/19560 | loss 3.488840 (+0.75z)| norm 0.2692 (+0.53z)| lr 5.90e-04 | 4180.85 ms | 32.3% bf16 MFU | 122678 tok/s step 7360/19560 | loss 3.462777 (+0.15z)| norm 0.2456 (-0.80z)| lr 5.90e-04 | 4210.58 ms | 32.1% bf16 MFU | 122770 tok/s step 7361/19560 | loss 3.419014 (-0.86z)| norm 0.2692 (+0.55z)| lr 5.90e-04 | 4255.74 ms | 31.7% bf16 MFU | 122791 tok/s step 7362/19560 | loss 3.431864 (-0.57z)| norm 0.2802 (+1.16z)| lr 5.90e-04 | 4266.22 ms | 31.6% bf16 MFU | 122796 tok/s step 7363/19560 | loss 3.458746 (+0.08z)| norm 0.2515 (-0.47z)| lr 5.90e-04 | 4210.77 ms | 32.1% bf16 MFU | 122882 tok/s step 7364/19560 | loss 3.485228 (+0.81z)| norm 0.2636 (+0.22z)| lr 5.90e-04 | 4259.86 ms | 31.7% bf16 MFU | 122892 tok/s step 7365/19560 | loss 3.483569 (+0.77z)| norm 0.2717 (+0.68z)| lr 5.90e-04 | 4176.42 ms | 32.3% bf16 MFU | 123024 tok/s step 7366/19560 | loss 3.525447 (+1.83z)| norm 0.2562 (-0.19z)| lr 5.89e-04 | 4163.32 ms | 32.4% bf16 MFU | 123169 tok/s step 7367/19560 | loss 3.424974 (-0.77z)| norm 0.2987 (+2.18z)| lr 5.89e-04 | 4228.86 ms | 31.9% bf16 MFU | 123210 tok/s step 7368/19560 | loss 3.486702 (+0.86z)| norm 0.2794 (+1.09z)| lr 5.89e-04 | 4180.79 ms | 32.3% bf16 MFU | 123319 tok/s step 7369/19560 | loss 3.497026 (+1.18z)| norm 0.2620 (+0.11z)| lr 5.89e-04 | 4215.93 ms | 32.0% bf16 MFU | 123371 tok/s step 7370/19560 | loss 3.457568 (+0.09z)| norm 0.2687 (+0.49z)| lr 5.89e-04 | 4221.46 ms | 32.0% bf16 MFU | 123413 tok/s step 7371/19560 | loss 3.502062 (+1.30z)| norm 0.2720 (+0.67z)| lr 5.89e-04 | 4190.86 ms | 
32.2% bf16 MFU | 123497 tok/s step 7372/19560 | loss 3.460731 (+0.18z)| norm 0.2619 (+0.11z)| lr 5.89e-04 | 4189.50 ms | 32.2% bf16 MFU | 123579 tok/s step 7373/19560 | loss 3.417686 (-0.98z)| norm 0.2913 (+1.71z)| lr 5.89e-04 | 4245.44 ms | 31.8% bf16 MFU | 123575 tok/s step 7374/19560 | loss 3.482175 (+0.80z)| norm 0.2340 (-1.46z)| lr 5.89e-04 | 4202.15 ms | 32.1% bf16 MFU | 123635 tok/s step 7375/19560 | loss 3.548294 (+2.55z)| norm 0.2412 (-1.07z)| lr 5.89e-04 | 4280.68 ms | 31.5% bf16 MFU | 123577 tok/s step 7376/19560 | loss 3.626982 (+4.30z)| norm 0.2765 (+0.88z)| lr 5.89e-04 | 4315.49 ms | 31.3% bf16 MFU | 123472 tok/s step 7377/19560 | loss 3.487183 (+0.78z)| norm 0.2844 (+1.30z)| lr 5.89e-04 | 4171.21 ms | 32.4% bf16 MFU | 123583 tok/s step 7378/19560 | loss 3.516281 (+1.48z)| norm 0.2459 (-0.84z)| lr 5.89e-04 | 4176.61 ms | 32.3% bf16 MFU | 123681 tok/s step 7379/19560 | loss 3.444693 (-0.30z)| norm 0.2569 (-0.25z)| lr 5.89e-04 | 4178.26 ms | 32.3% bf16 MFU | 123771 tok/s step 7380/19560 | loss 3.415449 (-1.02z)| norm 0.2694 (+0.45z)| lr 5.89e-04 | 4208.15 ms | 32.1% bf16 MFU | 123812 tok/s step 7381/19560 | loss 3.482581 (+0.64z)| norm 0.2279 (-1.87z)| lr 5.89e-04 | 4290.51 ms | 31.5% bf16 MFU | 123731 tok/s step 7382/19560 | loss 3.505961 (+1.22z)| norm 0.2453 (-0.88z)| lr 5.88e-04 | 4199.96 ms | 32.1% bf16 MFU | 123786 tok/s step 7383/19560 | loss 3.507467 (+1.24z)| norm 0.2521 (-0.50z)| lr 5.88e-04 | 4168.62 ms | 32.4% bf16 MFU | 123885 tok/s step 7384/19560 | loss 3.499842 (+1.03z)| norm 0.2276 (-1.84z)| lr 5.88e-04 | 4226.74 ms | 31.9% bf16 MFU | 123893 tok/s step 7385/19560 | loss 3.477404 (+0.46z)| norm 0.2444 (-0.90z)| lr 5.88e-04 | 4173.87 ms | 32.3% bf16 MFU | 123979 tok/s step 7386/19560 | loss 3.424071 (-0.88z)| norm 0.2623 (+0.09z)| lr 5.88e-04 | 4200.18 ms | 32.1% bf16 MFU | 124021 tok/s step 7387/19560 | loss 3.525238 (+1.66z)| norm 0.2493 (-0.62z)| lr 5.88e-04 | 4196.13 ms | 32.2% bf16 MFU | 124067 tok/s step 7388/19560 | loss 3.501913 
(+1.06z)| norm 0.2527 (-0.43z)| lr 5.88e-04 | 4179.20 ms | 32.3% bf16 MFU | 124137 tok/s step 7389/19560 | loss 3.539301 (+1.95z)| norm 0.2541 (-0.34z)| lr 5.88e-04 | 4174.14 ms | 32.3% bf16 MFU | 124210 tok/s step 7390/19560 | loss 3.578011 (+2.79z)| norm 0.2621 (+0.08z)| lr 5.88e-04 | 4176.74 ms | 32.3% bf16 MFU | 124276 tok/s step 7391/19560 | loss 3.476820 (+0.37z)| norm 0.2766 (+0.88z)| lr 5.88e-04 | 4185.91 ms | 32.3% bf16 MFU | 124325 tok/s step 7392/19560 | loss 3.568469 (+2.48z)| norm 0.2573 (-0.18z)| lr 5.88e-04 | 4255.88 ms | 31.7% bf16 MFU | 124268 tok/s step 7393/19560 | loss 3.427374 (-0.81z)| norm 0.3098 (+2.63z)| lr 5.88e-04 | 4244.79 ms | 31.8% bf16 MFU | 124230 tok/s step 7394/19560 | loss 3.449729 (-0.28z)| norm 0.2763 (+0.84z)| lr 5.88e-04 | 4524.00 ms | 29.8% bf16 MFU | 123813 tok/s step 7395/19560 | loss 3.493075 (+0.72z)| norm 0.2594 (-0.07z)| lr 5.88e-04 | 4179.51 ms | 32.3% bf16 MFU | 123895 tok/s step 7396/19560 | loss 3.505092 (+0.98z)| norm 0.2732 (+0.67z)| lr 5.88e-04 | 4183.94 ms | 32.3% bf16 MFU | 123965 tok/s step 7397/19560 | loss 3.493881 (+0.71z)| norm 0.2663 (+0.30z)| lr 5.88e-04 | 4179.78 ms | 32.3% bf16 MFU | 124039 tok/s step 7398/19560 | loss 3.501940 (+0.90z)| norm 0.2691 (+0.45z)| lr 5.88e-04 | 4164.03 ms | 32.4% bf16 MFU | 124132 tok/s step 7399/19560 | loss 3.428136 (-0.81z)| norm 0.2635 (+0.16z)| lr 5.87e-04 | 4196.92 ms | 32.2% bf16 MFU | 124172 tok/s step 7400/19560 | loss 3.501030 (+0.87z)| norm 0.2550 (-0.30z)| lr 5.87e-04 | 4181.95 ms | 32.3% bf16 MFU | 124232 tok/s step 7401/19560 | loss 3.605590 (+3.15z)| norm 0.2670 (+0.37z)| lr 5.87e-04 | 4235.16 ms | 31.9% bf16 MFU | 124210 tok/s step 7402/19560 | loss 3.416076 (-1.08z)| norm 0.2585 (-0.10z)| lr 5.87e-04 | 4173.24 ms | 32.4% bf16 MFU | 124281 tok/s step 7403/19560 | loss 3.449794 (-0.33z)| norm 0.2828 (+1.27z)| lr 5.87e-04 | 4175.22 ms | 32.3% bf16 MFU | 124345 tok/s step 7404/19560 | loss 3.414018 (-1.11z)| norm 0.2591 (-0.05z)| lr 5.87e-04 | 4163.12 ms | 
32.4% bf16 MFU | 124425 tok/s step 7405/19560 | loss 3.436861 (-0.62z)| norm 0.2691 (+0.50z)| lr 5.87e-04 | 4174.05 ms | 32.3% bf16 MFU | 124484 tok/s step 7406/19560 | loss 3.492824 (+0.65z)| norm 0.2667 (+0.38z)| lr 5.87e-04 | 4182.54 ms | 32.3% bf16 MFU | 124527 tok/s step 7407/19560 | loss 3.475873 (+0.27z)| norm 0.2476 (-0.69z)| lr 5.87e-04 | 4174.05 ms | 32.3% bf16 MFU | 124581 tok/s step 7408/19560 | loss 3.416339 (-1.07z)| norm 0.2513 (-0.50z)| lr 5.87e-04 | 4183.78 ms | 32.3% bf16 MFU | 124618 tok/s step 7409/19560 | loss 3.482284 (+0.41z)| norm 0.2353 (-1.39z)| lr 5.87e-04 | 4211.86 ms | 32.1% bf16 MFU | 124611 tok/s step 7410/19560 | loss 3.452509 (-0.27z)| norm 0.2413 (-1.03z)| lr 5.87e-04 | 4184.01 ms | 32.3% bf16 MFU | 124646 tok/s step 7411/19560 | loss 3.515124 (+1.13z)| norm 0.2847 (+1.40z)| lr 5.87e-04 | 4177.21 ms | 32.3% bf16 MFU | 124689 tok/s step 7412/19560 | loss 3.411491 (-1.19z)| norm 0.2239 (-1.97z)| lr 5.87e-04 | 4193.96 ms | 32.2% bf16 MFU | 124705 tok/s step 7413/19560 | loss 3.518678 (+1.19z)| norm 0.2630 (+0.19z)| lr 5.87e-04 | 4180.75 ms | 32.3% bf16 MFU | 124740 tok/s step 7414/19560 | loss 3.485375 (+0.44z)| norm 0.2490 (-0.58z)| lr 5.87e-04 | 4180.27 ms | 32.3% bf16 MFU | 124774 tok/s step 7415/19560 | loss 3.513895 (+1.07z)| norm 0.2284 (-1.71z)| lr 5.87e-04 | 4181.06 ms | 32.3% bf16 MFU | 124805 tok/s step 7416/19560 | loss 3.496806 (+0.68z)| norm 0.2719 (+0.72z)| lr 5.86e-04 | 4173.28 ms | 32.4% bf16 MFU | 124846 tok/s step 7417/19560 | loss 3.472896 (+0.15z)| norm 0.2218 (-2.03z)| lr 5.86e-04 | 4196.35 ms | 32.2% bf16 MFU | 124851 tok/s step 7418/19560 | loss 3.435093 (-0.70z)| norm 0.2576 (-0.05z)| lr 5.86e-04 | 4186.66 ms | 32.2% bf16 MFU | 124870 tok/s step 7419/19560 | loss 3.469567 (+0.07z)| norm 0.2613 (+0.16z)| lr 5.86e-04 | 4186.18 ms | 32.3% bf16 MFU | 124889 tok/s step 7420/19560 | loss 3.540514 (+1.62z)| norm 0.2423 (-0.88z)| lr 5.86e-04 | 4182.30 ms | 32.3% bf16 MFU | 124912 tok/s step 7421/19560 | loss 3.505510 
(+0.84z)| norm 0.2699 (+0.71z)| lr 5.86e-04 | 4176.03 ms | 32.3% bf16 MFU | 124944 tok/s step 7422/19560 | loss 3.462343 (-0.11z)| norm 0.2509 (-0.39z)| lr 5.86e-04 | 4172.33 ms | 32.4% bf16 MFU | 124980 tok/s step 7423/19560 | loss 3.424504 (-0.95z)| norm 0.2805 (+1.32z)| lr 5.86e-04 | 4177.60 ms | 32.3% bf16 MFU | 125006 tok/s step 7424/19560 | loss 3.474991 (+0.16z)| norm 0.2531 (-0.27z)| lr 5.86e-04 | 4181.24 ms | 32.3% bf16 MFU | 125025 tok/s step 7425/19560 | loss 3.514781 (+1.03z)| norm 0.2598 (+0.12z)| lr 5.86e-04 | 4179.83 ms | 32.3% bf16 MFU | 125045 tok/s step 7426/19560 | loss 3.465921 (-0.05z)| norm 0.2346 (-1.33z)| lr 5.86e-04 | 4179.25 ms | 32.3% bf16 MFU | 125065 tok/s step 7427/19560 | loss 3.427151 (-0.90z)| norm 0.2322 (-1.44z)| lr 5.86e-04 | 4183.06 ms | 32.3% bf16 MFU | 125079 tok/s step 7428/19560 | loss 3.479657 (+0.25z)| norm 0.2394 (-1.02z)| lr 5.86e-04 | 4186.21 ms | 32.3% bf16 MFU | 125087 tok/s step 7429/19560 | loss 3.538011 (+1.54z)| norm 0.2449 (-0.70z)| lr 5.86e-04 | 4181.79 ms | 32.3% bf16 MFU | 125101 tok/s step 7430/19560 | loss 3.606678 (+2.94z)| norm 0.2477 (-0.53z)| lr 5.86e-04 | 4213.75 ms | 32.0% bf16 MFU | 125068 tok/s step 7431/19560 | loss 3.455860 (-0.30z)| norm 0.2338 (-1.31z)| lr 5.86e-04 | 4178.39 ms | 32.3% bf16 MFU | 125088 tok/s step 7432/19560 | loss 3.453341 (-0.35z)| norm 0.2441 (-0.71z)| lr 5.86e-04 | 4169.77 ms | 32.4% bf16 MFU | 125120 tok/s step 7433/19560 | loss 3.457006 (-0.27z)| norm 0.2186 (-2.10z)| lr 5.85e-04 | 4181.15 ms | 32.3% bf16 MFU | 125134 tok/s step 7434/19560 | loss 3.472394 (+0.06z)| norm 0.2331 (-1.27z)| lr 5.85e-04 | 4169.99 ms | 32.4% bf16 MFU | 125164 tok/s step 7435/19560 | loss 3.463009 (-0.14z)| norm 0.2450 (-0.61z)| lr 5.85e-04 | 4178.50 ms | 32.3% bf16 MFU | 125179 tok/s step 7436/19560 | loss 3.513221 (+0.93z)| norm 0.2469 (-0.51z)| lr 5.85e-04 | 4199.71 ms | 32.1% bf16 MFU | 125162 tok/s step 7437/19560 | loss 3.462272 (-0.16z)| norm 0.2531 (-0.17z)| lr 5.85e-04 | 4183.02 ms | 
32.3% bf16 MFU | 125171 tok/s step 7438/19560 | loss 3.441982 (-0.60z)| norm 0.2421 (-0.78z)| lr 5.85e-04 | 4189.76 ms | 32.2% bf16 MFU | 125169 tok/s step 7439/19560 | loss 3.451562 (-0.40z)| norm 0.2535 (-0.15z)| lr 5.85e-04 | 4166.64 ms | 32.4% bf16 MFU | 125202 tok/s step 7440/19560 | loss 3.450285 (-0.42z)| norm 0.2608 (+0.24z)| lr 5.85e-04 | 4187.12 ms | 32.2% bf16 MFU | 125203 tok/s step 7441/19560 | loss 3.486400 (+0.35z)| norm 0.2663 (+0.55z)| lr 5.85e-04 | 4188.67 ms | 32.2% bf16 MFU | 125201 tok/s step 7442/19560 | loss 3.520750 (+1.07z)| norm 0.2562 (-0.01z)| lr 5.85e-04 | 4179.03 ms | 32.3% bf16 MFU | 125214 tok/s step 7443/19560 | loss 3.388204 (-1.76z)| norm 0.2520 (-0.24z)| lr 5.85e-04 | 4179.96 ms | 32.3% bf16 MFU | 125225 tok/s step 7444/19560 | loss 3.378408 (-1.93z)| norm 0.2472 (-0.51z)| lr 5.85e-04 | 4176.79 ms | 32.3% bf16 MFU | 125240 tok/s step 7445/19560 | loss 3.566469 (+1.99z)| norm 0.2405 (-0.87z)| lr 5.85e-04 | 4176.79 ms | 32.3% bf16 MFU | 125254 tok/s step 7446/19560 | loss 3.468071 (-0.06z)| norm 0.2820 (+1.42z)| lr 5.85e-04 | 4178.99 ms | 32.3% bf16 MFU | 125264 tok/s step 7447/19560 | loss 3.465183 (-0.13z)| norm 0.2383 (-0.99z)| lr 5.85e-04 | 4178.68 ms | 32.3% bf16 MFU | 125274 tok/s step 7448/19560 | loss 3.457705 (-0.28z)| norm 0.2728 (+0.90z)| lr 5.85e-04 | 4180.11 ms | 32.3% bf16 MFU | 125282 tok/s step 7449/19560 | loss 3.485296 (+0.29z)| norm 0.2509 (-0.28z)| lr 5.84e-04 | 4182.24 ms | 32.3% bf16 MFU | 125286 tok/s step 7450/19560 | loss 3.455234 (-0.34z)| norm 0.2653 (+0.52z)| lr 5.84e-04 | 4167.01 ms | 32.4% bf16 MFU | 125312 tok/s step 7451/19560 | loss 3.557530 (+1.79z)| norm 0.2458 (-0.56z)| lr 5.84e-04 | 4199.30 ms | 32.2% bf16 MFU | 125289 tok/s step 7452/19560 | loss 3.470556 (-0.03z)| norm 0.2784 (+1.26z)| lr 5.84e-04 | 4187.42 ms | 32.2% bf16 MFU | 125285 tok/s step 7453/19560 | loss 3.429517 (-0.90z)| norm 0.2372 (-1.04z)| lr 5.84e-04 | 4178.34 ms | 32.3% bf16 MFU | 125295 tok/s step 7454/19560 | loss 3.460998 
(-0.25z)| norm 0.2408 (-0.83z)| lr 5.84e-04 | 4169.64 ms | 32.4% bf16 MFU | 125317 tok/s step 7455/19560 | loss 3.432184 (-0.86z)| norm 0.3102 (+2.92z)| lr 5.84e-04 | 4168.21 ms | 32.4% bf16 MFU | 125340 tok/s step 7456/19560 | loss 3.451981 (-0.43z)| norm 0.2848 (+1.52z)| lr 5.84e-04 | 4177.76 ms | 32.3% bf16 MFU | 125348 tok/s step 7457/19560 | loss 3.511295 (+0.81z)| norm 0.2923 (+1.88z)| lr 5.84e-04 | 4172.87 ms | 32.4% bf16 MFU | 125363 tok/s step 7458/19560 | loss 3.441839 (-0.67z)| norm 0.2738 (+0.91z)| lr 5.84e-04 | 4268.63 ms | 31.6% bf16 MFU | 125236 tok/s step 7459/19560 | loss 3.467200 (-0.12z)| norm 0.2404 (-0.85z)| lr 5.84e-04 | 4171.74 ms | 32.4% bf16 MFU | 125258 tok/s step 7460/19560 | loss 3.395025 (-1.66z)| norm 0.2627 (+0.33z)| lr 5.84e-04 | 4175.12 ms | 32.3% bf16 MFU | 125274 tok/s step 7461/19560 | loss 3.448182 (-0.52z)| norm 0.2464 (-0.53z)| lr 5.84e-04 | 4172.34 ms | 32.4% bf16 MFU | 125293 tok/s step 7462/19560 | loss 3.499012 (+0.57z)| norm 0.2403 (-0.85z)| lr 5.84e-04 | 4181.56 ms | 32.3% bf16 MFU | 125297 tok/s step 7463/19560 | loss 3.461574 (-0.24z)| norm 0.2636 (+0.37z)| lr 5.84e-04 | 4178.06 ms | 32.3% bf16 MFU | 125307 tok/s step 7464/19560 | loss 3.485391 (+0.27z)| norm 0.2571 (+0.02z)| lr 5.84e-04 | 4182.97 ms | 32.3% bf16 MFU | 125308 tok/s step 7465/19560 | loss 3.478919 (+0.12z)| norm 0.2508 (-0.33z)| lr 5.84e-04 | 4173.65 ms | 32.3% bf16 MFU | 125324 tok/s step 7466/19560 | loss 3.454903 (-0.40z)| norm 0.2426 (-0.75z)| lr 5.83e-04 | 4172.05 ms | 32.4% bf16 MFU | 125341 tok/s step 7467/19560 | loss 3.443163 (-0.67z)| norm 0.2689 (+0.63z)| lr 5.83e-04 | 4180.03 ms | 32.3% bf16 MFU | 125345 tok/s step 7468/19560 | loss 3.526552 (+1.14z)| norm 0.2621 (+0.26z)| lr 5.83e-04 | 4168.80 ms | 32.4% bf16 MFU | 125366 tok/s step 7469/19560 | loss 3.474602 (-0.01z)| norm 0.2764 (+1.01z)| lr 5.83e-04 | 4199.30 ms | 32.2% bf16 MFU | 125340 tok/s step 7470/19560 | loss 3.568641 (+2.01z)| norm 0.2605 (+0.15z)| lr 5.83e-04 | 4173.80 ms | 
32.3% bf16 MFU | 125354 tok/s step 7471/19560 | loss 3.459080 (-0.38z)| norm 0.2537 (-0.22z)| lr 5.83e-04 | 4177.68 ms | 32.3% bf16 MFU | 125361 tok/s step 7472/19560 | loss 3.472366 (-0.09z)| norm 0.2636 (+0.31z)| lr 5.83e-04 | 4169.83 ms | 32.4% bf16 MFU | 125380 tok/s step 7473/19560 | loss 3.499949 (+0.51z)| norm 0.2542 (-0.21z)| lr 5.83e-04 | 4190.42 ms | 32.2% bf16 MFU | 125367 tok/s step 7474/19560 | loss 3.449452 (-0.60z)| norm 0.2617 (+0.19z)| lr 5.83e-04 | 4171.51 ms | 32.4% bf16 MFU | 125382 tok/s step 7475/19560 | loss 3.456280 (-0.47z)| norm 0.2583 (+0.00z)| lr 5.83e-04 | 4166.76 ms | 32.4% bf16 MFU | 125405 tok/s step 7476/19560 | loss 3.540183 (+1.40z)| norm 0.2922 (+1.85z)| lr 5.83e-04 | 4179.20 ms | 32.3% bf16 MFU | 125407 tok/s step 7477/19560 | loss 3.477875 (+0.00z)| norm 0.2553 (-0.19z)| lr 5.83e-04 | 4184.50 ms | 32.3% bf16 MFU | 125401 tok/s step 7478/19560 | loss 3.488067 (+0.23z)| norm 0.2761 (+0.96z)| lr 5.83e-04 | 4186.06 ms | 32.3% bf16 MFU | 125394 tok/s step 7479/19560 | loss 3.485820 (+0.17z)| norm 0.2604 (+0.08z)| lr 5.83e-04 | 4180.41 ms | 32.3% bf16 MFU | 125395 tok/s step 7480/19560 | loss 3.486351 (+0.17z)| norm 0.2648 (+0.32z)| lr 5.83e-04 | 4181.99 ms | 32.3% bf16 MFU | 125393 tok/s step 7481/19560 | loss 3.478710 (-0.00z)| norm 0.2499 (-0.52z)| lr 5.83e-04 | 4176.26 ms | 32.3% bf16 MFU | 125401 tok/s step 7482/19560 | loss 3.417434 (-1.37z)| norm 0.2648 (+0.33z)| lr 5.83e-04 | 4183.38 ms | 32.3% bf16 MFU | 125397 tok/s step 7483/19560 | loss 3.424129 (-1.20z)| norm 0.2378 (-1.18z)| lr 5.82e-04 | 4172.49 ms | 32.4% bf16 MFU | 125410 tok/s step 7484/19560 | loss 3.487439 (+0.24z)| norm 0.2583 (-0.00z)| lr 5.82e-04 | 4172.77 ms | 32.4% bf16 MFU | 125422 tok/s step 7485/19560 | loss 3.466812 (-0.24z)| norm 0.2431 (-0.87z)| lr 5.82e-04 | 4172.20 ms | 32.4% bf16 MFU | 125434 tok/s step 7486/19560 | loss 3.529149 (+1.16z)| norm 0.2589 (+0.06z)| lr 5.82e-04 | 4180.68 ms | 32.3% bf16 MFU | 125432 tok/s step 7487/19560 | loss 3.552842 
(+1.67z)| norm 0.2628 (+0.29z)| lr 5.82e-04 | 4189.69 ms | 32.2% bf16 MFU | 125418 tok/s step 7488/19560 | loss 3.494559 (+0.35z)| norm 0.2545 (-0.20z)| lr 5.82e-04 | 4178.90 ms | 32.3% bf16 MFU | 125420 tok/s step 7489/19560 | loss 3.444213 (-0.78z)| norm 0.2582 (+0.02z)| lr 5.82e-04 | 4163.68 ms | 32.4% bf16 MFU | 125445 tok/s step 7490/19560 | loss 3.516042 (+0.82z)| norm 0.2632 (+0.32z)| lr 5.82e-04 | 4170.12 ms | 32.4% bf16 MFU | 125459 tok/s step 7491/19560 | loss 3.458565 (-0.47z)| norm 0.2881 (+1.75z)| lr 5.82e-04 | 4166.77 ms | 32.4% bf16 MFU | 125477 tok/s step 7492/19560 | loss 3.481471 (+0.04z)| norm 0.2514 (-0.38z)| lr 5.82e-04 | 4179.59 ms | 32.3% bf16 MFU | 125475 tok/s step 7493/19560 | loss 3.429183 (-1.12z)| norm 0.2662 (+0.48z)| lr 5.82e-04 | 4665.27 ms | 28.9% bf16 MFU | 124820 tok/s step 7494/19560 | loss 3.364399 (-2.50z)| norm 0.2439 (-0.81z)| lr 5.82e-04 | 4175.70 ms | 32.3% bf16 MFU | 124857 tok/s step 7495/19560 | loss 3.499021 (+0.45z)| norm 0.2657 (+0.48z)| lr 5.82e-04 | 4179.79 ms | 32.3% bf16 MFU | 124886 tok/s step 7496/19560 | loss 3.408988 (-1.51z)| norm 0.2885 (+1.83z)| lr 5.82e-04 | 4189.36 ms | 32.2% bf16 MFU | 124899 tok/s step 7497/19560 | loss 3.454132 (-0.51z)| norm 0.2491 (-0.50z)| lr 5.82e-04 | 4189.28 ms | 32.2% bf16 MFU | 124912 tok/s step 7498/19560 | loss 3.482990 (+0.12z)| norm 0.2506 (-0.40z)| lr 5.82e-04 | 4174.99 ms | 32.3% bf16 MFU | 124945 tok/s step 7499/19560 | loss 3.483259 (+0.12z)| norm 0.2503 (-0.41z)| lr 5.81e-04 | 4186.69 ms | 32.2% bf16 MFU | 124959 tok/s step 7500/19560 | loss 3.452951 (-0.54z)| norm 0.2630 (+0.34z)| lr 5.81e-04 | 4185.06 ms | 32.3% bf16 MFU | 124975 tok/s val loss 3.450114 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating 
HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 
740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2885/10042 = 0.287293 step 7501/19560 | loss 3.661463 (+3.79z)| norm 0.3013 (+2.58z)| lr 5.81e-04 | 4333.28 ms | 31.2% bf16 MFU | 124776 tok/s step 7502/19560 | loss 3.550129 (+1.45z)| norm 0.2914 (+1.96z)| lr 5.81e-04 | 4339.19 ms | 31.1% bf16 MFU | 124578 tok/s step 7503/19560 | loss 3.434906 (-0.91z)| norm 0.2387 (-1.11z)| lr 5.81e-04 | 4163.04 ms | 32.4% bf16 MFU | 
124646 tok/s step 7504/19560 | loss 3.560745 (+1.75z)| norm 0.5412 (+9.29z)| lr 5.81e-04 | 4318.76 ms | 31.3% bf16 MFU | 124484 tok/s step 7505/19560 | loss 3.467278 (-0.24z)| norm 0.2681 (+0.28z)| lr 5.81e-04 | 4207.00 ms | 32.1% bf16 MFU | 124491 tok/s step 7506/19560 | loss 3.524592 (+0.98z)| norm 0.2839 (+0.79z)| lr 5.81e-04 | 4165.51 ms | 32.4% bf16 MFU | 124560 tok/s step 7507/19560 | loss 3.481322 (+0.06z)| norm 0.2906 (+1.00z)| lr 5.81e-04 | 4224.06 ms | 32.0% bf16 MFU | 124538 tok/s step 7508/19560 | loss 3.489713 (+0.22z)| norm 0.2791 (+0.62z)| lr 5.81e-04 | 4178.95 ms | 32.3% bf16 MFU | 124584 tok/s step 7509/19560 | loss 3.511662 (+0.69z)| norm 0.2698 (+0.30z)| lr 5.81e-04 | 4215.63 ms | 32.0% bf16 MFU | 124573 tok/s step 7510/19560 | loss 3.424441 (-1.16z)| norm 0.2690 (+0.27z)| lr 5.81e-04 | 4263.16 ms | 31.7% bf16 MFU | 124493 tok/s step 7511/19560 | loss 3.501250 (+0.48z)| norm 0.2594 (-0.05z)| lr 5.81e-04 | 4276.14 ms | 31.6% bf16 MFU | 124399 tok/s step 7512/19560 | loss 3.416457 (-1.31z)| norm 0.2471 (-0.46z)| lr 5.81e-04 | 4202.19 ms | 32.1% bf16 MFU | 124417 tok/s step 7513/19560 | loss 3.447490 (-0.64z)| norm 0.2626 (+0.05z)| lr 5.81e-04 | 4186.98 ms | 32.2% bf16 MFU | 124457 tok/s step 7514/19560 | loss 3.548666 (+1.47z)| norm 0.2392 (-0.72z)| lr 5.81e-04 | 4179.46 ms | 32.3% bf16 MFU | 124507 tok/s step 7515/19560 | loss 3.496666 (+0.38z)| norm 0.2466 (-0.48z)| lr 5.81e-04 | 4172.81 ms | 32.4% bf16 MFU | 124564 tok/s step 7516/19560 | loss 3.488878 (+0.22z)| norm 0.2461 (-0.49z)| lr 5.80e-04 | 4205.97 ms | 32.1% bf16 MFU | 124568 tok/s step 7517/19560 | loss 3.499871 (+0.46z)| norm 0.2455 (-0.51z)| lr 5.80e-04 | 4250.68 ms | 31.8% bf16 MFU | 124507 tok/s step 7518/19560 | loss 3.597365 (+2.51z)| norm 0.2570 (-0.13z)| lr 5.80e-04 | 4169.67 ms | 32.4% bf16 MFU | 124568 tok/s step 7519/19560 | loss 3.389892 (-1.83z)| norm 0.2537 (-0.23z)| lr 5.80e-04 | 4180.72 ms | 32.3% bf16 MFU | 124610 tok/s step 7520/19560 | loss 3.437789 (-0.82z)| norm 
0.2487 (-0.39z)| lr 5.80e-04 | 4171.81 ms | 32.4% bf16 MFU | 124663 tok/s step 7521/19560 | loss 3.493106 (+0.34z)| norm 0.2333 (-0.89z)| lr 5.80e-04 | 4189.23 ms | 32.2% bf16 MFU | 124688 tok/s step 7522/19560 | loss 3.451394 (-0.55z)| norm 0.2581 (-0.06z)| lr 5.80e-04 | 4173.76 ms | 32.3% bf16 MFU | 124734 tok/s step 7523/19560 | loss 3.429257 (-1.00z)| norm 0.2571 (-0.09z)| lr 5.80e-04 | 4165.77 ms | 32.4% bf16 MFU | 124790 tok/s step 7524/19560 | loss 3.518016 (+0.87z)| norm 0.2577 (-0.07z)| lr 5.80e-04 | 4172.00 ms | 32.4% bf16 MFU | 124834 tok/s step 7525/19560 | loss 3.469741 (-0.15z)| norm 0.2665 (+0.23z)| lr 5.80e-04 | 4181.02 ms | 32.3% bf16 MFU | 124862 tok/s step 7526/19560 | loss 3.481071 (+0.10z)| norm 0.2800 (+0.67z)| lr 5.80e-04 | 4178.25 ms | 32.3% bf16 MFU | 124893 tok/s step 7527/19560 | loss 3.578392 (+2.10z)| norm 0.2715 (+0.39z)| lr 5.80e-04 | 4181.53 ms | 32.3% bf16 MFU | 124918 tok/s step 7528/19560 | loss 3.537161 (+1.23z)| norm 0.2561 (-0.13z)| lr 5.80e-04 | 4183.08 ms | 32.3% bf16 MFU | 124939 tok/s step 7529/19560 | loss 3.481970 (+0.11z)| norm 0.3071 (+1.55z)| lr 5.80e-04 | 4177.00 ms | 32.3% bf16 MFU | 124967 tok/s step 7530/19560 | loss 3.471104 (-0.14z)| norm 0.2405 (-0.64z)| lr 5.80e-04 | 4182.79 ms | 32.3% bf16 MFU | 124986 tok/s step 7531/19560 | loss 3.478511 (+0.02z)| norm 0.2741 (+0.47z)| lr 5.80e-04 | 4172.66 ms | 32.4% bf16 MFU | 125019 tok/s step 7532/19560 | loss 3.516943 (+0.83z)| norm 0.2823 (+0.73z)| lr 5.79e-04 | 4185.09 ms | 32.3% bf16 MFU | 125032 tok/s step 7533/19560 | loss 3.470647 (-0.18z)| norm 0.2849 (+0.81z)| lr 5.79e-04 | 4173.58 ms | 32.4% bf16 MFU | 125062 tok/s step 7534/19560 | loss 3.421560 (-1.22z)| norm 0.2543 (-0.19z)| lr 5.79e-04 | 4181.49 ms | 32.3% bf16 MFU | 125078 tok/s step 7535/19560 | loss 3.484602 (+0.14z)| norm 0.2535 (-0.22z)| lr 5.79e-04 | 4168.51 ms | 32.4% bf16 MFU | 125112 tok/s step 7536/19560 | loss 3.444394 (-0.74z)| norm 0.2678 (+0.25z)| lr 5.79e-04 | 4195.09 ms | 32.2% bf16 MFU | 
125106 tok/s step 7537/19560 | loss 3.478728 (+0.01z)| norm 0.2371 (-0.77z)| lr 5.79e-04 | 4171.54 ms | 32.4% bf16 MFU | 125134 tok/s step 7538/19560 | loss 3.483995 (+0.11z)| norm 0.2634 (+0.10z)| lr 5.79e-04 | 4164.22 ms | 32.4% bf16 MFU | 125173 tok/s step 7539/19560 | loss 3.471888 (-0.14z)| norm 0.2658 (+0.18z)| lr 5.79e-04 | 4180.81 ms | 32.3% bf16 MFU | 125184 tok/s step 7540/19560 | loss 3.596285 (+2.49z)| norm 0.2450 (-0.52z)| lr 5.79e-04 | 4175.27 ms | 32.3% bf16 MFU | 125204 tok/s step 7541/19560 | loss 3.440076 (-0.84z)| norm 0.2597 (-0.03z)| lr 5.79e-04 | 4188.52 ms | 32.2% bf16 MFU | 125202 tok/s step 7542/19560 | loss 3.583025 (+2.16z)| norm 1.2762 (+10.68z)| lr 5.79e-04 | 4162.76 ms | 32.4% bf16 MFU | 125239 tok/s step 7543/19560 | loss 3.472181 (-0.16z)| norm 0.3441 (+0.79z)| lr 5.79e-04 | 4776.62 ms | 28.3% bf16 MFU | 124465 tok/s step 7544/19560 | loss 3.413066 (-1.37z)| norm 0.2851 (+0.16z)| lr 5.79e-04 | 4449.97 ms | 30.3% bf16 MFU | 124133 tok/s step 7545/19560 | loss 3.472836 (-0.13z)| norm 0.2850 (+0.16z)| lr 5.79e-04 | 4707.22 ms | 28.7% bf16 MFU | 123495 tok/s step 7546/19560 | loss 3.455300 (-0.50z)| norm 0.3130 (+0.45z)| lr 5.79e-04 | 4344.39 ms | 31.1% bf16 MFU | 123355 tok/s step 7547/19560 | loss 3.469707 (-0.20z)| norm 0.2791 (+0.09z)| lr 5.79e-04 | 4496.91 ms | 30.0% bf16 MFU | 123016 tok/s step 7548/19560 | loss 3.475909 (-0.06z)| norm 0.2681 (-0.03z)| lr 5.79e-04 | 4260.88 ms | 31.7% bf16 MFU | 123018 tok/s step 7549/19560 | loss 3.485208 (+0.14z)| norm 0.2622 (-0.09z)| lr 5.78e-04 | 4280.55 ms | 31.5% bf16 MFU | 122991 tok/s step 7550/19560 | loss 3.491361 (+0.27z)| norm 0.2845 (+0.14z)| lr 5.78e-04 | 4247.40 ms | 31.8% bf16 MFU | 123013 tok/s step 7551/19560 | loss 3.540738 (+1.29z)| norm 0.2674 (-0.04z)| lr 5.78e-04 | 4236.42 ms | 31.9% bf16 MFU | 123051 tok/s step 7552/19560 | loss 3.467565 (-0.25z)| norm 0.2693 (-0.02z)| lr 5.78e-04 | 4410.05 ms | 30.6% bf16 MFU | 122842 tok/s step 7553/19560 | loss 3.460691 (-0.39z)| norm 
0.2468 (-0.26z)| lr 5.78e-04 | 4334.00 ms | 31.2% bf16 MFU | 122749 tok/s step 7554/19560 | loss 3.453146 (-0.54z)| norm 0.2609 (-0.11z)| lr 5.78e-04 | 4276.39 ms | 31.6% bf16 MFU | 122741 tok/s step 7555/19560 | loss 3.483818 (+0.09z)| norm 0.2718 (+0.00z)| lr 5.78e-04 | 4322.36 ms | 31.2% bf16 MFU | 122669 tok/s step 7556/19560 | loss 3.451977 (-0.58z)| norm 0.2388 (-0.35z)| lr 5.78e-04 | 4167.52 ms | 32.4% bf16 MFU | 122826 tok/s step 7557/19560 | loss 3.437649 (-0.86z)| norm 0.2520 (-0.21z)| lr 5.78e-04 | 4347.42 ms | 31.1% bf16 MFU | 122714 tok/s step 7558/19560 | loss 3.407966 (-1.49z)| norm 0.2490 (-0.24z)| lr 5.78e-04 | 4286.51 ms | 31.5% bf16 MFU | 122694 tok/s step 7559/19560 | loss 3.478430 (+0.03z)| norm 0.2749 (+0.03z)| lr 5.78e-04 | 4171.78 ms | 32.4% bf16 MFU | 122843 tok/s step 7560/19560 | loss 3.485641 (+0.18z)| norm 0.2653 (-0.07z)| lr 5.78e-04 | 4178.95 ms | 32.3% bf16 MFU | 122974 tok/s step 7561/19560 | loss 3.455951 (-0.46z)| norm 0.2477 (-0.26z)| lr 5.78e-04 | 4393.44 ms | 30.7% bf16 MFU | 122792 tok/s step 7562/19560 | loss 3.442203 (-0.75z)| norm 0.2586 (-0.15z)| lr 5.78e-04 | 4203.22 ms | 32.1% bf16 MFU | 122889 tok/s step 7563/19560 | loss 3.489033 (+0.25z)| norm 0.2547 (-0.19z)| lr 5.78e-04 | 4166.84 ms | 32.4% bf16 MFU | 123036 tok/s step 7564/19560 | loss 3.475667 (-0.03z)| norm 0.2596 (-0.14z)| lr 5.78e-04 | 4180.53 ms | 32.3% bf16 MFU | 123155 tok/s step 7565/19560 | loss 3.466712 (-0.22z)| norm 0.2549 (-0.19z)| lr 5.77e-04 | 4170.57 ms | 32.4% bf16 MFU | 123283 tok/s step 7566/19560 | loss 3.469499 (-0.17z)| norm 0.2513 (-0.23z)| lr 5.77e-04 | 4189.65 ms | 32.2% bf16 MFU | 123375 tok/s step 7567/19560 | loss 3.472843 (-0.10z)| norm 0.2786 (+0.06z)| lr 5.77e-04 | 4184.88 ms | 32.3% bf16 MFU | 123471 tok/s step 7568/19560 | loss 3.469633 (-0.17z)| norm 0.2786 (+0.06z)| lr 5.77e-04 | 4172.26 ms | 32.4% bf16 MFU | 123580 tok/s step 7569/19560 | loss 3.497627 (+0.43z)| norm 0.2509 (-0.23z)| lr 5.77e-04 | 4178.82 ms | 32.3% bf16 MFU | 
123674 tok/s step 7570/19560 | loss 3.463437 (-0.30z)| norm 0.2576 (-0.16z)| lr 5.77e-04 | 4179.21 ms | 32.3% bf16 MFU | 123763 tok/s step 7571/19560 | loss 3.441230 (-0.80z)| norm 0.2744 (+0.01z)| lr 5.77e-04 | 4169.83 ms | 32.4% bf16 MFU | 123862 tok/s step 7572/19560 | loss 3.511548 (+0.74z)| norm 0.2835 (+0.11z)| lr 5.77e-04 | 4167.00 ms | 32.4% bf16 MFU | 123960 tok/s step 7573/19560 | loss 3.496687 (+0.42z)| norm 0.2458 (-0.29z)| lr 5.77e-04 | 4179.84 ms | 32.3% bf16 MFU | 124033 tok/s step 7574/19560 | loss 3.446558 (-0.72z)| norm 0.2886 (+0.16z)| lr 5.77e-04 | 4201.87 ms | 32.1% bf16 MFU | 124070 tok/s step 7575/19560 | loss 3.481858 (+0.09z)| norm 0.2498 (-0.25z)| lr 5.77e-04 | 4175.81 ms | 32.3% bf16 MFU | 124144 tok/s step 7576/19560 | loss 3.429378 (-1.10z)| norm 0.2655 (-0.09z)| lr 5.77e-04 | 4174.68 ms | 32.3% bf16 MFU | 124217 tok/s step 7577/19560 | loss 3.429404 (-1.09z)| norm 0.2549 (-0.20z)| lr 5.77e-04 | 4179.83 ms | 32.3% bf16 MFU | 124277 tok/s step 7578/19560 | loss 3.446795 (-0.69z)| norm 0.2635 (-0.11z)| lr 5.77e-04 | 4180.49 ms | 32.3% bf16 MFU | 124334 tok/s step 7579/19560 | loss 3.482151 (+0.12z)| norm 0.2534 (-0.21z)| lr 5.77e-04 | 4197.29 ms | 32.2% bf16 MFU | 124363 tok/s step 7580/19560 | loss 3.448998 (-0.63z)| norm 0.2443 (-0.31z)| lr 5.77e-04 | 4184.55 ms | 32.3% bf16 MFU | 124409 tok/s step 7581/19560 | loss 3.529164 (+1.18z)| norm 0.2402 (-0.35z)| lr 5.77e-04 | 4187.74 ms | 32.2% bf16 MFU | 124449 tok/s step 7582/19560 | loss 3.461927 (-0.35z)| norm 0.2568 (-0.18z)| lr 5.76e-04 | 4170.66 ms | 32.4% bf16 MFU | 124512 tok/s step 7583/19560 | loss 3.506407 (+0.65z)| norm 0.2592 (-0.15z)| lr 5.76e-04 | 4187.58 ms | 32.2% bf16 MFU | 124546 tok/s step 7584/19560 | loss 3.435194 (-0.97z)| norm 0.2435 (-0.31z)| lr 5.76e-04 | 4180.01 ms | 32.3% bf16 MFU | 124590 tok/s step 7585/19560 | loss 3.458180 (-0.44z)| norm 0.2287 (-0.46z)| lr 5.76e-04 | 4179.59 ms | 32.3% bf16 MFU | 124633 tok/s step 7586/19560 | loss 3.438184 (-0.90z)| norm 
0.2492 (-0.24z)| lr 5.76e-04 | 4181.96 ms | 32.3% bf16 MFU | 124670 tok/s step 7587/19560 | loss 3.445729 (-0.72z)| norm 0.2448 (-0.29z)| lr 5.76e-04 | 4171.42 ms | 32.4% bf16 MFU | 124720 tok/s step 7588/19560 | loss 3.467259 (-0.24z)| norm 0.2427 (-0.31z)| lr 5.76e-04 | 4180.51 ms | 32.3% bf16 MFU | 124755 tok/s step 7589/19560 | loss 3.441113 (-0.85z)| norm 0.2389 (-0.35z)| lr 5.76e-04 | 4391.47 ms | 30.7% bf16 MFU | 124487 tok/s step 7590/19560 | loss 3.465144 (-0.29z)| norm 0.2468 (-0.27z)| lr 5.76e-04 | 4176.65 ms | 32.3% bf16 MFU | 124539 tok/s step 7591/19560 | loss 3.452159 (-0.58z)| norm 0.2411 (-0.33z)| lr 5.76e-04 | 4183.13 ms | 32.3% bf16 MFU | 124578 tok/s step 7592/19560 | loss 3.498056 (+0.47z)| norm 0.2387 (-0.35z)| lr 5.76e-04 | 4171.53 ms | 32.4% bf16 MFU | 124634 tok/s step 7593/19560 | loss 3.455636 (-0.50z)| norm 0.2653 (-0.07z)| lr 5.76e-04 | 4175.69 ms | 32.3% bf16 MFU | 124680 tok/s step 7594/19560 | loss 3.435477 (-0.96z)| norm 0.2171 (-0.58z)| lr 5.76e-04 | 4186.89 ms | 32.2% bf16 MFU | 124707 tok/s step 7595/19560 | loss 3.415059 (-1.42z)| norm 0.2643 (-0.08z)| lr 5.76e-04 | 4172.43 ms | 32.4% bf16 MFU | 124754 tok/s step 7596/19560 | loss 3.452781 (-0.54z)| norm 0.2454 (-0.28z)| lr 5.76e-04 | 4163.19 ms | 32.4% bf16 MFU | 124813 tok/s step 7597/19560 | loss 3.502254 (+0.59z)| norm 0.2334 (-0.40z)| lr 5.76e-04 | 4170.78 ms | 32.4% bf16 MFU | 124858 tok/s step 7598/19560 | loss 3.485915 (+0.23z)| norm 0.2412 (-0.31z)| lr 5.75e-04 | 4172.92 ms | 32.4% bf16 MFU | 124897 tok/s step 7599/19560 | loss 3.468372 (-0.18z)| norm 0.2477 (-0.25z)| lr 5.75e-04 | 4167.77 ms | 32.4% bf16 MFU | 124942 tok/s step 7600/19560 | loss 3.458564 (-0.40z)| norm 0.2597 (-0.12z)| lr 5.75e-04 | 4181.62 ms | 32.3% bf16 MFU | 124964 tok/s step 7601/19560 | loss 3.416772 (-1.36z)| norm 0.2328 (-0.40z)| lr 5.75e-04 | 4206.46 ms | 32.1% bf16 MFU | 124948 tok/s step 7602/19560 | loss 3.456157 (-0.44z)| norm 0.2404 (-0.32z)| lr 5.75e-04 | 4185.69 ms | 32.3% bf16 MFU | 
124963 tok/s step 7603/19560 | loss 3.486242 (+0.25z)| norm 0.2418 (-0.30z)| lr 5.75e-04 | 4178.24 ms | 32.3% bf16 MFU | 124989 tok/s step 7604/19560 | loss 3.547350 (+1.66z)| norm 0.2385 (-0.33z)| lr 5.75e-04 | 4167.22 ms | 32.4% bf16 MFU | 125030 tok/s step 7605/19560 | loss 3.493613 (+0.41z)| norm 0.2490 (-0.22z)| lr 5.75e-04 | 4202.24 ms | 32.1% bf16 MFU | 125017 tok/s step 7606/19560 | loss 3.395580 (-1.82z)| norm 0.2786 (+0.09z)| lr 5.75e-04 | 4180.10 ms | 32.3% bf16 MFU | 125037 tok/s step 7607/19560 | loss 3.466679 (-0.19z)| norm 0.2580 (-0.13z)| lr 5.75e-04 | 4210.51 ms | 32.1% bf16 MFU | 125011 tok/s step 7608/19560 | loss 3.480898 (+0.14z)| norm 0.2471 (-0.24z)| lr 5.75e-04 | 4172.16 ms | 32.4% bf16 MFU | 125044 tok/s step 7609/19560 | loss 3.518960 (+1.00z)| norm 0.2563 (-0.14z)| lr 5.75e-04 | 4182.23 ms | 32.3% bf16 MFU | 125060 tok/s step 7610/19560 | loss 3.468534 (-0.16z)| norm 0.2706 (+0.01z)| lr 5.75e-04 | 4171.15 ms | 32.4% bf16 MFU | 125091 tok/s step 7611/19560 | loss 3.501211 (+0.58z)| norm 0.2652 (-0.05z)| lr 5.75e-04 | 4172.63 ms | 32.4% bf16 MFU | 125119 tok/s step 7612/19560 | loss 3.455804 (-0.46z)| norm 0.2678 (-0.02z)| lr 5.75e-04 | 4186.64 ms | 32.2% bf16 MFU | 125125 tok/s step 7613/19560 | loss 3.496330 (+0.47z)| norm 0.2398 (-0.32z)| lr 5.75e-04 | 4189.82 ms | 32.2% bf16 MFU | 125125 tok/s step 7614/19560 | loss 3.450200 (-0.58z)| norm 0.2725 (+0.02z)| lr 5.75e-04 | 4163.76 ms | 32.4% bf16 MFU | 125165 tok/s step 7615/19560 | loss 3.449020 (-0.60z)| norm 0.2372 (-0.35z)| lr 5.74e-04 | 4180.07 ms | 32.3% bf16 MFU | 125178 tok/s step 7616/19560 | loss 3.458880 (-0.36z)| norm 0.2366 (-0.35z)| lr 5.74e-04 | 4167.36 ms | 32.4% bf16 MFU | 125209 tok/s step 7617/19560 | loss 3.438504 (-0.84z)| norm 0.2376 (-0.34z)| lr 5.74e-04 | 4174.98 ms | 32.3% bf16 MFU | 125228 tok/s step 7618/19560 | loss 3.481809 (+0.18z)| norm 0.2462 (-0.25z)| lr 5.74e-04 | 4180.09 ms | 32.3% bf16 MFU | 125238 tok/s step 7619/19560 | loss 3.484216 (+0.23z)| norm 
0.2632 (-0.07z)| lr 5.74e-04 | 4166.05 ms | 32.4% bf16 MFU | 125268 tok/s step 7620/19560 | loss 3.473365 (-0.02z)| norm 0.2614 (-0.08z)| lr 5.74e-04 | 4178.77 ms | 32.3% bf16 MFU | 125278 tok/s step 7621/19560 | loss 3.485882 (+0.26z)| norm 0.2434 (-0.27z)| lr 5.74e-04 | 4167.98 ms | 32.4% bf16 MFU | 125304 tok/s step 7622/19560 | loss 3.498853 (+0.56z)| norm 0.2739 (+0.05z)| lr 5.74e-04 | 4178.58 ms | 32.3% bf16 MFU | 125312 tok/s step 7623/19560 | loss 3.426886 (-1.16z)| norm 0.2891 (+0.21z)| lr 5.74e-04 | 4219.90 ms | 32.0% bf16 MFU | 125258 tok/s step 7624/19560 | loss 3.411638 (-1.53z)| norm 0.2438 (-0.27z)| lr 5.74e-04 | 4175.09 ms | 32.3% bf16 MFU | 125274 tok/s step 7625/19560 | loss 3.465354 (-0.24z)| norm 0.2623 (-0.08z)| lr 5.74e-04 | 4160.64 ms | 32.5% bf16 MFU | 125311 tok/s step 7626/19560 | loss 3.543529 (+1.62z)| norm 0.2447 (-0.26z)| lr 5.74e-04 | 4170.93 ms | 32.4% bf16 MFU | 125331 tok/s step 7627/19560 | loss 3.470874 (-0.11z)| norm 0.2556 (-0.15z)| lr 5.74e-04 | 4182.55 ms | 32.3% bf16 MFU | 125332 tok/s step 7628/19560 | loss 3.464760 (-0.26z)| norm 0.2547 (-0.16z)| lr 5.74e-04 | 4180.55 ms | 32.3% bf16 MFU | 125336 tok/s step 7629/19560 | loss 3.545566 (+1.81z)| norm 0.2434 (-0.27z)| lr 5.74e-04 | 4183.12 ms | 32.3% bf16 MFU | 125336 tok/s step 7630/19560 | loss 3.517678 (+1.12z)| norm 0.2600 (-0.09z)| lr 5.74e-04 | 4163.62 ms | 32.4% bf16 MFU | 125365 tok/s step 7631/19560 | loss 3.425271 (-1.27z)| norm 0.2683 (-0.01z)| lr 5.73e-04 | 4182.57 ms | 32.3% bf16 MFU | 125364 tok/s step 7632/19560 | loss 3.422533 (-1.33z)| norm 0.2413 (-0.28z)| lr 5.73e-04 | 4184.93 ms | 32.3% bf16 MFU | 125360 tok/s step 7633/19560 | loss 3.434968 (-1.00z)| norm 0.2591 (-0.08z)| lr 5.73e-04 | 4168.47 ms | 32.4% bf16 MFU | 125381 tok/s step 7634/19560 | loss 3.368572 (-2.64z)| norm 0.2601 (-0.07z)| lr 5.73e-04 | 4172.07 ms | 32.4% bf16 MFU | 125395 tok/s step 7635/19560 | loss 3.489875 (+0.46z)| norm 0.2832 (+0.19z)| lr 5.73e-04 | 4196.53 ms | 32.2% bf16 MFU | 
125372 tok/s step 7636/19560 | loss 3.450644 (-0.54z)| norm 0.2563 (-0.11z)| lr 5.73e-04 | 4181.88 ms | 32.3% bf16 MFU | 125372 tok/s step 7637/19560 | loss 3.493535 (+0.56z)| norm 0.2530 (-0.14z)| lr 5.73e-04 | 4261.98 ms | 31.7% bf16 MFU | 125254 tok/s step 7638/19560 | loss 3.433832 (-0.97z)| norm 0.2549 (-0.12z)| lr 5.73e-04 | 4167.10 ms | 32.4% bf16 MFU | 125282 tok/s step 7639/19560 | loss 3.505775 (+0.87z)| norm 0.2569 (-0.10z)| lr 5.73e-04 | 4177.81 ms | 32.3% bf16 MFU | 125293 tok/s step 7640/19560 | loss 3.479228 (+0.18z)| norm 0.2640 (-0.02z)| lr 5.73e-04 | 4182.08 ms | 32.3% bf16 MFU | 125296 tok/s step 7641/19560 | loss 3.413723 (-1.49z)| norm 0.2491 (-0.18z)| lr 5.73e-04 | 4187.74 ms | 32.2% bf16 MFU | 125291 tok/s step 7642/19560 | loss 3.496537 (+0.65z)| norm 0.2369 (-0.32z)| lr 5.73e-04 | 4174.68 ms | 32.3% bf16 MFU | 125306 tok/s step 7643/19560 | loss 3.438768 (-0.84z)| norm 0.2498 (-0.18z)| lr 5.73e-04 | 4168.70 ms | 32.4% bf16 MFU | 125329 tok/s step 7644/19560 | loss 3.457300 (-0.35z)| norm 0.2419 (-0.26z)| lr 5.73e-04 | 4179.07 ms | 32.3% bf16 MFU | 125335 tok/s step 7645/19560 | loss 3.364365 (-2.67z)| norm 0.2487 (-0.19z)| lr 5.73e-04 | 4205.06 ms | 32.1% bf16 MFU | 125303 tok/s step 7646/19560 | loss 3.439729 (-0.76z)| norm 0.2483 (-0.19z)| lr 5.73e-04 | 4182.26 ms | 32.3% bf16 MFU | 125306 tok/s step 7647/19560 | loss 3.379075 (-2.34z)| norm 0.2395 (-0.29z)| lr 5.72e-04 | 4172.29 ms | 32.4% bf16 MFU | 125323 tok/s step 7648/19560 | loss 3.493816 (+0.65z)| norm 0.2492 (-0.18z)| lr 5.72e-04 | 4184.54 ms | 32.3% bf16 MFU | 125322 tok/s step 7649/19560 | loss 3.455663 (-0.34z)| norm 0.2429 (-0.25z)| lr 5.72e-04 | 4177.44 ms | 32.3% bf16 MFU | 125331 tok/s step 7650/19560 | loss 3.434259 (-0.90z)| norm 0.2770 (+0.12z)| lr 5.72e-04 | 4165.50 ms | 32.4% bf16 MFU | 125358 tok/s step 7651/19560 | loss 3.459198 (-0.25z)| norm 0.2438 (-0.24z)| lr 5.72e-04 | 4164.29 ms | 32.4% bf16 MFU | 125385 tok/s step 7652/19560 | loss 3.393577 (-1.93z)| norm 
0.2470 (-0.20z)| lr 5.72e-04 | 4180.29 ms | 32.3% bf16 MFU | 125386 tok/s step 7653/19560 | loss 3.564150 (+2.44z)| norm 0.2381 (-0.30z)| lr 5.72e-04 | 4172.14 ms | 32.4% bf16 MFU | 125400 tok/s step 7654/19560 | loss 3.514402 (+1.16z)| norm 0.2730 (+0.08z)| lr 5.72e-04 | 4163.36 ms | 32.4% bf16 MFU | 125427 tok/s step 7655/19560 | loss 3.405595 (-1.60z)| norm 0.2671 (+0.02z)| lr 5.72e-04 | 4179.70 ms | 32.3% bf16 MFU | 125427 tok/s step 7656/19560 | loss 3.445068 (-0.56z)| norm 0.2402 (-0.27z)| lr 5.72e-04 | 4165.89 ms | 32.4% bf16 MFU | 125448 tok/s step 7657/19560 | loss 3.535291 (+1.77z)| norm 0.2784 (+0.15z)| lr 5.72e-04 | 4168.82 ms | 32.4% bf16 MFU | 125464 tok/s step 7658/19560 | loss 3.445742 (-0.55z)| norm 0.2889 (+0.26z)| lr 5.72e-04 | 4179.54 ms | 32.3% bf16 MFU | 125463 tok/s step 7659/19560 | loss 3.448611 (-0.47z)| norm 0.2262 (-0.42z)| lr 5.72e-04 | 4174.21 ms | 32.3% bf16 MFU | 125470 tok/s step 7660/19560 | loss 3.538626 (+1.84z)| norm 0.2694 (+0.05z)| lr 5.72e-04 | 4177.98 ms | 32.3% bf16 MFU | 125471 tok/s step 7661/19560 | loss 3.439815 (-0.68z)| norm 0.2516 (-0.14z)| lr 5.72e-04 | 4182.21 ms | 32.3% bf16 MFU | 125466 tok/s step 7662/19560 | loss 3.490738 (+0.61z)| norm 0.2445 (-0.22z)| lr 5.72e-04 | 4178.38 ms | 32.3% bf16 MFU | 125466 tok/s step 7663/19560 | loss 3.537599 (+1.78z)| norm 0.2750 (+0.11z)| lr 5.72e-04 | 4158.36 ms | 32.5% bf16 MFU | 125497 tok/s step 7664/19560 | loss 3.436857 (-0.78z)| norm 0.2520 (-0.14z)| lr 5.71e-04 | 4174.60 ms | 32.3% bf16 MFU | 125501 tok/s step 7665/19560 | loss 3.453230 (-0.36z)| norm 0.3023 (+0.41z)| lr 5.71e-04 | 4172.17 ms | 32.4% bf16 MFU | 125510 tok/s step 7666/19560 | loss 3.455265 (-0.30z)| norm 0.2836 (+0.20z)| lr 5.71e-04 | 4185.14 ms | 32.3% bf16 MFU | 125498 tok/s step 7667/19560 | loss 3.387522 (-1.97z)| norm 0.2519 (-0.14z)| lr 5.71e-04 | 4186.18 ms | 32.3% bf16 MFU | 125485 tok/s step 7668/19560 | loss 3.407096 (-1.50z)| norm 0.2737 (+0.09z)| lr 5.71e-04 | 4163.02 ms | 32.4% bf16 MFU | 
125508 tok/s step 7669/19560 | loss 3.439666 (-0.65z)| norm 0.2748 (+0.10z)| lr 5.71e-04 | 4178.74 ms | 32.3% bf16 MFU | 125506 tok/s step 7670/19560 | loss 3.439152 (-0.66z)| norm 0.2606 (+0.17z)| lr 5.71e-04 | 4191.44 ms | 32.2% bf16 MFU | 125485 tok/s step 7671/19560 | loss 3.495195 (+0.84z)| norm 0.2339 (-1.40z)| lr 5.71e-04 | 4170.97 ms | 32.4% bf16 MFU | 125495 tok/s step 7672/19560 | loss 3.482293 (+0.48z)| norm 0.2741 (+1.07z)| lr 5.71e-04 | 4185.19 ms | 32.3% bf16 MFU | 125484 tok/s step 7673/19560 | loss 3.450061 (-0.38z)| norm 0.2510 (-0.34z)| lr 5.71e-04 | 4174.30 ms | 32.3% bf16 MFU | 125490 tok/s step 7674/19560 | loss 3.414836 (-1.32z)| norm 0.2673 (+0.74z)| lr 5.71e-04 | 4186.37 ms | 32.3% bf16 MFU | 125477 tok/s step 7675/19560 | loss 3.389034 (-1.96z)| norm 0.2457 (-0.66z)| lr 5.71e-04 | 4181.12 ms | 32.3% bf16 MFU | 125473 tok/s step 7676/19560 | loss 3.443724 (-0.51z)| norm 0.2470 (-0.57z)| lr 5.71e-04 | 4178.74 ms | 32.3% bf16 MFU | 125473 tok/s step 7677/19560 | loss 3.430757 (-0.84z)| norm 0.2334 (-1.44z)| lr 5.71e-04 | 4172.01 ms | 32.4% bf16 MFU | 125482 tok/s step 7678/19560 | loss 3.441489 (-0.55z)| norm 0.2306 (-1.61z)| lr 5.71e-04 | 4175.59 ms | 32.3% bf16 MFU | 125486 tok/s step 7679/19560 | loss 3.451924 (-0.26z)| norm 0.2529 (-0.13z)| lr 5.71e-04 | 4169.32 ms | 32.4% bf16 MFU | 125499 tok/s step 7680/19560 | loss 3.547835 (+2.25z)| norm 0.2378 (-1.11z)| lr 5.70e-04 | 4178.31 ms | 32.3% bf16 MFU | 125498 tok/s step 7681/19560 | loss 3.457036 (-0.13z)| norm 0.2496 (-0.33z)| lr 5.70e-04 | 4189.63 ms | 32.2% bf16 MFU | 125480 tok/s step 7682/19560 | loss 3.476939 (+0.38z)| norm 0.2416 (-0.85z)| lr 5.70e-04 | 4169.72 ms | 32.4% bf16 MFU | 125493 tok/s step 7683/19560 | loss 3.453149 (-0.24z)| norm 0.2505 (-0.25z)| lr 5.70e-04 | 4182.87 ms | 32.3% bf16 MFU | 125486 tok/s step 7684/19560 | loss 3.506522 (+1.15z)| norm 0.2689 (+0.95z)| lr 5.70e-04 | 4157.95 ms | 32.5% bf16 MFU | 125516 tok/s step 7685/19560 | loss 3.440714 (-0.57z)| norm 
0.2584 (+0.26z)| lr 5.70e-04 | 4185.01 ms | 32.3% bf16 MFU | 125504 tok/s step 7686/19560 | loss 3.511289 (+1.25z)| norm 0.2655 (+0.72z)| lr 5.70e-04 | 4176.66 ms | 32.3% bf16 MFU | 125505 tok/s step 7687/19560 | loss 3.478274 (+0.39z)| norm 0.2422 (-0.82z)| lr 5.70e-04 | 4171.60 ms | 32.4% bf16 MFU | 125514 tok/s step 7688/19560 | loss 3.436919 (-0.68z)| norm 0.2704 (+1.05z)| lr 5.70e-04 | 4185.16 ms | 32.3% bf16 MFU | 125502 tok/s step 7689/19560 | loss 3.470762 (+0.20z)| norm 0.2741 (+1.28z)| lr 5.70e-04 | 4182.20 ms | 32.3% bf16 MFU | 125495 tok/s step 7690/19560 | loss 3.468886 (+0.15z)| norm 0.2596 (+0.32z)| lr 5.70e-04 | 4168.90 ms | 32.4% bf16 MFU | 125508 tok/s step 7691/19560 | loss 3.510933 (+1.24z)| norm 0.2483 (-0.42z)| lr 5.70e-04 | 4214.87 ms | 32.0% bf16 MFU | 125452 tok/s step 7692/19560 | loss 3.471529 (+0.21z)| norm 0.2661 (+0.75z)| lr 5.70e-04 | 4180.54 ms | 32.3% bf16 MFU | 125450 tok/s step 7693/19560 | loss 3.432083 (-0.81z)| norm 0.2831 (+1.83z)| lr 5.70e-04 | 4178.96 ms | 32.3% bf16 MFU | 125451 tok/s step 7694/19560 | loss 3.448548 (-0.38z)| norm 0.2338 (-1.35z)| lr 5.70e-04 | 4183.09 ms | 32.3% bf16 MFU | 125445 tok/s step 7695/19560 | loss 3.498889 (+0.93z)| norm 0.2866 (+2.03z)| lr 5.70e-04 | 4172.96 ms | 32.4% bf16 MFU | 125455 tok/s step 7696/19560 | loss 3.491704 (+0.73z)| norm 0.2669 (+0.78z)| lr 5.69e-04 | 4177.15 ms | 32.3% bf16 MFU | 125458 tok/s step 7697/19560 | loss 3.495440 (+0.83z)| norm 0.2803 (+1.61z)| lr 5.69e-04 | 4162.80 ms | 32.4% bf16 MFU | 125482 tok/s step 7698/19560 | loss 3.448849 (-0.37z)| norm 0.2266 (-1.77z)| lr 5.69e-04 | 4179.97 ms | 32.3% bf16 MFU | 125479 tok/s step 7699/19560 | loss 3.500092 (+0.94z)| norm 0.2907 (+2.22z)| lr 5.69e-04 | 4179.17 ms | 32.3% bf16 MFU | 125478 tok/s step 7700/19560 | loss 3.475714 (+0.32z)| norm 0.2452 (-0.59z)| lr 5.69e-04 | 4227.05 ms | 31.9% bf16 MFU | 125406 tok/s step 7701/19560 | loss 3.579977 (+2.92z)| norm 0.2845 (+1.84z)| lr 5.69e-04 | 4166.77 ms | 32.4% bf16 MFU | 
125427 tok/s step 7702/19560 | loss 3.493340 (+0.73z)| norm 0.3144 (+3.55z)| lr 5.69e-04 | 4167.57 ms | 32.4% bf16 MFU | 125446 tok/s step 7703/19560 | loss 3.399269 (-1.61z)| norm 0.2546 (-0.03z)| lr 5.69e-04 | 4329.34 ms | 31.2% bf16 MFU | 125228 tok/s step 7704/19560 | loss 3.523740 (+1.46z)| norm 0.2777 (+1.34z)| lr 5.69e-04 | 4176.50 ms | 32.3% bf16 MFU | 125244 tok/s step 7705/19560 | loss 3.491987 (+0.67z)| norm 0.2577 (+0.15z)| lr 5.69e-04 | 4163.47 ms | 32.4% bf16 MFU | 125278 tok/s step 7706/19560 | loss 3.470814 (+0.14z)| norm 0.2669 (+0.69z)| lr 5.69e-04 | 4169.42 ms | 32.4% bf16 MFU | 125301 tok/s step 7707/19560 | loss 3.449822 (-0.38z)| norm 0.2568 (+0.09z)| lr 5.69e-04 | 4172.03 ms | 32.4% bf16 MFU | 125319 tok/s step 7708/19560 | loss 3.485295 (+0.50z)| norm 0.2558 (+0.02z)| lr 5.69e-04 | 4220.83 ms | 32.0% bf16 MFU | 125264 tok/s step 7709/19560 | loss 3.483105 (+0.46z)| norm 0.2725 (+1.00z)| lr 5.69e-04 | 4161.67 ms | 32.4% bf16 MFU | 125300 tok/s step 7710/19560 | loss 3.482681 (+0.44z)| norm 0.2497 (-0.35z)| lr 5.69e-04 | 4157.97 ms | 32.5% bf16 MFU | 125340 tok/s step 7711/19560 | loss 3.447804 (-0.42z)| norm 0.2544 (-0.07z)| lr 5.69e-04 | 4171.12 ms | 32.4% bf16 MFU | 125357 tok/s step 7712/19560 | loss 3.484761 (+0.50z)| norm 0.2508 (-0.28z)| lr 5.69e-04 | 4170.46 ms | 32.4% bf16 MFU | 125375 tok/s step 7713/19560 | loss 3.474273 (+0.23z)| norm 0.2676 (+0.70z)| lr 5.68e-04 | 4167.34 ms | 32.4% bf16 MFU | 125397 tok/s step 7714/19560 | loss 3.529621 (+1.59z)| norm 0.2531 (-0.17z)| lr 5.68e-04 | 4184.74 ms | 32.3% bf16 MFU | 125391 tok/s step 7715/19560 | loss 3.400047 (-1.62z)| norm 0.2440 (-0.72z)| lr 5.68e-04 | 4173.04 ms | 32.4% bf16 MFU | 125404 tok/s step 7716/19560 | loss 3.483450 (+0.44z)| norm 0.2470 (-0.54z)| lr 5.68e-04 | 4174.67 ms | 32.3% bf16 MFU | 125413 tok/s step 7717/19560 | loss 3.426320 (-0.96z)| norm 0.2632 (+0.43z)| lr 5.68e-04 | 4174.66 ms | 32.3% bf16 MFU | 125422 tok/s step 7718/19560 | loss 3.472588 (+0.17z)| norm 
0.2262 (-1.78z)| lr 5.68e-04 | 4165.10 ms | 32.4% bf16 MFU | 125444 tok/s step 7719/19560 | loss 3.463817 (-0.04z)| norm 0.2463 (-0.58z)| lr 5.68e-04 | 4187.72 ms | 32.2% bf16 MFU | 125432 tok/s step 7720/19560 | loss 3.435906 (-0.72z)| norm 0.2433 (-0.76z)| lr 5.68e-04 | 4174.86 ms | 32.3% bf16 MFU | 125439 tok/s step 7721/19560 | loss 3.429347 (-0.87z)| norm 0.2437 (-0.73z)| lr 5.68e-04 | 4167.80 ms | 32.4% bf16 MFU | 125457 tok/s step 7722/19560 | loss 3.455361 (-0.24z)| norm 0.2483 (-0.48z)| lr 5.68e-04 | 4168.97 ms | 32.4% bf16 MFU | 125472 tok/s step 7723/19560 | loss 3.466883 (+0.03z)| norm 0.2404 (-0.95z)| lr 5.68e-04 | 4159.99 ms | 32.5% bf16 MFU | 125500 tok/s step 7724/19560 | loss 3.451695 (-0.34z)| norm 0.2418 (-0.86z)| lr 5.68e-04 | 4185.36 ms | 32.3% bf16 MFU | 125489 tok/s step 7725/19560 | loss 3.435325 (-0.73z)| norm 0.2404 (-0.95z)| lr 5.68e-04 | 4170.04 ms | 32.4% bf16 MFU | 125501 tok/s step 7726/19560 | loss 3.426931 (-0.93z)| norm 0.2587 (+0.16z)| lr 5.68e-04 | 4168.28 ms | 32.4% bf16 MFU | 125515 tok/s step 7727/19560 | loss 3.429351 (-0.86z)| norm 0.2664 (+0.62z)| lr 5.68e-04 | 4160.50 ms | 32.5% bf16 MFU | 125540 tok/s step 7728/19560 | loss 3.420472 (-1.07z)| norm 0.2364 (-1.20z)| lr 5.68e-04 | 4175.97 ms | 32.3% bf16 MFU | 125540 tok/s step 7729/19560 | loss 3.453582 (-0.26z)| norm 0.2418 (-0.88z)| lr 5.67e-04 | 4178.28 ms | 32.3% bf16 MFU | 125537 tok/s step 7730/19560 | loss 3.434802 (-0.72z)| norm 0.2404 (-0.96z)| lr 5.67e-04 | 4203.52 ms | 32.1% bf16 MFU | 125496 tok/s step 7731/19560 | loss 3.460380 (-0.09z)| norm 0.2299 (-1.59z)| lr 5.67e-04 | 4162.19 ms | 32.4% bf16 MFU | 125520 tok/s step 7732/19560 | loss 3.452875 (-0.26z)| norm 0.2436 (-0.77z)| lr 5.67e-04 | 4164.54 ms | 32.4% bf16 MFU | 125539 tok/s step 7733/19560 | loss 3.437757 (-0.62z)| norm 0.2496 (-0.40z)| lr 5.67e-04 | 4169.84 ms | 32.4% bf16 MFU | 125548 tok/s step 7734/19560 | loss 3.476079 (+0.32z)| norm 0.2470 (-0.54z)| lr 5.67e-04 | 5030.39 ms | 26.8% bf16 MFU | 
124482 tok/s step 7735/19560 | loss 3.484385 (+0.53z)| norm 0.2505 (-0.32z)| lr 5.67e-04 | 4605.87 ms | 29.3% bf16 MFU | 123949 tok/s step 7736/19560 | loss 3.405863 (-1.43z)| norm 0.2330 (-1.38z)| lr 5.67e-04 | 4826.68 ms | 28.0% bf16 MFU | 123183 tok/s step 7737/19560 | loss 3.461632 (-0.02z)| norm 0.2416 (-0.85z)| lr 5.67e-04 | 4317.08 ms | 31.3% bf16 MFU | 123096 tok/s step 7738/19560 | loss 3.364781 (-2.40z)| norm 0.2651 (+0.59z)| lr 5.67e-04 | 4253.09 ms | 31.7% bf16 MFU | 123105 tok/s step 7739/19560 | loss 3.425256 (-0.89z)| norm 0.2715 (+0.97z)| lr 5.67e-04 | 4401.19 ms | 30.7% bf16 MFU | 122906 tok/s step 7740/19560 | loss 3.431556 (-0.72z)| norm 0.2535 (-0.12z)| lr 5.67e-04 | 4233.11 ms | 31.9% bf16 MFU | 122953 tok/s step 7741/19560 | loss 3.444709 (-0.39z)| norm 0.2763 (+1.25z)| lr 5.67e-04 | 4320.66 ms | 31.2% bf16 MFU | 122873 tok/s step 7742/19560 | loss 3.451112 (-0.23z)| norm 0.2440 (-0.70z)| lr 5.67e-04 | 4356.25 ms | 31.0% bf16 MFU | 122747 tok/s step 7743/19560 | loss 3.503595 (+1.06z)| norm 0.2608 (+0.31z)| lr 5.67e-04 | 4163.56 ms | 32.4% bf16 MFU | 122906 tok/s step 7744/19560 | loss 3.499458 (+0.94z)| norm 0.2714 (+0.94z)| lr 5.67e-04 | 4163.97 ms | 32.4% bf16 MFU | 123056 tok/s step 7745/19560 | loss 3.456711 (-0.11z)| norm 0.2467 (-0.58z)| lr 5.66e-04 | 4170.54 ms | 32.4% bf16 MFU | 123189 tok/s step 7746/19560 | loss 3.478862 (+0.43z)| norm 0.2762 (+1.22z)| lr 5.66e-04 | 4178.93 ms | 32.3% bf16 MFU | 123302 tok/s step 7747/19560 | loss 3.388861 (-1.75z)| norm 0.2876 (+1.88z)| lr 5.66e-04 | 4163.64 ms | 32.4% bf16 MFU | 123433 tok/s step 7748/19560 | loss 3.473914 (+0.33z)| norm 0.2489 (-0.46z)| lr 5.66e-04 | 4260.21 ms | 31.7% bf16 MFU | 123415 tok/s step 7749/19560 | loss 3.471739 (+0.28z)| norm 0.2329 (-1.41z)| lr 5.66e-04 | 4284.12 ms | 31.5% bf16 MFU | 123363 tok/s step 7750/19560 | loss 3.465719 (+0.14z)| norm 0.2790 (+1.36z)| lr 5.66e-04 | 4176.08 ms | 32.3% bf16 MFU | 123472 tok/s val loss 3.440479 evaluating HellaSwag: 0/1256 
evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 
650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2874/10042 = 0.286198 step 7751/19560 | loss 3.430940 (-0.72z)| norm 0.2560 (-0.01z)| lr 
5.66e-04 | 4344.79 ms | 31.1% bf16 MFU | 123332 tok/s step 7752/19560 | loss 3.465163 (+0.11z)| norm 0.2760 (+1.19z)| lr 5.66e-04 | 4218.80 ms | 32.0% bf16 MFU | 123379 tok/s step 7753/19560 | loss 3.524225 (+1.54z)| norm 0.3081 (+3.01z)| lr 5.66e-04 | 4212.67 ms | 32.1% bf16 MFU | 123433 tok/s step 7754/19560 | loss 3.464309 (+0.10z)| norm 0.2786 (+1.26z)| lr 5.66e-04 | 4194.76 ms | 32.2% bf16 MFU | 123511 tok/s step 7755/19560 | loss 3.462926 (+0.06z)| norm 0.2749 (+1.03z)| lr 5.66e-04 | 4216.45 ms | 32.0% bf16 MFU | 123552 tok/s step 7756/19560 | loss 3.576380 (+2.77z)| norm 0.2706 (+0.77z)| lr 5.66e-04 | 4181.11 ms | 32.3% bf16 MFU | 123644 tok/s step 7757/19560 | loss 3.434791 (-0.62z)| norm 0.2562 (-0.06z)| lr 5.66e-04 | 4177.50 ms | 32.3% bf16 MFU | 123737 tok/s step 7758/19560 | loss 3.440320 (-0.48z)| norm 0.2760 (+1.07z)| lr 5.66e-04 | 4174.57 ms | 32.3% bf16 MFU | 123830 tok/s step 7759/19560 | loss 3.462150 (+0.05z)| norm 0.2401 (-0.99z)| lr 5.66e-04 | 4172.87 ms | 32.4% bf16 MFU | 123921 tok/s step 7760/19560 | loss 3.474829 (+0.35z)| norm 0.2616 (+0.24z)| lr 5.66e-04 | 4301.56 ms | 31.4% bf16 MFU | 123819 tok/s step 7761/19560 | loss 3.445096 (-0.38z)| norm 0.2424 (-0.85z)| lr 5.65e-04 | 4170.21 ms | 32.4% bf16 MFU | 123914 tok/s step 7762/19560 | loss 3.429246 (-0.80z)| norm 0.2327 (-1.39z)| lr 5.65e-04 | 4180.00 ms | 32.3% bf16 MFU | 123990 tok/s step 7763/19560 | loss 3.452874 (-0.20z)| norm 0.2463 (-0.60z)| lr 5.65e-04 | 4171.00 ms | 32.4% bf16 MFU | 124075 tok/s step 7764/19560 | loss 3.486324 (+0.64z)| norm 0.2420 (-0.84z)| lr 5.65e-04 | 4211.55 ms | 32.1% bf16 MFU | 124096 tok/s step 7765/19560 | loss 3.462401 (+0.04z)| norm 0.2445 (-0.69z)| lr 5.65e-04 | 4199.26 ms | 32.2% bf16 MFU | 124134 tok/s step 7766/19560 | loss 3.481952 (+0.53z)| norm 0.2752 (+1.05z)| lr 5.65e-04 | 4164.40 ms | 32.4% bf16 MFU | 124222 tok/s step 7767/19560 | loss 3.461596 (+0.02z)| norm 0.2602 (+0.19z)| lr 5.65e-04 | 4245.70 ms | 31.8% bf16 MFU | 124185 tok/s step 
7768/19560 | loss 3.517915 (+1.43z)| norm 0.2813 (+1.38z)| lr 5.65e-04 | 4179.59 ms | 32.3% bf16 MFU | 124248 tok/s step 7769/19560 | loss 3.437826 (-0.60z)| norm 0.3198 (+3.37z)| lr 5.65e-04 | 4162.03 ms | 32.4% bf16 MFU | 124334 tok/s step 7770/19560 | loss 3.504521 (+1.09z)| norm 0.2587 (+0.06z)| lr 5.65e-04 | 4198.28 ms | 32.2% bf16 MFU | 124361 tok/s step 7771/19560 | loss 3.499479 (+0.95z)| norm 0.2594 (+0.09z)| lr 5.65e-04 | 4185.15 ms | 32.3% bf16 MFU | 124407 tok/s step 7772/19560 | loss 3.448617 (-0.33z)| norm 0.2532 (-0.25z)| lr 5.65e-04 | 4388.00 ms | 30.8% bf16 MFU | 124161 tok/s step 7773/19560 | loss 3.481713 (+0.49z)| norm 0.2552 (-0.14z)| lr 5.65e-04 | 4279.71 ms | 31.5% bf16 MFU | 124078 tok/s step 7774/19560 | loss 3.464457 (+0.04z)| norm 0.2676 (+0.53z)| lr 5.65e-04 | 4295.15 ms | 31.4% bf16 MFU | 123977 tok/s step 7775/19560 | loss 3.491678 (+0.73z)| norm 0.2610 (+0.16z)| lr 5.65e-04 | 4352.75 ms | 31.0% bf16 MFU | 123801 tok/s step 7776/19560 | loss 3.445339 (-0.48z)| norm 0.2638 (+0.30z)| lr 5.65e-04 | 4213.86 ms | 32.0% bf16 MFU | 123832 tok/s step 7777/19560 | loss 3.413522 (-1.30z)| norm 0.2442 (-0.78z)| lr 5.65e-04 | 4173.83 ms | 32.3% bf16 MFU | 123921 tok/s step 7778/19560 | loss 3.425410 (-0.98z)| norm 0.2623 (+0.23z)| lr 5.64e-04 | 4182.51 ms | 32.3% bf16 MFU | 123992 tok/s step 7779/19560 | loss 3.473725 (+0.28z)| norm 0.2321 (-1.43z)| lr 5.64e-04 | 4227.02 ms | 31.9% bf16 MFU | 123994 tok/s step 7780/19560 | loss 3.495289 (+0.83z)| norm 0.2540 (-0.23z)| lr 5.64e-04 | 4187.25 ms | 32.2% bf16 MFU | 124055 tok/s step 7781/19560 | loss 3.482129 (+0.51z)| norm 0.2554 (-0.16z)| lr 5.64e-04 | 4179.90 ms | 32.3% bf16 MFU | 124124 tok/s step 7782/19560 | loss 3.436688 (-0.71z)| norm 0.2708 (+0.69z)| lr 5.64e-04 | 4240.35 ms | 31.8% bf16 MFU | 124100 tok/s step 7783/19560 | loss 3.486142 (+0.63z)| norm 0.2674 (+0.50z)| lr 5.64e-04 | 4164.31 ms | 32.4% bf16 MFU | 124190 tok/s step 7784/19560 | loss 3.441065 (-0.61z)| norm 0.2560 (-0.13z)| lr 
5.64e-04 | 4172.80 ms | 32.4% bf16 MFU | 124263 tok/s step 7785/19560 | loss 3.440166 (-0.62z)| norm 0.2645 (+0.34z)| lr 5.64e-04 | 4186.28 ms | 32.3% bf16 MFU | 124311 tok/s step 7786/19560 | loss 3.452547 (-0.28z)| norm 0.2513 (-0.38z)| lr 5.64e-04 | 4169.74 ms | 32.4% bf16 MFU | 124383 tok/s step 7787/19560 | loss 3.445135 (-0.49z)| norm 0.2584 (+0.01z)| lr 5.64e-04 | 4176.41 ms | 32.3% bf16 MFU | 124440 tok/s step 7788/19560 | loss 3.461147 (-0.02z)| norm 0.2608 (+0.15z)| lr 5.64e-04 | 4187.11 ms | 32.2% bf16 MFU | 124479 tok/s step 7789/19560 | loss 3.431198 (-0.87z)| norm 0.2592 (+0.05z)| lr 5.64e-04 | 4191.56 ms | 32.2% bf16 MFU | 124509 tok/s step 7790/19560 | loss 3.421391 (-1.13z)| norm 0.2770 (+1.05z)| lr 5.64e-04 | 4184.71 ms | 32.3% bf16 MFU | 124548 tok/s step 7791/19560 | loss 3.524985 (+1.82z)| norm 0.2871 (+1.62z)| lr 5.64e-04 | 4203.19 ms | 32.1% bf16 MFU | 124557 tok/s step 7792/19560 | loss 3.434208 (-0.77z)| norm 0.2404 (-1.02z)| lr 5.64e-04 | 4168.83 ms | 32.4% bf16 MFU | 124618 tok/s step 7793/19560 | loss 3.477149 (+0.45z)| norm 0.2914 (+1.88z)| lr 5.64e-04 | 4220.85 ms | 32.0% bf16 MFU | 124598 tok/s step 7794/19560 | loss 3.510678 (+1.38z)| norm 0.2678 (+0.55z)| lr 5.63e-04 | 4179.19 ms | 32.3% bf16 MFU | 124640 tok/s step 7795/19560 | loss 3.427854 (-0.98z)| norm 0.2851 (+1.51z)| lr 5.63e-04 | 4178.97 ms | 32.3% bf16 MFU | 124681 tok/s step 7796/19560 | loss 3.453670 (-0.25z)| norm 0.2415 (-0.95z)| lr 5.63e-04 | 4208.93 ms | 32.1% bf16 MFU | 124675 tok/s step 7797/19560 | loss 3.449006 (-0.39z)| norm 0.2659 (+0.44z)| lr 5.63e-04 | 4183.06 ms | 32.3% bf16 MFU | 124708 tok/s step 7798/19560 | loss 3.440534 (-0.64z)| norm 0.2666 (+0.47z)| lr 5.63e-04 | 4176.25 ms | 32.3% bf16 MFU | 124750 tok/s step 7799/19560 | loss 3.505023 (+1.22z)| norm 0.2788 (+1.15z)| lr 5.63e-04 | 4203.17 ms | 32.1% bf16 MFU | 124749 tok/s step 7800/19560 | loss 3.425585 (-1.05z)| norm 0.2531 (-0.31z)| lr 5.63e-04 | 4163.18 ms | 32.4% bf16 MFU | 124809 tok/s step 
7801/19560 | loss 3.470297 (+0.23z)| norm 0.2758 (+0.98z)| lr 5.63e-04 | 4173.57 ms | 32.4% bf16 MFU | 124849 tok/s step 7802/19560 | loss 3.464230 (+0.04z)| norm 0.3041 (+2.52z)| lr 5.63e-04 | 4244.62 ms | 31.8% bf16 MFU | 124783 tok/s step 7803/19560 | loss 3.450187 (-0.39z)| norm 0.2603 (+0.07z)| lr 5.63e-04 | 4173.88 ms | 32.3% bf16 MFU | 124824 tok/s step 7804/19560 | loss 3.449925 (-0.39z)| norm 0.2699 (+0.60z)| lr 5.63e-04 | 4177.57 ms | 32.3% bf16 MFU | 124858 tok/s step 7805/19560 | loss 3.490078 (+0.78z)| norm 0.3005 (+2.25z)| lr 5.63e-04 | 4186.48 ms | 32.3% bf16 MFU | 124877 tok/s step 7806/19560 | loss 3.465050 (+0.03z)| norm 0.2881 (+1.54z)| lr 5.63e-04 | 4199.24 ms | 32.2% bf16 MFU | 124876 tok/s step 7807/19560 | loss 3.452748 (-0.33z)| norm 0.3160 (+2.95z)| lr 5.63e-04 | 4230.72 ms | 31.9% bf16 MFU | 124828 tok/s step 7808/19560 | loss 3.482203 (+0.57z)| norm 0.2794 (+0.98z)| lr 5.63e-04 | 4175.96 ms | 32.3% bf16 MFU | 124864 tok/s step 7809/19560 | loss 3.479413 (+0.48z)| norm 0.3057 (+2.32z)| lr 5.63e-04 | 4180.66 ms | 32.3% bf16 MFU | 124891 tok/s step 7810/19560 | loss 3.488590 (+0.75z)| norm 0.2480 (-0.71z)| lr 5.62e-04 | 4177.89 ms | 32.3% bf16 MFU | 124921 tok/s step 7811/19560 | loss 3.431179 (-0.98z)| norm 0.3039 (+2.17z)| lr 5.62e-04 | 4159.13 ms | 32.5% bf16 MFU | 124978 tok/s step 7812/19560 | loss 3.481372 (+0.55z)| norm 0.2439 (-0.92z)| lr 5.62e-04 | 4179.38 ms | 32.3% bf16 MFU | 125001 tok/s step 7813/19560 | loss 3.465618 (+0.06z)| norm 0.3121 (+2.51z)| lr 5.62e-04 | 4184.25 ms | 32.3% bf16 MFU | 125016 tok/s step 7814/19560 | loss 3.472576 (+0.29z)| norm 0.2665 (+0.22z)| lr 5.62e-04 | 4178.76 ms | 32.3% bf16 MFU | 125039 tok/s step 7815/19560 | loss 3.491072 (+0.85z)| norm 0.2546 (-0.38z)| lr 5.62e-04 | 4181.67 ms | 32.3% bf16 MFU | 125056 tok/s step 7816/19560 | loss 3.457043 (-0.20z)| norm 0.2576 (-0.23z)| lr 5.62e-04 | 4170.53 ms | 32.4% bf16 MFU | 125089 tok/s step 7817/19560 | loss 3.452980 (-0.32z)| norm 0.2688 (+0.34z)| lr 
5.62e-04 | 4170.42 ms | 32.4% bf16 MFU | 125120 tok/s step 7818/19560 | loss 3.413198 (-1.51z)| norm 0.2412 (-1.04z)| lr 5.62e-04 | 4181.29 ms | 32.3% bf16 MFU | 125133 tok/s step 7819/19560 | loss 3.511703 (+1.48z)| norm 0.2697 (+0.38z)| lr 5.62e-04 | 4182.38 ms | 32.3% bf16 MFU | 125145 tok/s step 7820/19560 | loss 3.493353 (+0.92z)| norm 0.2724 (+0.51z)| lr 5.62e-04 | 4170.84 ms | 32.4% bf16 MFU | 125173 tok/s step 7821/19560 | loss 3.428074 (-1.06z)| norm 0.2630 (+0.05z)| lr 5.62e-04 | 4169.80 ms | 32.4% bf16 MFU | 125201 tok/s step 7822/19560 | loss 3.460248 (-0.09z)| norm 0.2517 (-0.53z)| lr 5.62e-04 | 4179.45 ms | 32.3% bf16 MFU | 125213 tok/s step 7823/19560 | loss 3.421138 (-1.25z)| norm 0.2679 (+0.30z)| lr 5.62e-04 | 4176.41 ms | 32.3% bf16 MFU | 125229 tok/s step 7824/19560 | loss 3.470330 (+0.24z)| norm 0.2597 (-0.12z)| lr 5.62e-04 | 4176.34 ms | 32.3% bf16 MFU | 125244 tok/s step 7825/19560 | loss 3.499927 (+1.14z)| norm 0.2661 (+0.22z)| lr 5.62e-04 | 4168.68 ms | 32.4% bf16 MFU | 125271 tok/s step 7826/19560 | loss 3.437074 (-0.76z)| norm 0.2418 (-1.04z)| lr 5.61e-04 | 4175.88 ms | 32.3% bf16 MFU | 125285 tok/s step 7827/19560 | loss 3.423741 (-1.15z)| norm 0.2370 (-1.27z)| lr 5.61e-04 | 4169.06 ms | 32.4% bf16 MFU | 125308 tok/s step 7828/19560 | loss 3.441180 (-0.61z)| norm 0.2417 (-1.02z)| lr 5.61e-04 | 4203.51 ms | 32.1% bf16 MFU | 125279 tok/s step 7829/19560 | loss 3.470352 (+0.31z)| norm 0.2516 (-0.50z)| lr 5.61e-04 | 4237.27 ms | 31.9% bf16 MFU | 125202 tok/s step 7830/19560 | loss 3.452223 (-0.26z)| norm 0.2188 (-2.19z)| lr 5.61e-04 | 4179.06 ms | 32.3% bf16 MFU | 125214 tok/s step 7831/19560 | loss 3.397251 (-2.01z)| norm 0.2296 (-1.60z)| lr 5.61e-04 | 4172.25 ms | 32.4% bf16 MFU | 125237 tok/s step 7832/19560 | loss 3.550015 (+2.81z)| norm 0.2779 (+0.91z)| lr 5.61e-04 | 4173.62 ms | 32.4% bf16 MFU | 125256 tok/s step 7833/19560 | loss 3.445495 (-0.46z)| norm 0.2622 (+0.09z)| lr 5.61e-04 | 4183.70 ms | 32.3% bf16 MFU | 125259 tok/s step 
7834/19560 | loss 3.431850 (-0.88z)| norm 0.2384 (-1.12z)| lr 5.61e-04 | 4166.47 ms | 32.4% bf16 MFU | 125288 tok/s step 7835/19560 | loss 3.541111 (+2.47z)| norm 0.2624 (+0.12z)| lr 5.61e-04 | 4164.77 ms | 32.4% bf16 MFU | 125318 tok/s step 7836/19560 | loss 3.487034 (+0.81z)| norm 0.2471 (-0.67z)| lr 5.61e-04 | 4233.74 ms | 31.9% bf16 MFU | 125244 tok/s step 7837/19560 | loss 3.514658 (+1.64z)| norm 0.2618 (+0.09z)| lr 5.61e-04 | 4176.02 ms | 32.3% bf16 MFU | 125259 tok/s step 7838/19560 | loss 3.446863 (-0.41z)| norm 0.2660 (+0.30z)| lr 5.61e-04 | 4177.25 ms | 32.3% bf16 MFU | 125271 tok/s step 7839/19560 | loss 3.481752 (+0.64z)| norm 0.2614 (+0.06z)| lr 5.61e-04 | 4164.52 ms | 32.4% bf16 MFU | 125303 tok/s step 7840/19560 | loss 3.438127 (-0.68z)| norm 0.2555 (-0.25z)| lr 5.61e-04 | 4185.19 ms | 32.3% bf16 MFU | 125301 tok/s step 7841/19560 | loss 3.440844 (-0.59z)| norm 0.2546 (-0.29z)| lr 5.61e-04 | 4240.29 ms | 31.8% bf16 MFU | 125218 tok/s step 7842/19560 | loss 3.442818 (-0.51z)| norm 0.2665 (+0.32z)| lr 5.60e-04 | 4223.31 ms | 32.0% bf16 MFU | 125164 tok/s step 7843/19560 | loss 3.553384 (+2.81z)| norm 0.2628 (+0.12z)| lr 5.60e-04 | 4163.49 ms | 32.4% bf16 MFU | 125202 tok/s step 7844/19560 | loss 3.443284 (-0.52z)| norm 0.2549 (-0.29z)| lr 5.60e-04 | 4167.41 ms | 32.4% bf16 MFU | 125233 tok/s step 7845/19560 | loss 3.441299 (-0.58z)| norm 0.2750 (+0.75z)| lr 5.60e-04 | 4184.46 ms | 32.3% bf16 MFU | 125236 tok/s step 7846/19560 | loss 3.495784 (+1.07z)| norm 0.2696 (+0.46z)| lr 5.60e-04 | 4180.60 ms | 32.3% bf16 MFU | 125244 tok/s step 7847/19560 | loss 3.468357 (+0.23z)| norm 0.3297 (+3.43z)| lr 5.60e-04 | 4173.52 ms | 32.4% bf16 MFU | 125263 tok/s step 7848/19560 | loss 3.436460 (-0.73z)| norm 0.2805 (+0.94z)| lr 5.60e-04 | 4176.15 ms | 32.3% bf16 MFU | 125277 tok/s step 7849/19560 | loss 3.453206 (-0.23z)| norm 0.2494 (-0.63z)| lr 5.60e-04 | 4172.99 ms | 32.4% bf16 MFU | 125295 tok/s step 7850/19560 | loss 3.467060 (+0.19z)| norm 0.2682 (+0.31z)| lr 
5.60e-04 | 4172.95 ms | 32.4% bf16 MFU | 125313 tok/s step 7851/19560 | loss 3.430429 (-0.92z)| norm 0.2630 (+0.04z)| lr 5.60e-04 | 4168.55 ms | 32.4% bf16 MFU | 125336 tok/s step 7852/19560 | loss 3.446268 (-0.43z)| norm 0.2746 (+0.62z)| lr 5.60e-04 | 4167.19 ms | 32.4% bf16 MFU | 125359 tok/s step 7853/19560 | loss 3.447749 (-0.39z)| norm 0.2597 (-0.15z)| lr 5.60e-04 | 4188.74 ms | 32.2% bf16 MFU | 125350 tok/s step 7854/19560 | loss 3.579091 (+3.41z)| norm 0.2660 (+0.17z)| lr 5.60e-04 | 4174.98 ms | 32.3% bf16 MFU | 125361 tok/s step 7855/19560 | loss 3.431558 (-0.88z)| norm 0.2430 (-0.99z)| lr 5.60e-04 | 4179.08 ms | 32.3% bf16 MFU | 125366 tok/s step 7856/19560 | loss 3.461738 (-0.01z)| norm 0.2856 (+1.16z)| lr 5.60e-04 | 4181.28 ms | 32.3% bf16 MFU | 125367 tok/s step 7857/19560 | loss 3.458041 (-0.12z)| norm 0.2443 (-0.95z)| lr 5.60e-04 | 4175.39 ms | 32.3% bf16 MFU | 125377 tok/s step 7858/19560 | loss 3.474687 (+0.36z)| norm 0.2710 (+0.40z)| lr 5.59e-04 | 4182.44 ms | 32.3% bf16 MFU | 125376 tok/s step 7859/19560 | loss 3.464548 (+0.06z)| norm 0.2619 (-0.08z)| lr 5.59e-04 | 4181.79 ms | 32.3% bf16 MFU | 125376 tok/s step 7860/19560 | loss 3.429882 (-0.95z)| norm 0.2640 (+0.02z)| lr 5.59e-04 | 4169.79 ms | 32.4% bf16 MFU | 125394 tok/s step 7861/19560 | loss 3.409961 (-1.52z)| norm 0.2308 (-1.68z)| lr 5.59e-04 | 4176.08 ms | 32.3% bf16 MFU | 125401 tok/s step 7862/19560 | loss 3.445962 (-0.46z)| norm 0.2574 (-0.31z)| lr 5.59e-04 | 4169.67 ms | 32.4% bf16 MFU | 125418 tok/s step 7863/19560 | loss 3.479548 (+0.51z)| norm 0.2380 (-1.30z)| lr 5.59e-04 | 4190.02 ms | 32.2% bf16 MFU | 125404 tok/s step 7864/19560 | loss 3.525004 (+1.80z)| norm 0.2276 (-1.83z)| lr 5.59e-04 | 4167.23 ms | 32.4% bf16 MFU | 125424 tok/s step 7865/19560 | loss 3.436341 (-0.76z)| norm 0.2471 (-0.84z)| lr 5.59e-04 | 4179.87 ms | 32.3% bf16 MFU | 125424 tok/s step 7866/19560 | loss 3.413608 (-1.46z)| norm 0.2407 (-1.15z)| lr 5.59e-04 | 4211.42 ms | 32.1% bf16 MFU | 125378 tok/s step 
7867/19560 | loss 3.452012 (-0.33z)| norm 0.2201 (-2.14z)| lr 5.59e-04 | 4180.89 ms | 32.3% bf16 MFU | 125379 tok/s step 7868/19560 | loss 3.401116 (-1.82z)| norm 0.2522 (-0.53z)| lr 5.59e-04 | 4177.87 ms | 32.3% bf16 MFU | 125385 tok/s step 7869/19560 | loss 3.422217 (-1.19z)| norm 0.2378 (-1.23z)| lr 5.59e-04 | 4277.43 ms | 31.6% bf16 MFU | 125244 tok/s step 7870/19560 | loss 3.421705 (-1.19z)| norm 0.2299 (-1.61z)| lr 5.59e-04 | 4170.16 ms | 32.4% bf16 MFU | 125268 tok/s step 7871/19560 | loss 3.419784 (-1.23z)| norm 0.2305 (-1.55z)| lr 5.59e-04 | 4166.00 ms | 32.4% bf16 MFU | 125297 tok/s step 7872/19560 | loss 3.428794 (-0.95z)| norm 0.2243 (-1.82z)| lr 5.59e-04 | 4186.47 ms | 32.3% bf16 MFU | 125294 tok/s step 7873/19560 | loss 3.418101 (-1.25z)| norm 0.2411 (-1.00z)| lr 5.59e-04 | 4163.14 ms | 32.4% bf16 MFU | 125326 tok/s step 7874/19560 | loss 3.448223 (-0.37z)| norm 0.2337 (-1.33z)| lr 5.58e-04 | 4176.93 ms | 32.3% bf16 MFU | 125336 tok/s step 7875/19560 | loss 3.441733 (-0.58z)| norm 0.2551 (-0.29z)| lr 5.58e-04 | 4171.78 ms | 32.4% bf16 MFU | 125353 tok/s step 7876/19560 | loss 3.439506 (-0.63z)| norm 0.2583 (-0.14z)| lr 5.58e-04 | 4168.56 ms | 32.4% bf16 MFU | 125374 tok/s step 7877/19560 | loss 3.462845 (+0.06z)| norm 0.2457 (-0.76z)| lr 5.58e-04 | 4174.72 ms | 32.3% bf16 MFU | 125384 tok/s step 7878/19560 | loss 3.449191 (-0.34z)| norm 0.2414 (-0.96z)| lr 5.58e-04 | 4157.18 ms | 32.5% bf16 MFU | 125421 tok/s step 7879/19560 | loss 3.492781 (+0.93z)| norm 0.2474 (-0.66z)| lr 5.58e-04 | 4168.96 ms | 32.4% bf16 MFU | 125438 tok/s step 7880/19560 | loss 3.441807 (-0.57z)| norm 0.2296 (-1.50z)| lr 5.58e-04 | 4178.93 ms | 32.3% bf16 MFU | 125439 tok/s step 7881/19560 | loss 3.440189 (-0.60z)| norm 0.2391 (-1.03z)| lr 5.58e-04 | 4176.98 ms | 32.3% bf16 MFU | 125443 tok/s step 7882/19560 | loss 3.410350 (-1.47z)| norm 0.2550 (-0.24z)| lr 5.58e-04 | 4199.89 ms | 32.1% bf16 MFU | 125412 tok/s step 7883/19560 | loss 3.446489 (-0.40z)| norm 0.2524 (-0.36z)| lr 
5.58e-04 | 4170.76 ms | 32.4% bf16 MFU | 125427 tok/s step 7884/19560 | loss 3.516396 (+1.74z)| norm 0.2557 (-0.19z)| lr 5.58e-04 | 4166.49 ms | 32.4% bf16 MFU | 125447 tok/s step 7885/19560 | loss 3.445359 (-0.43z)| norm 0.2578 (-0.09z)| lr 5.58e-04 | 4164.36 ms | 32.4% bf16 MFU | 125470 tok/s step 7886/19560 | loss 3.498729 (+1.18z)| norm 0.2444 (-0.74z)| lr 5.58e-04 | 4176.14 ms | 32.3% bf16 MFU | 125474 tok/s step 7887/19560 | loss 3.492116 (+0.97z)| norm 0.2679 (+0.42z)| lr 5.58e-04 | 4165.24 ms | 32.4% bf16 MFU | 125494 tok/s step 7888/19560 | loss 3.423242 (-1.11z)| norm 0.2867 (+1.33z)| lr 5.58e-04 | 4181.71 ms | 32.3% bf16 MFU | 125488 tok/s step 7889/19560 | loss 3.434100 (-0.78z)| norm 0.2569 (-0.14z)| lr 5.58e-04 | 4170.75 ms | 32.4% bf16 MFU | 125499 tok/s step 7890/19560 | loss 3.446115 (-0.42z)| norm 0.2779 (+0.88z)| lr 5.58e-04 | 4250.03 ms | 31.8% bf16 MFU | 125392 tok/s step 7891/19560 | loss 3.463958 (+0.12z)| norm 0.2994 (+1.90z)| lr 5.57e-04 | 4179.93 ms | 32.3% bf16 MFU | 125394 tok/s step 7892/19560 | loss 3.456557 (-0.10z)| norm 0.2933 (+1.57z)| lr 5.57e-04 | 4173.19 ms | 32.4% bf16 MFU | 125406 tok/s step 7893/19560 | loss 3.441138 (-0.56z)| norm 0.2820 (+1.01z)| lr 5.57e-04 | 4237.45 ms | 31.9% bf16 MFU | 125322 tok/s step 7894/19560 | loss 3.451322 (-0.24z)| norm 0.2518 (-0.45z)| lr 5.57e-04 | 4172.61 ms | 32.4% bf16 MFU | 125338 tok/s step 7895/19560 | loss 3.420599 (-1.16z)| norm 0.2822 (+1.01z)| lr 5.57e-04 | 4164.87 ms | 32.4% bf16 MFU | 125365 tok/s step 7896/19560 | loss 3.432946 (-0.78z)| norm 0.2558 (-0.26z)| lr 5.57e-04 | 4173.71 ms | 32.3% bf16 MFU | 125378 tok/s step 7897/19560 | loss 3.498859 (+1.22z)| norm 0.2482 (-0.62z)| lr 5.57e-04 | 4242.72 ms | 31.8% bf16 MFU | 125288 tok/s step 7898/19560 | loss 3.378303 (-2.38z)| norm 0.2798 (+0.96z)| lr 5.57e-04 | 4165.67 ms | 32.4% bf16 MFU | 125316 tok/s step 7899/19560 | loss 3.430836 (-0.80z)| norm 0.2501 (-0.53z)| lr 5.57e-04 | 4175.48 ms | 32.3% bf16 MFU | 125329 tok/s step 
7900/19560 | loss 3.432155 (-0.75z)| norm 0.2378 (-1.13z)| lr 5.57e-04 | 4169.39 ms | 32.4% bf16 MFU | 125350 tok/s step 7901/19560 | loss 3.436806 (-0.60z)| norm 0.2367 (-1.17z)| lr 5.57e-04 | 4305.75 ms | 31.4% bf16 MFU | 125170 tok/s step 7902/19560 | loss 3.444992 (-0.35z)| norm 0.2494 (-0.54z)| lr 5.57e-04 | 4179.09 ms | 32.3% bf16 MFU | 125185 tok/s step 7903/19560 | loss 3.463849 (+0.22z)| norm 0.2262 (-1.65z)| lr 5.57e-04 | 4172.49 ms | 32.4% bf16 MFU | 125208 tok/s step 7904/19560 | loss 3.446961 (-0.29z)| norm 0.2316 (-1.37z)| lr 5.57e-04 | 4177.69 ms | 32.3% bf16 MFU | 125222 tok/s step 7905/19560 | loss 3.425469 (-0.94z)| norm 0.2779 (+0.87z)| lr 5.57e-04 | 4171.61 ms | 32.4% bf16 MFU | 125245 tok/s step 7906/19560 | loss 3.479734 (+0.69z)| norm 0.2347 (-1.21z)| lr 5.57e-04 | 4172.25 ms | 32.4% bf16 MFU | 125266 tok/s step 7907/19560 | loss 3.438222 (-0.56z)| norm 0.2530 (-0.34z)| lr 5.56e-04 | 4166.26 ms | 32.4% bf16 MFU | 125295 tok/s step 7908/19560 | loss 3.566829 (+3.20z)| norm 0.2588 (-0.06z)| lr 5.56e-04 | 4173.61 ms | 32.4% bf16 MFU | 125311 tok/s step 7909/19560 | loss 3.409023 (-1.38z)| norm 0.2468 (-0.63z)| lr 5.56e-04 | 4165.25 ms | 32.4% bf16 MFU | 125339 tok/s step 7910/19560 | loss 3.474150 (+0.50z)| norm 0.2562 (-0.17z)| lr 5.56e-04 | 4178.77 ms | 32.3% bf16 MFU | 125345 tok/s step 7911/19560 | loss 3.482447 (+0.74z)| norm 0.2332 (-1.27z)| lr 5.56e-04 | 4170.72 ms | 32.4% bf16 MFU | 125363 tok/s step 7912/19560 | loss 3.442885 (-0.41z)| norm 0.2832 (+1.13z)| lr 5.56e-04 | 4172.08 ms | 32.4% bf16 MFU | 125379 tok/s step 7913/19560 | loss 3.443935 (-0.38z)| norm 0.2722 (+0.60z)| lr 5.56e-04 | 4168.81 ms | 32.4% bf16 MFU | 125398 tok/s step 7914/19560 | loss 3.504651 (+1.37z)| norm 0.2620 (+0.10z)| lr 5.56e-04 | 4163.23 ms | 32.4% bf16 MFU | 125425 tok/s step 7915/19560 | loss 3.448626 (-0.25z)| norm 0.2945 (+1.63z)| lr 5.56e-04 | 4175.79 ms | 32.3% bf16 MFU | 125431 tok/s step 7916/19560 | loss 3.472163 (+0.42z)| norm 0.2834 (+1.09z)| lr 
5.56e-04 | 4178.72 ms | 32.3% bf16 MFU | 125433 tok/s step 7917/19560 | loss 3.455578 (-0.06z)| norm 0.2816 (+0.99z)| lr 5.56e-04 | 4174.19 ms | 32.3% bf16 MFU | 125441 tok/s step 7918/19560 | loss 3.445387 (-0.36z)| norm 0.2559 (-0.21z)| lr 5.56e-04 | 4161.38 ms | 32.4% bf16 MFU | 125469 tok/s step 7919/19560 | loss 3.492403 (+1.02z)| norm 0.2648 (+0.22z)| lr 5.56e-04 | 4172.35 ms | 32.4% bf16 MFU | 125478 tok/s step 7920/19560 | loss 3.448152 (-0.28z)| norm 0.2685 (+0.39z)| lr 5.56e-04 | 4246.65 ms | 31.8% bf16 MFU | 125377 tok/s step 7921/19560 | loss 3.484518 (+0.79z)| norm 0.2644 (+0.21z)| lr 5.56e-04 | 4182.15 ms | 32.3% bf16 MFU | 125377 tok/s step 7922/19560 | loss 3.397581 (-1.74z)| norm 0.2457 (-0.69z)| lr 5.56e-04 | 4161.12 ms | 32.4% bf16 MFU | 125408 tok/s step 7923/19560 | loss 3.461434 (+0.13z)| norm 0.2705 (+0.51z)| lr 5.55e-04 | 4177.64 ms | 32.3% bf16 MFU | 125412 tok/s step 7924/19560 | loss 3.419551 (-1.09z)| norm 0.2664 (+0.31z)| lr 5.55e-04 | 4539.73 ms | 29.7% bf16 MFU | 124916 tok/s step 7925/19560 | loss 3.441135 (-0.46z)| norm 0.2619 (+0.09z)| lr 5.55e-04 | 4643.83 ms | 29.1% bf16 MFU | 124315 tok/s step 7926/19560 | loss 3.470795 (+0.40z)| norm 0.2557 (-0.21z)| lr 5.55e-04 | 4353.62 ms | 31.0% bf16 MFU | 124121 tok/s step 7927/19560 | loss 3.437392 (-0.56z)| norm 0.2703 (+0.50z)| lr 5.55e-04 | 4469.19 ms | 30.2% bf16 MFU | 123780 tok/s step 7928/19560 | loss 3.363780 (-2.65z)| norm 0.2478 (-0.58z)| lr 5.55e-04 | 4312.77 ms | 31.3% bf16 MFU | 123670 tok/s step 7929/19560 | loss 3.523067 (+1.89z)| norm 0.2491 (-0.51z)| lr 5.55e-04 | 4202.75 ms | 32.1% bf16 MFU | 123723 tok/s step 7930/19560 | loss 3.454905 (-0.04z)| norm 0.2871 (+1.35z)| lr 5.55e-04 | 4268.89 ms | 31.6% bf16 MFU | 123678 tok/s step 7931/19560 | loss 3.484713 (+0.79z)| norm 0.2539 (-0.27z)| lr 5.55e-04 | 4232.47 ms | 31.9% bf16 MFU | 123688 tok/s step 7932/19560 | loss 3.447496 (-0.26z)| norm 0.2925 (+1.60z)| lr 5.55e-04 | 4260.18 ms | 31.7% bf16 MFU | 123657 tok/s step 
7933/19560 | loss 3.476751 (+0.57z)| norm 0.2552 (-0.20z)| lr 5.55e-04 | 4241.46 ms | 31.8% bf16 MFU | 123654 tok/s step 7934/19560 | loss 3.423681 (-0.92z)| norm 0.2700 (+0.54z)| lr 5.55e-04 | 4256.35 ms | 31.7% bf16 MFU | 123631 tok/s step 7935/19560 | loss 3.483110 (+0.75z)| norm 0.2778 (+0.97z)| lr 5.55e-04 | 4160.63 ms | 32.5% bf16 MFU | 123750 tok/s step 7936/19560 | loss 3.368265 (-2.41z)| norm 0.2472 (-0.59z)| lr 5.55e-04 | 4429.73 ms | 30.5% bf16 MFU | 123480 tok/s step 7937/19560 | loss 3.368184 (-2.34z)| norm 0.2675 (+0.49z)| lr 5.55e-04 | 4222.19 ms | 32.0% bf16 MFU | 123515 tok/s step 7938/19560 | loss 3.459601 (+0.14z)| norm 0.2473 (-0.58z)| lr 5.55e-04 | 4162.76 ms | 32.4% bf16 MFU | 123636 tok/s step 7939/19560 | loss 3.433107 (-0.58z)| norm 0.2702 (+0.65z)| lr 5.54e-04 | 4239.31 ms | 31.8% bf16 MFU | 123638 tok/s step 7940/19560 | loss 3.388816 (-1.74z)| norm 0.2572 (-0.05z)| lr 5.54e-04 | 4293.47 ms | 31.4% bf16 MFU | 123562 tok/s step 7941/19560 | loss 3.405401 (-1.28z)| norm 0.2554 (-0.13z)| lr 5.54e-04 | 4158.73 ms | 32.5% bf16 MFU | 123687 tok/s step 7942/19560 | loss 3.411802 (-1.09z)| norm 0.2742 (+0.91z)| lr 5.54e-04 | 4206.54 ms | 32.1% bf16 MFU | 123735 tok/s step 7943/19560 | loss 3.476257 (+0.63z)| norm 0.2647 (+0.38z)| lr 5.54e-04 | 4160.52 ms | 32.5% bf16 MFU | 123849 tok/s step 7944/19560 | loss 3.408345 (-1.17z)| norm 0.2596 (+0.10z)| lr 5.54e-04 | 4190.96 ms | 32.2% bf16 MFU | 123911 tok/s step 7945/19560 | loss 3.476695 (+0.64z)| norm 0.2828 (+1.37z)| lr 5.54e-04 | 4176.86 ms | 32.3% bf16 MFU | 123992 tok/s step 7946/19560 | loss 3.394383 (-1.53z)| norm 0.3015 (+2.33z)| lr 5.54e-04 | 4194.67 ms | 32.2% bf16 MFU | 124042 tok/s step 7947/19560 | loss 3.443501 (-0.22z)| norm 0.2603 (+0.11z)| lr 5.54e-04 | 4257.04 ms | 31.7% bf16 MFU | 123998 tok/s step 7948/19560 | loss 3.492083 (+1.07z)| norm 0.5773 (+9.43z)| lr 5.54e-04 | 4161.40 ms | 32.4% bf16 MFU | 124097 tok/s step 7949/19560 | loss 3.460117 (+0.21z)| norm 0.3463 (+2.47z)| lr 
5.54e-04 | 4166.77 ms | 32.4% bf16 MFU | 124184 tok/s step 7950/19560 | loss 3.571898 (+3.05z)| norm 0.3129 (+1.47z)| lr 5.54e-04 | 4162.10 ms | 32.4% bf16 MFU | 124273 tok/s step 7951/19560 | loss 3.444163 (-0.23z)| norm 0.2880 (+0.75z)| lr 5.54e-04 | 6714.79 ms | 20.1% bf16 MFU | 121963 tok/s step 7952/19560 | loss 3.431922 (-0.54z)| norm 0.2842 (+0.63z)| lr 5.54e-04 | 4208.83 ms | 32.1% bf16 MFU | 122093 tok/s step 7953/19560 | loss 3.437770 (-0.38z)| norm 0.2790 (+0.48z)| lr 5.54e-04 | 4152.29 ms | 32.5% bf16 MFU | 122302 tok/s step 7954/19560 | loss 3.452347 (-0.00z)| norm 0.2941 (+0.90z)| lr 5.54e-04 | 4198.26 ms | 32.2% bf16 MFU | 122431 tok/s step 7955/19560 | loss 3.421268 (-0.81z)| norm 0.2662 (+0.09z)| lr 5.53e-04 | 4157.45 ms | 32.5% bf16 MFU | 122615 tok/s step 7956/19560 | loss 3.459757 (+0.19z)| norm 0.2817 (+0.53z)| lr 5.53e-04 | 4166.44 ms | 32.4% bf16 MFU | 122776 tok/s step 7957/19560 | loss 3.444640 (-0.20z)| norm 0.2617 (-0.05z)| lr 5.53e-04 | 4159.28 ms | 32.5% bf16 MFU | 122940 tok/s step 7958/19560 | loss 3.453529 (+0.03z)| norm 0.2666 (+0.08z)| lr 5.53e-04 | 4154.70 ms | 32.5% bf16 MFU | 123102 tok/s step 7959/19560 | loss 3.505242 (+1.35z)| norm 0.2698 (+0.17z)| lr 5.53e-04 | 4211.61 ms | 32.1% bf16 MFU | 123171 tok/s step 7960/19560 | loss 3.402215 (-1.32z)| norm 0.2672 (+0.09z)| lr 5.53e-04 | 4294.57 ms | 31.4% bf16 MFU | 123117 tok/s step 7961/19560 | loss 3.452074 (-0.00z)| norm 0.2722 (+0.24z)| lr 5.53e-04 | 4157.26 ms | 32.5% bf16 MFU | 123267 tok/s step 7962/19560 | loss 3.395489 (-1.48z)| norm 0.2520 (-0.35z)| lr 5.53e-04 | 4154.33 ms | 32.5% bf16 MFU | 123414 tok/s step 7963/19560 | loss 3.440793 (-0.27z)| norm 0.2292 (-1.00z)| lr 5.53e-04 | 4154.37 ms | 32.5% bf16 MFU | 123553 tok/s step 7964/19560 | loss 3.399459 (-1.36z)| norm 0.2597 (-0.12z)| lr 5.53e-04 | 4183.42 ms | 32.3% bf16 MFU | 123642 tok/s step 7965/19560 | loss 3.374063 (-2.00z)| norm 0.2291 (-1.00z)| lr 5.53e-04 | 4211.31 ms | 32.1% bf16 MFU | 123684 tok/s step 
7966/19560 | loss 3.454169 (+0.13z)| norm 0.2613 (-0.07z)| lr 5.53e-04 | 4159.58 ms | 32.5% bf16 MFU | 123802 tok/s step 7967/19560 | loss 3.444249 (-0.13z)| norm 0.2689 (+0.15z)| lr 5.53e-04 | 4161.33 ms | 32.4% bf16 MFU | 123912 tok/s step 7968/19560 | loss 3.421094 (-0.74z)| norm 0.2768 (+0.37z)| lr 5.53e-04 | 4206.71 ms | 32.1% bf16 MFU | 123948 tok/s step 7969/19560 | loss 3.437959 (-0.29z)| norm 0.2645 (+0.02z)| lr 5.53e-04 | 4219.66 ms | 32.0% bf16 MFU | 123963 tok/s step 7970/19560 | loss 3.411935 (-0.97z)| norm 0.2575 (-0.19z)| lr 5.53e-04 | 4158.45 ms | 32.5% bf16 MFU | 124068 tok/s step 7971/19560 | loss 3.404223 (-1.18z)| norm 0.2870 (+0.66z)| lr 5.52e-04 | 4169.73 ms | 32.4% bf16 MFU | 124152 tok/s step 7972/19560 | loss 3.502462 (+1.47z)| norm 0.2343 (-0.85z)| lr 5.52e-04 | 4165.34 ms | 32.4% bf16 MFU | 124238 tok/s step 7973/19560 | loss 3.440520 (-0.20z)| norm 0.2401 (-0.68z)| lr 5.52e-04 | 4175.81 ms | 32.3% bf16 MFU | 124304 tok/s step 7974/19560 | loss 3.447799 (+0.01z)| norm 0.2517 (-0.34z)| lr 5.52e-04 | 4194.42 ms | 32.2% bf16 MFU | 124338 tok/s step 7975/19560 | loss 3.376869 (-1.87z)| norm 0.2379 (-0.72z)| lr 5.52e-04 | 4170.48 ms | 32.4% bf16 MFU | 124407 tok/s step 7976/19560 | loss 3.436119 (-0.29z)| norm 0.2538 (-0.25z)| lr 5.52e-04 | 4170.44 ms | 32.4% bf16 MFU | 124472 tok/s step 7977/19560 | loss 3.384772 (-1.63z)| norm 0.2348 (-0.80z)| lr 5.52e-04 | 4160.92 ms | 32.4% bf16 MFU | 124549 tok/s step 7978/19560 | loss 3.461918 (+0.41z)| norm 0.2741 (+0.34z)| lr 5.52e-04 | 4172.08 ms | 32.4% bf16 MFU | 124605 tok/s step 7979/19560 | loss 3.479310 (+0.86z)| norm 0.2234 (-1.12z)| lr 5.52e-04 | 4161.61 ms | 32.4% bf16 MFU | 124674 tok/s step 7980/19560 | loss 3.438396 (-0.22z)| norm 0.2516 (-0.30z)| lr 5.52e-04 | 4176.67 ms | 32.3% bf16 MFU | 124716 tok/s step 7981/19560 | loss 3.415679 (-0.81z)| norm 0.2449 (-0.49z)| lr 5.52e-04 | 4168.09 ms | 32.4% bf16 MFU | 124770 tok/s step 7982/19560 | loss 3.461079 (+0.43z)| norm 0.2413 (-0.59z)| lr 
5.52e-04 | 4160.24 ms | 32.5% bf16 MFU | 124832 tok/s step 7983/19560 | loss 3.428874 (-0.46z)| norm 0.2449 (-0.49z)| lr 5.52e-04 | 4166.38 ms | 32.4% bf16 MFU | 124883 tok/s step 7984/19560 | loss 3.437532 (-0.21z)| norm 0.2506 (-0.31z)| lr 5.52e-04 | 4167.37 ms | 32.4% bf16 MFU | 124929 tok/s step 7985/19560 | loss 3.454380 (+0.25z)| norm 0.2720 (+0.30z)| lr 5.52e-04 | 4159.74 ms | 32.5% bf16 MFU | 124984 tok/s step 7986/19560 | loss 3.414401 (-0.84z)| norm 0.2690 (+0.21z)| lr 5.52e-04 | 4173.01 ms | 32.4% bf16 MFU | 125017 tok/s step 7987/19560 | loss 3.419162 (-0.70z)| norm 0.2517 (-0.29z)| lr 5.51e-04 | 4172.97 ms | 32.4% bf16 MFU | 125048 tok/s step 7988/19560 | loss 3.453110 (+0.24z)| norm 0.2901 (+0.82z)| lr 5.51e-04 | 4189.10 ms | 32.2% bf16 MFU | 125054 tok/s step 7989/19560 | loss 3.383713 (-1.67z)| norm 0.2688 (+0.19z)| lr 5.51e-04 | 4181.03 ms | 32.3% bf16 MFU | 125071 tok/s step 7990/19560 | loss 3.388520 (-1.51z)| norm 0.2478 (-0.41z)| lr 5.51e-04 | 4165.79 ms | 32.4% bf16 MFU | 125110 tok/s step 7991/19560 | loss 3.424664 (-0.51z)| norm 0.2674 (+0.15z)| lr 5.51e-04 | 4162.11 ms | 32.4% bf16 MFU | 125153 tok/s step 7992/19560 | loss 3.439623 (-0.09z)| norm 0.2342 (-0.81z)| lr 5.51e-04 | 4173.42 ms | 32.4% bf16 MFU | 125176 tok/s step 7993/19560 | loss 3.433815 (-0.25z)| norm 0.2351 (-0.78z)| lr 5.51e-04 | 4170.63 ms | 32.4% bf16 MFU | 125203 tok/s step 7994/19560 | loss 3.464645 (+0.60z)| norm 0.2363 (-0.75z)| lr 5.51e-04 | 4170.52 ms | 32.4% bf16 MFU | 125229 tok/s step 7995/19560 | loss 3.414447 (-0.79z)| norm 0.2683 (+0.17z)| lr 5.51e-04 | 4164.31 ms | 32.4% bf16 MFU | 125262 tok/s step 7996/19560 | loss 3.396926 (-1.28z)| norm 0.2635 (+0.02z)| lr 5.51e-04 | 4174.92 ms | 32.3% bf16 MFU | 125278 tok/s step 7997/19560 | loss 3.417045 (-0.72z)| norm 0.2696 (+0.20z)| lr 5.51e-04 | 4166.44 ms | 32.4% bf16 MFU | 125306 tok/s step 7998/19560 | loss 3.431597 (-0.31z)| norm 0.2652 (+0.06z)| lr 5.51e-04 | 4172.44 ms | 32.4% bf16 MFU | 125323 tok/s step 
7999/19560 | loss 3.441014 (-0.06z)| norm 0.2553 (-0.24z)| lr 5.51e-04 | 4183.47 ms | 32.3% bf16 MFU | 125323 tok/s step 8000/19560 | loss 3.443932 (+0.02z)| norm 0.2694 (+0.17z)| lr 5.51e-04 | 4162.22 ms | 32.4% bf16 MFU | 125355 tok/s val loss 3.432278 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating 
HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 
evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2873/10042 = 0.286098 step 8001/19560 | loss 3.440153 (-0.09z)| norm 0.3149 (+1.48z)| lr 5.51e-04 | 4626.60 ms | 29.2% bf16 MFU | 124754 tok/s step 8002/19560 | loss 3.468331 (+0.69z)| norm 0.3009 (+1.06z)| lr 5.51e-04 | 4319.59 ms | 31.3% bf16 MFU | 124585 tok/s step 8003/19560 | loss 3.478934 (+0.98z)| norm 0.2853 (+0.59z)| lr 5.50e-04 | 4256.86 ms | 31.7% bf16 MFU | 124514 tok/s step 8004/19560 | loss 3.413050 (-0.85z)| norm 0.3274 (+1.78z)| lr 5.50e-04 | 4169.25 ms | 32.4% bf16 MFU | 124576 tok/s step 8005/19560 | loss 3.398211 (-1.24z)| norm 0.2763 (+0.30z)| lr 5.50e-04 | 4251.25 ms | 31.8% bf16 MFU | 124513 tok/s step 8006/19560 | loss 3.429173 (-0.38z)| norm 0.3053 (+1.12z)| lr 5.50e-04 | 4454.29 ms | 30.3% bf16 MFU | 124173 tok/s step 8007/19560 | loss 3.434933 (-0.21z)| norm 0.2969 (+0.87z)| lr 5.50e-04 | 4163.99 ms | 32.4% bf16 MFU | 124259 tok/s step 8008/19560 | loss 3.572289 (+3.41z)| norm 0.2942 (+0.78z)| lr 5.50e-04 | 4163.49 ms | 32.4% bf16 MFU | 124343 tok/s step 8009/19560 | loss 3.416808 (-0.70z)| norm 0.3228 (+1.57z)| lr 5.50e-04 | 4324.91 ms | 31.2% bf16 MFU | 124187 tok/s step 8010/19560 | loss 3.449762 (+0.16z)| norm 0.2981 (+0.85z)| lr 5.50e-04 | 4168.76 ms | 32.4% bf16 MFU | 124266 tok/s step 8011/19560 | loss 3.444295 (+0.02z)| norm 0.2883 (+0.57z)| lr 5.50e-04 | 4282.02 ms | 31.5% bf16 MFU | 124174 tok/s step 8012/19560 | loss 3.491711 (+1.29z)| norm 0.2660 (-0.07z)| lr 5.50e-04 | 4344.38 ms | 31.1% bf16 MFU | 124000 tok/s step 8013/19560 | loss 3.455935 (+0.33z)| norm 0.2836 (+0.42z)| lr 5.50e-04 | 4299.47 ms | 31.4% bf16 MFU | 123897 tok/s step 8014/19560 | loss 3.462088 (+0.51z)| norm 0.2555 (-0.38z)| lr 5.50e-04 | 4164.63 ms | 32.4% bf16 MFU | 123997 tok/s step 8015/19560 | loss 3.407644 (-0.94z)| norm 0.2644 (-0.13z)| lr 5.50e-04 | 
4160.08 ms | 32.5% bf16 MFU | 124098 tok/s step 8016/19560 | loss 3.418804 (-0.64z)| norm 0.2660 (-0.08z)| lr 5.50e-04 | 4185.12 ms | 32.3% bf16 MFU | 124157 tok/s step 8017/19560 | loss 3.453521 (+0.29z)| norm 0.2584 (-0.29z)| lr 5.50e-04 | 4179.24 ms | 32.3% bf16 MFU | 124222 tok/s step 8018/19560 | loss 3.450475 (+0.21z)| norm 0.2486 (-0.57z)| lr 5.50e-04 | 4164.42 ms | 32.4% bf16 MFU | 124305 tok/s step 8019/19560 | loss 3.376867 (-1.74z)| norm 0.2478 (-0.58z)| lr 5.49e-04 | 4248.06 ms | 31.8% bf16 MFU | 124261 tok/s step 8020/19560 | loss 3.406341 (-0.94z)| norm 0.2594 (-0.24z)| lr 5.49e-04 | 4191.95 ms | 32.2% bf16 MFU | 124302 tok/s step 8021/19560 | loss 3.433645 (-0.21z)| norm 0.2664 (-0.03z)| lr 5.49e-04 | 4166.23 ms | 32.4% bf16 MFU | 124379 tok/s step 8022/19560 | loss 3.416566 (-0.66z)| norm 0.2517 (-0.46z)| lr 5.49e-04 | 4160.13 ms | 32.5% bf16 MFU | 124461 tok/s step 8023/19560 | loss 3.445585 (+0.11z)| norm 0.2573 (-0.29z)| lr 5.49e-04 | 4260.25 ms | 31.7% bf16 MFU | 124391 tok/s step 8024/19560 | loss 3.414843 (-0.70z)| norm 0.2638 (-0.11z)| lr 5.49e-04 | 4163.46 ms | 32.4% bf16 MFU | 124468 tok/s step 8025/19560 | loss 3.396085 (-1.18z)| norm 0.2565 (-0.32z)| lr 5.49e-04 | 4169.67 ms | 32.4% bf16 MFU | 124531 tok/s step 8026/19560 | loss 3.443192 (+0.06z)| norm 0.2612 (-0.18z)| lr 5.49e-04 | 4298.86 ms | 31.4% bf16 MFU | 124403 tok/s step 8027/19560 | loss 3.396156 (-1.20z)| norm 0.2479 (-0.56z)| lr 5.49e-04 | 4202.97 ms | 32.1% bf16 MFU | 124420 tok/s step 8028/19560 | loss 3.400869 (-1.06z)| norm 0.2868 (+0.55z)| lr 5.49e-04 | 4163.65 ms | 32.4% bf16 MFU | 124495 tok/s step 8029/19560 | loss 3.388989 (-1.36z)| norm 0.2885 (+0.59z)| lr 5.49e-04 | 4209.44 ms | 32.1% bf16 MFU | 124498 tok/s step 8030/19560 | loss 3.486276 (+1.21z)| norm 0.2996 (+0.89z)| lr 5.49e-04 | 4162.21 ms | 32.4% bf16 MFU | 124571 tok/s step 8031/19560 | loss 3.460674 (+0.53z)| norm 0.2609 (-0.23z)| lr 5.49e-04 | 4170.74 ms | 32.4% bf16 MFU | 124628 tok/s step 8032/19560 | 
loss 3.475996 (+0.93z)| norm 0.2746 (+0.16z)| lr 5.49e-04 | 4230.19 ms | 31.9% bf16 MFU | 124593 tok/s step 8033/19560 | loss 3.442205 (+0.04z)| norm 0.2597 (-0.27z)| lr 5.49e-04 | 4185.06 ms | 32.3% bf16 MFU | 124627 tok/s step 8034/19560 | loss 3.467686 (+0.71z)| norm 0.2510 (-0.53z)| lr 5.48e-04 | 4194.05 ms | 32.2% bf16 MFU | 124646 tok/s step 8035/19560 | loss 3.435439 (-0.14z)| norm 0.2571 (-0.35z)| lr 5.48e-04 | 4162.28 ms | 32.4% bf16 MFU | 124712 tok/s step 8036/19560 | loss 3.485807 (+1.25z)| norm 0.2405 (-0.83z)| lr 5.48e-04 | 4161.33 ms | 32.4% bf16 MFU | 124776 tok/s step 8037/19560 | loss 3.510790 (+1.89z)| norm 0.2824 (+0.38z)| lr 5.48e-04 | 4168.11 ms | 32.4% bf16 MFU | 124827 tok/s step 8038/19560 | loss 3.492652 (+1.39z)| norm 0.2488 (-0.59z)| lr 5.48e-04 | 4166.52 ms | 32.4% bf16 MFU | 124877 tok/s step 8039/19560 | loss 3.442896 (+0.06z)| norm 0.2489 (-0.60z)| lr 5.48e-04 | 4295.22 ms | 31.4% bf16 MFU | 124736 tok/s step 8040/19560 | loss 3.419016 (-0.58z)| norm 0.2496 (-0.57z)| lr 5.48e-04 | 4155.00 ms | 32.5% bf16 MFU | 124809 tok/s step 8041/19560 | loss 3.426524 (-0.37z)| norm 0.2660 (-0.09z)| lr 5.48e-04 | 4161.53 ms | 32.4% bf16 MFU | 124867 tok/s step 8042/19560 | loss 3.559562 (+3.12z)| norm 0.2671 (-0.06z)| lr 5.48e-04 | 4296.77 ms | 31.4% bf16 MFU | 124725 tok/s step 8043/19560 | loss 3.397303 (-1.13z)| norm 0.2998 (+0.89z)| lr 5.48e-04 | 4163.60 ms | 32.4% bf16 MFU | 124785 tok/s step 8044/19560 | loss 3.447737 (+0.20z)| norm 0.2558 (-0.39z)| lr 5.48e-04 | 4169.28 ms | 32.4% bf16 MFU | 124833 tok/s step 8045/19560 | loss 3.471480 (+0.82z)| norm 0.2707 (+0.05z)| lr 5.48e-04 | 4287.95 ms | 31.5% bf16 MFU | 124705 tok/s step 8046/19560 | loss 3.435493 (-0.12z)| norm 0.2676 (-0.04z)| lr 5.48e-04 | 4165.81 ms | 32.4% bf16 MFU | 124762 tok/s step 8047/19560 | loss 3.427708 (-0.32z)| norm 0.2597 (-0.27z)| lr 5.48e-04 | 4197.58 ms | 32.2% bf16 MFU | 124769 tok/s step 8048/19560 | loss 3.418778 (-0.55z)| norm 0.2608 (-0.23z)| lr 5.48e-04 | 
4160.34 ms | 32.5% bf16 MFU | 124832 tok/s step 8049/19560 | loss 3.421262 (-0.47z)| norm 0.2441 (-0.72z)| lr 5.48e-04 | 4164.00 ms | 32.4% bf16 MFU | 124886 tok/s step 8050/19560 | loss 3.391653 (-1.25z)| norm 0.2525 (-0.48z)| lr 5.47e-04 | 4163.48 ms | 32.4% bf16 MFU | 124938 tok/s step 8051/19560 | loss 3.442854 (+0.11z)| norm 0.2256 (-1.24z)| lr 5.47e-04 | 4368.10 ms | 30.9% bf16 MFU | 124692 tok/s step 8052/19560 | loss 3.386135 (-1.38z)| norm 0.2776 (+0.26z)| lr 5.47e-04 | 4160.87 ms | 32.4% bf16 MFU | 124758 tok/s step 8053/19560 | loss 3.410367 (-0.73z)| norm 0.2937 (+0.72z)| lr 5.47e-04 | 4162.12 ms | 32.4% bf16 MFU | 124818 tok/s step 8054/19560 | loss 3.457118 (+0.50z)| norm 0.2486 (-0.58z)| lr 5.47e-04 | 4163.57 ms | 32.4% bf16 MFU | 124874 tok/s step 8055/19560 | loss 3.407371 (-0.80z)| norm 0.2526 (-0.46z)| lr 5.47e-04 | 4334.64 ms | 31.1% bf16 MFU | 124677 tok/s step 8056/19560 | loss 3.432694 (-0.16z)| norm 0.2984 (+0.85z)| lr 5.47e-04 | 4299.34 ms | 31.4% bf16 MFU | 124541 tok/s step 8057/19560 | loss 3.419482 (-0.49z)| norm 0.2466 (-0.64z)| lr 5.47e-04 | 4159.31 ms | 32.5% bf16 MFU | 124616 tok/s step 8058/19560 | loss 3.420885 (-0.45z)| norm 0.2567 (-0.34z)| lr 5.47e-04 | 4196.29 ms | 32.2% bf16 MFU | 124633 tok/s step 8059/19560 | loss 3.480015 (+1.16z)| norm 0.2687 (-0.00z)| lr 5.47e-04 | 4158.66 ms | 32.5% bf16 MFU | 124705 tok/s step 8060/19560 | loss 3.415795 (-0.58z)| norm 0.2555 (-0.37z)| lr 5.47e-04 | 4160.96 ms | 32.4% bf16 MFU | 124769 tok/s step 8061/19560 | loss 3.462549 (+0.69z)| norm 0.2687 (+0.00z)| lr 5.47e-04 | 4232.46 ms | 31.9% bf16 MFU | 124725 tok/s step 8062/19560 | loss 3.441050 (+0.11z)| norm 0.2920 (+0.67z)| lr 5.47e-04 | 4214.68 ms | 32.0% bf16 MFU | 124708 tok/s step 8063/19560 | loss 3.490780 (+1.45z)| norm 0.2611 (-0.22z)| lr 5.47e-04 | 4185.25 ms | 32.3% bf16 MFU | 124736 tok/s step 8064/19560 | loss 3.405733 (-0.87z)| norm 0.2947 (+0.74z)| lr 5.47e-04 | 4158.18 ms | 32.5% bf16 MFU | 124804 tok/s step 8065/19560 | 
loss 3.378277 (-1.64z)| norm 0.2781 (+0.26z)| lr 5.47e-04 | 4218.30 ms | 32.0% bf16 MFU | 124778 tok/s step 8066/19560 | loss 3.510587 (+1.97z)| norm 0.2626 (-0.19z)| lr 5.46e-04 | 4166.49 ms | 32.4% bf16 MFU | 124831 tok/s step 8067/19560 | loss 3.401044 (-0.99z)| norm 0.3061 (+1.05z)| lr 5.46e-04 | 4165.60 ms | 32.4% bf16 MFU | 124882 tok/s step 8068/19560 | loss 3.451805 (+0.37z)| norm 0.2557 (-0.40z)| lr 5.46e-04 | 4225.79 ms | 32.0% bf16 MFU | 124842 tok/s step 8069/19560 | loss 3.452941 (+0.39z)| norm 0.2824 (+0.36z)| lr 5.46e-04 | 4157.30 ms | 32.5% bf16 MFU | 124905 tok/s step 8070/19560 | loss 3.444025 (+0.14z)| norm 0.2424 (-0.78z)| lr 5.46e-04 | 4282.43 ms | 31.5% bf16 MFU | 124781 tok/s step 8071/19560 | loss 3.423334 (-0.42z)| norm 0.2684 (-0.03z)| lr 5.46e-04 | 4211.47 ms | 32.1% bf16 MFU | 124767 tok/s step 8072/19560 | loss 3.511368 (+1.96z)| norm 0.2777 (+0.23z)| lr 5.46e-04 | 4181.15 ms | 32.3% bf16 MFU | 124798 tok/s step 8073/19560 | loss 3.430099 (-0.24z)| norm 0.2862 (+0.47z)| lr 5.46e-04 | 4206.40 ms | 32.1% bf16 MFU | 124790 tok/s step 8074/19560 | loss 3.446369 (+0.19z)| norm 0.2677 (-0.05z)| lr 5.46e-04 | 4203.51 ms | 32.1% bf16 MFU | 124787 tok/s step 8075/19560 | loss 3.425139 (-0.38z)| norm 0.3195 (+1.41z)| lr 5.46e-04 | 4164.19 ms | 32.4% bf16 MFU | 124843 tok/s step 8076/19560 | loss 3.420002 (-0.51z)| norm 0.2737 (+0.28z)| lr 5.46e-04 | 4165.74 ms | 32.4% bf16 MFU | 124894 tok/s step 8077/19560 | loss 3.455838 (+0.48z)| norm 0.2482 (-0.88z)| lr 5.46e-04 | 4233.35 ms | 31.9% bf16 MFU | 124841 tok/s step 8078/19560 | loss 3.359107 (-2.23z)| norm 0.2774 (+0.53z)| lr 5.46e-04 | 4163.42 ms | 32.4% bf16 MFU | 124896 tok/s step 8079/19560 | loss 3.457302 (+0.58z)| norm 0.2510 (-0.74z)| lr 5.46e-04 | 4157.30 ms | 32.5% bf16 MFU | 124956 tok/s step 8080/19560 | loss 3.476604 (+1.12z)| norm 0.2741 (+0.39z)| lr 5.46e-04 | 4339.26 ms | 31.1% bf16 MFU | 124750 tok/s step 8081/19560 | loss 3.461693 (+0.69z)| norm 0.2360 (-1.44z)| lr 5.46e-04 | 
4160.32 ms | 32.5% bf16 MFU | 124813 tok/s step 8082/19560 | loss 3.478041 (+1.14z)| norm 0.2889 (+1.13z)| lr 5.45e-04 | 4159.96 ms | 32.5% bf16 MFU | 124874 tok/s step 8083/19560 | loss 3.442309 (+0.12z)| norm 0.2639 (-0.09z)| lr 5.45e-04 | 4158.68 ms | 32.5% bf16 MFU | 124934 tok/s step 8084/19560 | loss 3.521941 (+2.32z)| norm 0.2568 (-0.43z)| lr 5.45e-04 | 4271.00 ms | 31.6% bf16 MFU | 124825 tok/s step 8085/19560 | loss 3.420031 (-0.50z)| norm 0.2566 (-0.43z)| lr 5.45e-04 | 4161.31 ms | 32.4% bf16 MFU | 124883 tok/s step 8086/19560 | loss 3.436165 (-0.05z)| norm 0.2613 (-0.20z)| lr 5.45e-04 | 4223.82 ms | 32.0% bf16 MFU | 124846 tok/s step 8087/19560 | loss 3.404099 (-0.93z)| norm 0.2261 (-1.87z)| lr 5.45e-04 | 4161.23 ms | 32.4% bf16 MFU | 124903 tok/s step 8088/19560 | loss 3.478604 (+1.14z)| norm 0.2487 (-0.77z)| lr 5.45e-04 | 4248.68 ms | 31.8% bf16 MFU | 124828 tok/s step 8089/19560 | loss 3.379178 (-1.61z)| norm 0.2387 (-1.24z)| lr 5.45e-04 | 4158.75 ms | 32.5% bf16 MFU | 124890 tok/s step 8090/19560 | loss 3.433400 (-0.12z)| norm 0.2358 (-1.36z)| lr 5.45e-04 | 4157.58 ms | 32.5% bf16 MFU | 124951 tok/s step 8091/19560 | loss 3.496527 (+1.61z)| norm 0.2337 (-1.47z)| lr 5.45e-04 | 4615.46 ms | 29.3% bf16 MFU | 124383 tok/s step 8092/19560 | loss 3.457532 (+0.53z)| norm 0.2483 (-0.77z)| lr 5.45e-04 | 4162.60 ms | 32.4% bf16 MFU | 124461 tok/s step 8093/19560 | loss 3.403965 (-0.97z)| norm 0.2643 (-0.02z)| lr 5.45e-04 | 4348.65 ms | 31.0% bf16 MFU | 124266 tok/s step 8094/19560 | loss 3.407754 (-0.85z)| norm 0.2485 (-0.77z)| lr 5.45e-04 | 4207.29 ms | 32.1% bf16 MFU | 124284 tok/s step 8095/19560 | loss 3.440629 (+0.06z)| norm 0.2444 (-0.95z)| lr 5.45e-04 | 4261.75 ms | 31.7% bf16 MFU | 124221 tok/s step 8096/19560 | loss 3.430754 (-0.21z)| norm 0.2543 (-0.48z)| lr 5.45e-04 | 4198.18 ms | 32.2% bf16 MFU | 124254 tok/s step 8097/19560 | loss 3.464802 (+0.73z)| norm 0.2308 (-1.57z)| lr 5.45e-04 | 4158.00 ms | 32.5% bf16 MFU | 124346 tok/s step 8098/19560 | 
loss 3.419257 (-0.54z)| norm 0.2552 (-0.42z)| lr 5.44e-04 | 4214.54 ms | 32.0% bf16 MFU | 124348 tok/s step 8099/19560 | loss 3.465242 (+0.73z)| norm 0.2370 (-1.25z)| lr 5.44e-04 | 4157.94 ms | 32.5% bf16 MFU | 124436 tok/s step 8100/19560 | loss 3.450714 (+0.34z)| norm 0.2520 (-0.56z)| lr 5.44e-04 | 4163.26 ms | 32.4% bf16 MFU | 124510 tok/s step 8101/19560 | loss 3.427095 (-0.33z)| norm 0.2471 (-0.79z)| lr 5.44e-04 | 4179.07 ms | 32.3% bf16 MFU | 124558 tok/s step 8102/19560 | loss 3.414323 (-0.68z)| norm 0.2396 (-1.14z)| lr 5.44e-04 | 4161.06 ms | 32.4% bf16 MFU | 124630 tok/s step 8103/19560 | loss 3.437465 (-0.04z)| norm 0.2317 (-1.51z)| lr 5.44e-04 | 4160.96 ms | 32.4% bf16 MFU | 124698 tok/s step 8104/19560 | loss 3.407448 (-0.89z)| norm 0.2336 (-1.40z)| lr 5.44e-04 | 4250.07 ms | 31.8% bf16 MFU | 124631 tok/s step 8105/19560 | loss 3.433491 (-0.16z)| norm 0.2280 (-1.65z)| lr 5.44e-04 | 4261.00 ms | 31.7% bf16 MFU | 124552 tok/s step 8106/19560 | loss 3.416715 (-0.63z)| norm 0.2542 (-0.43z)| lr 5.44e-04 | 4161.17 ms | 32.4% bf16 MFU | 124624 tok/s step 8107/19560 | loss 3.417403 (-0.60z)| norm 0.2702 (+0.31z)| lr 5.44e-04 | 4157.84 ms | 32.5% bf16 MFU | 124698 tok/s step 8108/19560 | loss 3.403990 (-0.97z)| norm 0.2410 (-1.06z)| lr 5.44e-04 | 4242.41 ms | 31.8% bf16 MFU | 124642 tok/s step 8109/19560 | loss 3.435529 (-0.07z)| norm 0.2493 (-0.67z)| lr 5.44e-04 | 4166.49 ms | 32.4% bf16 MFU | 124702 tok/s step 8110/19560 | loss 3.470896 (+0.94z)| norm 0.2770 (+0.62z)| lr 5.44e-04 | 4211.03 ms | 32.1% bf16 MFU | 124692 tok/s step 8111/19560 | loss 3.422584 (-0.45z)| norm 0.2444 (-0.92z)| lr 5.44e-04 | 4167.39 ms | 32.4% bf16 MFU | 124748 tok/s step 8112/19560 | loss 3.459024 (+0.59z)| norm 0.2648 (+0.04z)| lr 5.44e-04 | 4174.37 ms | 32.3% bf16 MFU | 124790 tok/s step 8113/19560 | loss 3.390570 (-1.34z)| norm 0.2545 (-0.44z)| lr 5.44e-04 | 4173.04 ms | 32.4% bf16 MFU | 124832 tok/s step 8114/19560 | loss 3.447735 (+0.28z)| norm 0.2671 (+0.15z)| lr 5.43e-04 | 
4165.24 ms | 32.4% bf16 MFU | 124884 tok/s step 8115/19560 | loss 3.498571 (+1.69z)| norm 0.2691 (+0.24z)| lr 5.43e-04 | 4789.58 ms | 28.2% bf16 MFU | 124113 tok/s step 8116/19560 | loss 3.473990 (+0.99z)| norm 0.2576 (-0.30z)| lr 5.43e-04 | 4510.04 ms | 29.9% bf16 MFU | 123720 tok/s step 8117/19560 | loss 3.474342 (+0.98z)| norm 0.3105 (+2.17z)| lr 5.43e-04 | 4439.57 ms | 30.4% bf16 MFU | 123439 tok/s step 8118/19560 | loss 3.543729 (+2.84z)| norm 0.2338 (-1.41z)| lr 5.43e-04 | 4559.92 ms | 29.6% bf16 MFU | 123016 tok/s step 8119/19560 | loss 3.358238 (-2.22z)| norm 0.2572 (-0.31z)| lr 5.43e-04 | 4433.10 ms | 30.5% bf16 MFU | 122778 tok/s step 8120/19560 | loss 3.464270 (+0.64z)| norm 0.2622 (-0.09z)| lr 5.43e-04 | 4228.35 ms | 31.9% bf16 MFU | 122839 tok/s step 8121/19560 | loss 3.466980 (+0.71z)| norm 0.2389 (-1.19z)| lr 5.43e-04 | 4247.55 ms | 31.8% bf16 MFU | 122869 tok/s step 8122/19560 | loss 3.492644 (+1.39z)| norm 0.2794 (+0.70z)| lr 5.43e-04 | 4313.33 ms | 31.3% bf16 MFU | 122803 tok/s step 8123/19560 | loss 3.475878 (+0.92z)| norm 0.2834 (+0.88z)| lr 5.43e-04 | 5098.49 ms | 26.5% bf16 MFU | 121804 tok/s step 8124/19560 | loss 3.485460 (+1.16z)| norm 0.2459 (-0.87z)| lr 5.43e-04 | 4200.19 ms | 32.1% bf16 MFU | 121955 tok/s step 8125/19560 | loss 3.442692 (+0.01z)| norm 0.2843 (+0.92z)| lr 5.43e-04 | 4189.42 ms | 32.2% bf16 MFU | 122115 tok/s step 8126/19560 | loss 3.489938 (+1.26z)| norm 0.2641 (-0.02z)| lr 5.43e-04 | 4188.43 ms | 32.2% bf16 MFU | 122268 tok/s step 8127/19560 | loss 3.448886 (+0.16z)| norm 0.2444 (-0.94z)| lr 5.43e-04 | 4306.45 ms | 31.4% bf16 MFU | 122242 tok/s step 8128/19560 | loss 3.414153 (-0.76z)| norm 0.2362 (-1.30z)| lr 5.43e-04 | 4216.72 ms | 32.0% bf16 MFU | 122346 tok/s step 8129/19560 | loss 3.465104 (+0.59z)| norm 0.2339 (-1.39z)| lr 5.43e-04 | 4225.32 ms | 32.0% bf16 MFU | 122433 tok/s step 8130/19560 | loss 3.593337 (+3.76z)| norm 0.2375 (-1.21z)| lr 5.42e-04 | 4177.22 ms | 32.3% bf16 MFU | 122587 tok/s step 8131/19560 | 
loss 3.386277 (-1.42z)| norm 0.2770 (+0.66z)| lr 5.42e-04 | 4249.25 ms | 31.8% bf16 MFU | 122627 tok/s step 8132/19560 | loss 3.453070 (+0.24z)| norm 0.2505 (-0.59z)| lr 5.42e-04 | 4211.67 ms | 32.1% bf16 MFU | 122720 tok/s step 8133/19560 | loss 3.461652 (+0.45z)| norm 0.2558 (-0.32z)| lr 5.42e-04 | 4176.85 ms | 32.3% bf16 MFU | 122860 tok/s step 8134/19560 | loss 3.453808 (+0.25z)| norm 0.2825 (+1.02z)| lr 5.42e-04 | 4247.79 ms | 31.8% bf16 MFU | 122888 tok/s step 8135/19560 | loss 3.444977 (+0.02z)| norm 0.2835 (+1.08z)| lr 5.42e-04 | 4204.51 ms | 32.1% bf16 MFU | 122979 tok/s step 8136/19560 | loss 3.459525 (+0.43z)| norm 0.2609 (-0.04z)| lr 5.42e-04 | 4274.91 ms | 31.6% bf16 MFU | 122962 tok/s step 8137/19560 | loss 3.436124 (-0.19z)| norm 0.2815 (+1.06z)| lr 5.42e-04 | 4303.76 ms | 31.4% bf16 MFU | 122905 tok/s step 8138/19560 | loss 3.409186 (-0.89z)| norm 0.2683 (+0.38z)| lr 5.42e-04 | 4209.48 ms | 32.1% bf16 MFU | 122987 tok/s step 8139/19560 | loss 3.436550 (-0.17z)| norm 0.2542 (-0.36z)| lr 5.42e-04 | 4262.26 ms | 31.7% bf16 MFU | 122988 tok/s step 8140/19560 | loss 3.467538 (+0.65z)| norm 0.2813 (+1.09z)| lr 5.42e-04 | 4188.11 ms | 32.2% bf16 MFU | 123098 tok/s step 8141/19560 | loss 3.437056 (-0.15z)| norm 0.2748 (+0.75z)| lr 5.42e-04 | 4242.83 ms | 31.8% bf16 MFU | 123121 tok/s step 8142/19560 | loss 3.448697 (+0.16z)| norm 0.2599 (-0.06z)| lr 5.42e-04 | 4171.62 ms | 32.4% bf16 MFU | 123249 tok/s step 8143/19560 | loss 3.403666 (-1.02z)| norm 0.3111 (+2.61z)| lr 5.42e-04 | 4195.91 ms | 32.2% bf16 MFU | 123335 tok/s step 8144/19560 | loss 3.449439 (+0.18z)| norm 0.2797 (+0.95z)| lr 5.42e-04 | 4241.45 ms | 31.8% bf16 MFU | 123348 tok/s step 8145/19560 | loss 3.457678 (+0.39z)| norm 0.3315 (+3.45z)| lr 5.41e-04 | 4239.46 ms | 31.8% bf16 MFU | 123364 tok/s step 8146/19560 | loss 3.430869 (-0.31z)| norm 0.3083 (+2.23z)| lr 5.41e-04 | 4220.75 ms | 32.0% bf16 MFU | 123407 tok/s step 8147/19560 | loss 3.486244 (+1.14z)| norm 0.2756 (+0.63z)| lr 5.41e-04 | 
4196.76 ms | 32.2% bf16 MFU | 123483 tok/s step 8148/19560 | loss 3.487576 (+1.15z)| norm 0.2655 (+0.14z)| lr 5.41e-04 | 4212.32 ms | 32.1% bf16 MFU | 123532 tok/s step 8149/19560 | loss 3.480862 (+0.96z)| norm 0.2498 (-0.63z)| lr 5.41e-04 | 4172.98 ms | 32.4% bf16 MFU | 123637 tok/s step 8150/19560 | loss 3.424459 (-0.53z)| norm 0.2866 (+1.15z)| lr 5.41e-04 | 4182.07 ms | 32.3% bf16 MFU | 123724 tok/s step 8151/19560 | loss 3.417071 (-0.72z)| norm 0.2421 (-1.00z)| lr 5.41e-04 | 4182.00 ms | 32.3% bf16 MFU | 123806 tok/s step 8152/19560 | loss 3.516047 (+1.85z)| norm 0.3082 (+2.14z)| lr 5.41e-04 | 4184.42 ms | 32.3% bf16 MFU | 123880 tok/s step 8153/19560 | loss 3.472748 (+0.71z)| norm 0.2569 (-0.30z)| lr 5.41e-04 | 4178.92 ms | 32.3% bf16 MFU | 123959 tok/s step 8154/19560 | loss 3.474052 (+0.74z)| norm 0.2832 (+0.94z)| lr 5.41e-04 | 4179.97 ms | 32.3% bf16 MFU | 124033 tok/s step 8155/19560 | loss 3.484423 (+0.99z)| norm 0.2794 (+0.75z)| lr 5.41e-04 | 4192.18 ms | 32.2% bf16 MFU | 124084 tok/s step 8156/19560 | loss 3.498653 (+1.34z)| norm 0.2819 (+0.87z)| lr 5.41e-04 | 4175.21 ms | 32.3% bf16 MFU | 124159 tok/s step 8157/19560 | loss 3.513709 (+1.71z)| norm 0.2662 (+0.14z)| lr 5.41e-04 | 4173.76 ms | 32.3% bf16 MFU | 124232 tok/s step 8158/19560 | loss 3.468848 (+0.54z)| norm 0.2778 (+0.70z)| lr 5.41e-04 | 4171.23 ms | 32.4% bf16 MFU | 124305 tok/s step 8159/19560 | loss 3.405012 (-1.12z)| norm 0.2631 (-0.00z)| lr 5.41e-04 | 4182.58 ms | 32.3% bf16 MFU | 124357 tok/s step 8160/19560 | loss 3.482338 (+0.90z)| norm 0.3244 (+2.84z)| lr 5.41e-04 | 4214.72 ms | 32.0% bf16 MFU | 124359 tok/s step 8161/19560 | loss 3.490903 (+1.11z)| norm 0.2803 (+0.77z)| lr 5.40e-04 | 4177.46 ms | 32.3% bf16 MFU | 124416 tok/s step 8162/19560 | loss 3.526515 (+1.99z)| norm 0.2864 (+1.04z)| lr 5.40e-04 | 4174.11 ms | 32.3% bf16 MFU | 124475 tok/s step 8163/19560 | loss 3.510881 (+1.56z)| norm 0.2874 (+1.07z)| lr 5.40e-04 | 4187.80 ms | 32.2% bf16 MFU | 124511 tok/s step 8164/19560 | 
loss 3.470268 (+0.54z)| norm 0.2937 (+1.34z)| lr 5.40e-04 | 4177.37 ms | 32.3% bf16 MFU | 124561 tok/s step 8165/19560 | loss 3.462204 (+0.34z)| norm 0.2477 (-0.77z)| lr 5.40e-04 | 4178.76 ms | 32.3% bf16 MFU | 124606 tok/s step 8166/19560 | loss 3.459537 (+0.28z)| norm 0.2705 (+0.27z)| lr 5.40e-04 | 4183.78 ms | 32.3% bf16 MFU | 124642 tok/s step 8167/19560 | loss 3.468831 (+0.52z)| norm 0.2516 (-0.60z)| lr 5.40e-04 | 4176.30 ms | 32.3% bf16 MFU | 124687 tok/s step 8168/19560 | loss 3.447706 (-0.03z)| norm 0.2689 (+0.19z)| lr 5.40e-04 | 4315.92 ms | 31.3% bf16 MFU | 124526 tok/s step 8169/19560 | loss 3.418744 (-0.78z)| norm 0.2527 (-0.55z)| lr 5.40e-04 | 4178.08 ms | 32.3% bf16 MFU | 124574 tok/s step 8170/19560 | loss 3.457115 (+0.24z)| norm 0.2822 (+0.80z)| lr 5.40e-04 | 4176.56 ms | 32.3% bf16 MFU | 124622 tok/s step 8171/19560 | loss 3.488318 (+1.05z)| norm 0.2577 (-0.31z)| lr 5.40e-04 | 4194.90 ms | 32.2% bf16 MFU | 124640 tok/s step 8172/19560 | loss 3.526349 (+2.02z)| norm 0.3217 (+2.58z)| lr 5.40e-04 | 4179.54 ms | 32.3% bf16 MFU | 124680 tok/s step 8173/19560 | loss 3.538909 (+2.29z)| norm 0.3343 (+3.01z)| lr 5.40e-04 | 4214.73 ms | 32.0% bf16 MFU | 124666 tok/s step 8174/19560 | loss 3.521436 (+1.80z)| norm 0.2552 (-0.44z)| lr 5.40e-04 | 4180.23 ms | 32.3% bf16 MFU | 124704 tok/s step 8175/19560 | loss 3.424513 (-0.66z)| norm 0.2839 (+0.80z)| lr 5.40e-04 | 4181.41 ms | 32.3% bf16 MFU | 124738 tok/s step 8176/19560 | loss 3.491771 (+1.03z)| norm 0.2516 (-0.60z)| lr 5.40e-04 | 4168.97 ms | 32.4% bf16 MFU | 124789 tok/s step 8177/19560 | loss 3.439185 (-0.31z)| norm 0.2752 (+0.41z)| lr 5.39e-04 | 4177.93 ms | 32.3% bf16 MFU | 124824 tok/s step 8178/19560 | loss 3.479979 (+0.72z)| norm 0.2345 (-1.35z)| lr 5.39e-04 | 4174.25 ms | 32.3% bf16 MFU | 124863 tok/s step 8179/19560 | loss 3.444993 (-0.18z)| norm 0.2858 (+0.87z)| lr 5.39e-04 | 4175.28 ms | 32.3% bf16 MFU | 124898 tok/s step 8180/19560 | loss 3.493208 (+1.04z)| norm 0.2486 (-0.75z)| lr 5.39e-04 | 
4166.94 ms | 32.4% bf16 MFU | 124944 tok/s step 8181/19560 | loss 3.408385 (-1.14z)| norm 0.2623 (-0.14z)| lr 5.39e-04 | 4180.37 ms | 32.3% bf16 MFU | 124968 tok/s step 8182/19560 | loss 3.395617 (-1.45z)| norm 0.2479 (-0.78z)| lr 5.39e-04 | 4179.53 ms | 32.3% bf16 MFU | 124991 tok/s step 8183/19560 | loss 3.505352 (+1.33z)| norm 0.2314 (-1.49z)| lr 5.39e-04 | 4222.57 ms | 32.0% bf16 MFU | 124950 tok/s step 8184/19560 | loss 3.514894 (+1.55z)| norm 0.2567 (-0.37z)| lr 5.39e-04 | 4195.52 ms | 32.2% bf16 MFU | 124951 tok/s step 8185/19560 | loss 3.462835 (+0.22z)| norm 0.2657 (+0.02z)| lr 5.39e-04 | 4307.09 ms | 31.3% bf16 MFU | 124790 tok/s step 8186/19560 | loss 3.432314 (-0.55z)| norm 0.2499 (-0.67z)| lr 5.39e-04 | 4183.20 ms | 32.3% bf16 MFU | 124817 tok/s step 8187/19560 | loss 3.408458 (-1.14z)| norm 0.2524 (-0.56z)| lr 5.39e-04 | 4174.58 ms | 32.3% bf16 MFU | 124855 tok/s step 8188/19560 | loss 3.530365 (+1.90z)| norm 0.2331 (-1.39z)| lr 5.39e-04 | 4180.10 ms | 32.3% bf16 MFU | 124884 tok/s step 8189/19560 | loss 3.487257 (+0.81z)| norm 0.2458 (-0.82z)| lr 5.39e-04 | 4171.48 ms | 32.4% bf16 MFU | 124924 tok/s step 8190/19560 | loss 3.459838 (+0.12z)| norm 0.2234 (-1.77z)| lr 5.39e-04 | 4199.31 ms | 32.2% bf16 MFU | 124920 tok/s step 8191/19560 | loss 3.537774 (+2.03z)| norm 0.2397 (-1.05z)| lr 5.39e-04 | 4181.94 ms | 32.3% bf16 MFU | 124943 tok/s step 8192/19560 | loss 3.389267 (-1.61z)| norm 0.2241 (-1.69z)| lr 5.39e-04 | 4163.30 ms | 32.4% bf16 MFU | 124992 tok/s step 8193/19560 | loss 3.460981 (+0.13z)| norm 0.2757 (+0.53z)| lr 5.38e-04 | 4167.24 ms | 32.4% bf16 MFU | 125033 tok/s step 8194/19560 | loss 3.458925 (+0.09z)| norm 0.2348 (-1.21z)| lr 5.38e-04 | 4180.55 ms | 32.3% bf16 MFU | 125052 tok/s step 8195/19560 | loss 3.435953 (-0.50z)| norm 0.2676 (+0.20z)| lr 5.38e-04 | 4167.40 ms | 32.4% bf16 MFU | 125090 tok/s step 8196/19560 | loss 3.433668 (-0.55z)| norm 0.2576 (-0.23z)| lr 5.38e-04 | 4175.93 ms | 32.3% bf16 MFU | 125113 tok/s step 8197/19560 | 
loss 3.412541 (-1.07z)| norm 0.2578 (-0.21z)| lr 5.38e-04 | 4209.28 ms | 32.1% bf16 MFU | 125085 tok/s step 8198/19560 | loss 3.380535 (-1.84z)| norm 0.2546 (-0.36z)| lr 5.38e-04 | 4197.78 ms | 32.2% bf16 MFU | 125075 tok/s step 8199/19560 | loss 3.437364 (-0.43z)| norm 0.2355 (-1.17z)| lr 5.38e-04 | 4176.98 ms | 32.3% bf16 MFU | 125098 tok/s step 8200/19560 | loss 3.558950 (+2.53z)| norm 0.2523 (-0.43z)| lr 5.38e-04 | 4179.98 ms | 32.3% bf16 MFU | 125114 tok/s step 8201/19560 | loss 3.464362 (+0.22z)| norm 0.2378 (-1.05z)| lr 5.38e-04 | 4173.33 ms | 32.4% bf16 MFU | 125140 tok/s step 8202/19560 | loss 3.422905 (-0.79z)| norm 0.2458 (-0.69z)| lr 5.38e-04 | 4188.97 ms | 32.2% bf16 MFU | 125141 tok/s step 8203/19560 | loss 3.450097 (-0.13z)| norm 0.2325 (-1.26z)| lr 5.38e-04 | 4171.83 ms | 32.4% bf16 MFU | 125167 tok/s step 8204/19560 | loss 3.548877 (+2.22z)| norm 0.2370 (-1.05z)| lr 5.38e-04 | 4186.02 ms | 32.3% bf16 MFU | 125171 tok/s step 8205/19560 | loss 3.619648 (+3.67z)| norm 0.2498 (-0.48z)| lr 5.38e-04 | 4182.89 ms | 32.3% bf16 MFU | 125180 tok/s step 8206/19560 | loss 3.469687 (+0.26z)| norm 0.2420 (-0.82z)| lr 5.38e-04 | 4187.62 ms | 32.2% bf16 MFU | 125181 tok/s step 8207/19560 | loss 3.407353 (-1.17z)| norm 0.2255 (-1.52z)| lr 5.38e-04 | 4192.53 ms | 32.2% bf16 MFU | 125174 tok/s step 8208/19560 | loss 3.435583 (-0.51z)| norm 0.2576 (-0.12z)| lr 5.37e-04 | 4178.73 ms | 32.3% bf16 MFU | 125189 tok/s step 8209/19560 | loss 3.510269 (+1.19z)| norm 0.2426 (-0.77z)| lr 5.37e-04 | 4173.80 ms | 32.3% bf16 MFU | 125210 tok/s step 8210/19560 | loss 3.491485 (+0.76z)| norm 0.2688 (+0.38z)| lr 5.37e-04 | 4173.91 ms | 32.3% bf16 MFU | 125230 tok/s step 8211/19560 | loss 3.465441 (+0.16z)| norm 0.2849 (+1.08z)| lr 5.37e-04 | 4173.50 ms | 32.4% bf16 MFU | 125250 tok/s step 8212/19560 | loss 3.502067 (+1.00z)| norm 0.2881 (+1.20z)| lr 5.37e-04 | 4182.72 ms | 32.3% bf16 MFU | 125255 tok/s step 8213/19560 | loss 3.427998 (-0.70z)| norm 0.2584 (-0.09z)| lr 5.37e-04 | 
4176.22 ms | 32.3% bf16 MFU | 125269 tok/s step 8214/19560 | loss 3.461065 (+0.06z)| norm 0.2541 (-0.28z)| lr 5.37e-04 | 4189.09 ms | 32.2% bf16 MFU | 125263 tok/s step 8215/19560 | loss 3.427567 (-0.72z)| norm 0.2640 (+0.14z)| lr 5.37e-04 | 4168.77 ms | 32.4% bf16 MFU | 125288 tok/s step 8216/19560 | loss 3.496725 (+0.87z)| norm 0.2635 (+0.12z)| lr 5.37e-04 | 4179.85 ms | 32.3% bf16 MFU | 125296 tok/s step 8217/19560 | loss 3.462494 (+0.07z)| norm 0.2404 (-0.90z)| lr 5.37e-04 | 4185.54 ms | 32.3% bf16 MFU | 125294 tok/s step 8218/19560 | loss 3.462174 (+0.06z)| norm 0.2522 (-0.39z)| lr 5.37e-04 | 4256.46 ms | 31.7% bf16 MFU | 125188 tok/s step 8219/19560 | loss 3.439530 (-0.46z)| norm 0.2550 (-0.28z)| lr 5.37e-04 | 4171.41 ms | 32.4% bf16 MFU | 125213 tok/s step 8220/19560 | loss 3.389695 (-1.60z)| norm 0.2468 (-0.64z)| lr 5.37e-04 | 4224.06 ms | 32.0% bf16 MFU | 125158 tok/s step 8221/19560 | loss 3.408117 (-1.18z)| norm 0.2573 (-0.17z)| lr 5.37e-04 | 4172.97 ms | 32.4% bf16 MFU | 125182 tok/s step 8222/19560 | loss 3.397786 (-1.41z)| norm 0.2462 (-0.66z)| lr 5.37e-04 | 4181.55 ms | 32.3% bf16 MFU | 125192 tok/s step 8223/19560 | loss 3.466742 (+0.18z)| norm 0.2798 (+0.82z)| lr 5.37e-04 | 4179.38 ms | 32.3% bf16 MFU | 125205 tok/s step 8224/19560 | loss 3.445405 (-0.32z)| norm 0.2720 (+0.46z)| lr 5.36e-04 | 4175.83 ms | 32.3% bf16 MFU | 125222 tok/s step 8225/19560 | loss 3.505100 (+1.06z)| norm 0.2870 (+1.11z)| lr 5.36e-04 | 4186.08 ms | 32.3% bf16 MFU | 125223 tok/s step 8226/19560 | loss 3.475702 (+0.37z)| norm 0.3079 (+1.99z)| lr 5.36e-04 | 4921.33 ms | 27.4% bf16 MFU | 124289 tok/s step 8227/19560 | loss 3.465765 (+0.14z)| norm 0.2716 (+0.39z)| lr 5.36e-04 | 4183.36 ms | 32.3% bf16 MFU | 124341 tok/s step 8228/19560 | loss 3.534076 (+1.69z)| norm 0.3005 (+1.63z)| lr 5.36e-04 | 4179.65 ms | 32.3% bf16 MFU | 124396 tok/s step 8229/19560 | loss 3.450683 (-0.23z)| norm 0.2665 (+0.14z)| lr 5.36e-04 | 4193.93 ms | 32.2% bf16 MFU | 124427 tok/s step 8230/19560 | 
loss 3.472907 (+0.27z)| norm 0.2767 (+0.58z)| lr 5.36e-04 | 4186.00 ms | 32.3% bf16 MFU | 124468 tok/s step 8231/19560 | loss 3.361701 (-2.24z)| norm 0.2883 (+1.07z)| lr 5.36e-04 | 4170.27 ms | 32.4% bf16 MFU | 124530 tok/s step 8232/19560 | loss 3.458914 (-0.05z)| norm 0.2887 (+1.07z)| lr 5.36e-04 | 4174.42 ms | 32.3% bf16 MFU | 124583 tok/s step 8233/19560 | loss 3.433839 (-0.62z)| norm 0.2790 (+0.63z)| lr 5.36e-04 | 4239.37 ms | 31.8% bf16 MFU | 124538 tok/s step 8234/19560 | loss 3.527910 (+1.50z)| norm 0.2671 (+0.10z)| lr 5.36e-04 | 4166.64 ms | 32.4% bf16 MFU | 124602 tok/s step 8235/19560 | loss 3.465143 (+0.07z)| norm 0.2477 (-0.75z)| lr 5.36e-04 | 4182.76 ms | 32.3% bf16 MFU | 124640 tok/s step 8236/19560 | loss 3.466967 (+0.10z)| norm 0.2547 (-0.45z)| lr 5.36e-04 | 4180.62 ms | 32.3% bf16 MFU | 124678 tok/s step 8237/19560 | loss 3.429581 (-0.76z)| norm 0.2565 (-0.37z)| lr 5.36e-04 | 4172.83 ms | 32.4% bf16 MFU | 124726 tok/s step 8238/19560 | loss 3.471774 (+0.21z)| norm 0.2590 (-0.26z)| lr 5.36e-04 | 4181.99 ms | 32.3% bf16 MFU | 124758 tok/s step 8239/19560 | loss 3.459607 (-0.08z)| norm 0.2529 (-0.53z)| lr 5.36e-04 | 4189.32 ms | 32.2% bf16 MFU | 124778 tok/s step 8240/19560 | loss 3.404891 (-1.31z)| norm 0.2504 (-0.63z)| lr 5.35e-04 | 4174.03 ms | 32.3% bf16 MFU | 124819 tok/s step 8241/19560 | loss 3.456952 (-0.14z)| norm 0.2468 (-0.79z)| lr 5.35e-04 | 4240.48 ms | 31.8% bf16 MFU | 124760 tok/s step 8242/19560 | loss 3.485716 (+0.51z)| norm 0.2499 (-0.65z)| lr 5.35e-04 | 4176.17 ms | 32.3% bf16 MFU | 124799 tok/s step 8243/19560 | loss 3.473006 (+0.23z)| norm 0.2479 (-0.73z)| lr 5.35e-04 | 4165.58 ms | 32.4% bf16 MFU | 124853 tok/s step 8244/19560 | loss 3.465762 (+0.06z)| norm 0.2626 (-0.08z)| lr 5.35e-04 | 4205.24 ms | 32.1% bf16 MFU | 124844 tok/s step 8245/19560 | loss 3.420483 (-0.97z)| norm 0.2451 (-0.85z)| lr 5.35e-04 | 4179.47 ms | 32.3% bf16 MFU | 124874 tok/s step 8246/19560 | loss 3.499378 (+0.86z)| norm 0.2453 (-0.84z)| lr 5.35e-04 | 
4182.18 ms | 32.3% bf16 MFU | 124898 tok/s step 8247/19560 | loss 3.486882 (+0.56z)| norm 0.2613 (-0.12z)| lr 5.35e-04 | 4195.80 ms | 32.2% bf16 MFU | 124901 tok/s step 8248/19560 | loss 3.400370 (-1.47z)| norm 0.2548 (-0.41z)| lr 5.35e-04 | 4178.76 ms | 32.3% bf16 MFU | 124929 tok/s step 8249/19560 | loss 3.454131 (-0.20z)| norm 0.2692 (+0.23z)| lr 5.35e-04 | 4177.06 ms | 32.3% bf16 MFU | 124959 tok/s step 8250/19560 | loss 3.463948 (+0.03z)| norm 0.2609 (-0.14z)| lr 5.35e-04 | 4175.56 ms | 32.3% bf16 MFU | 124989 tok/s val loss 3.425269 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 
evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 
evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2899/10042 = 0.288688 step 8251/19560 | loss 3.442279 (-0.47z)| norm 0.2545 (-0.42z)| lr 5.35e-04 | 4400.96 ms | 30.7% bf16 MFU | 124696 tok/s step 8252/19560 | loss 3.483814 (+0.51z)| norm 0.2353 (-1.29z)| lr 5.35e-04 | 4307.86 ms | 31.3% bf16 MFU | 124546 tok/s step 8253/19560 | loss 3.515003 (+1.23z)| norm 0.2384 (-1.13z)| lr 5.35e-04 | 4242.12 ms | 31.8% bf16 MFU | 124498 tok/s step 8254/19560 | loss 3.470514 (+0.19z)| norm 0.2736 (+0.46z)| lr 5.35e-04 | 4319.96 ms | 31.3% bf16 MFU | 124342 tok/s step 8255/19560 | loss 3.483699 (+0.49z)| norm 0.2569 (-0.30z)| lr 5.35e-04 | 4180.20 ms | 32.3% bf16 MFU | 124396 tok/s step 8256/19560 | loss 3.468005 (+0.11z)| norm 0.2747 (+0.50z)| lr 5.34e-04 | 4174.67 ms | 32.3% bf16 MFU | 124455 tok/s step 8257/19560 | loss 3.514445 (+1.19z)| norm 0.2792 (+0.69z)| lr 5.34e-04 | 4165.82 ms | 32.4% bf16 MFU | 124525 tok/s step 8258/19560 | loss 3.412606 (-1.21z)| norm 0.2740 (+0.44z)| lr 5.34e-04 | 4179.43 ms | 32.3% bf16 MFU | 124571 tok/s step 8259/19560 | loss 3.373478 (-2.14z)| norm 0.2631 (-0.06z)| lr 5.34e-04 | 4184.15 ms | 32.3% bf16 MFU | 124608 tok/s step 8260/19560 | loss 3.451457 (-0.26z)| norm 0.2680 (+0.16z)| lr 5.34e-04 | 4167.03 ms | 32.4% bf16 MFU | 124668 tok/s step 8261/19560 | loss 3.583029 (+2.81z)| norm 0.2645 (+0.00z)| lr 5.34e-04 | 4187.29 ms | 32.2% bf16 MFU | 124695 tok/s step 8262/19560 | loss 3.440347 (-0.53z)| norm 0.2609 (-0.16z)| lr 5.34e-04 | 4188.69 ms | 32.2% bf16 MFU | 124719 tok/s step 8263/19560 | loss 3.526368 
(+1.45z)| norm 0.2588 (-0.25z)| lr 5.34e-04 | 4181.46 ms | 32.3% bf16 MFU | 124752 tok/s step 8264/19560 | loss 3.429870 (-0.78z)| norm 0.2680 (+0.17z)| lr 5.34e-04 | 4226.46 ms | 31.9% bf16 MFU | 124717 tok/s step 8265/19560 | loss 3.453252 (-0.24z)| norm 0.2462 (-0.83z)| lr 5.34e-04 | 4212.19 ms | 32.1% bf16 MFU | 124705 tok/s step 8266/19560 | loss 3.462015 (-0.05z)| norm 0.2480 (-0.73z)| lr 5.34e-04 | 4177.82 ms | 32.3% bf16 MFU | 124744 tok/s step 8267/19560 | loss 3.442006 (-0.51z)| norm 0.2430 (-0.96z)| lr 5.34e-04 | 4171.10 ms | 32.4% bf16 MFU | 124792 tok/s step 8268/19560 | loss 3.520944 (+1.31z)| norm 0.2341 (-1.35z)| lr 5.34e-04 | 4166.23 ms | 32.4% bf16 MFU | 124844 tok/s step 8269/19560 | loss 3.417748 (-1.08z)| norm 0.2422 (-0.96z)| lr 5.34e-04 | 4180.86 ms | 32.3% bf16 MFU | 124872 tok/s step 8270/19560 | loss 3.409631 (-1.25z)| norm 0.2176 (-2.04z)| lr 5.34e-04 | 4176.41 ms | 32.3% bf16 MFU | 124905 tok/s step 8271/19560 | loss 3.467312 (+0.07z)| norm 0.2471 (-0.70z)| lr 5.33e-04 | 4179.88 ms | 32.3% bf16 MFU | 124932 tok/s step 8272/19560 | loss 3.484663 (+0.46z)| norm 0.2508 (-0.52z)| lr 5.33e-04 | 4182.32 ms | 32.3% bf16 MFU | 124953 tok/s step 8273/19560 | loss 3.455955 (-0.20z)| norm 0.2716 (+0.48z)| lr 5.33e-04 | 4177.45 ms | 32.3% bf16 MFU | 124980 tok/s step 8274/19560 | loss 3.384573 (-1.83z)| norm 0.2346 (-1.28z)| lr 5.33e-04 | 4177.08 ms | 32.3% bf16 MFU | 125007 tok/s step 8275/19560 | loss 3.487804 (+0.54z)| norm 0.2638 (+0.14z)| lr 5.33e-04 | 4178.25 ms | 32.3% bf16 MFU | 125031 tok/s step 8276/19560 | loss 3.506426 (+0.96z)| norm 0.2535 (-0.36z)| lr 5.33e-04 | 4182.02 ms | 32.3% bf16 MFU | 125048 tok/s step 8277/19560 | loss 3.471657 (+0.16z)| norm 0.2665 (+0.27z)| lr 5.33e-04 | 4172.03 ms | 32.4% bf16 MFU | 125079 tok/s step 8278/19560 | loss 3.469247 (+0.10z)| norm 0.2433 (-0.85z)| lr 5.33e-04 | 4174.53 ms | 32.3% bf16 MFU | 125104 tok/s step 8279/19560 | loss 3.431865 (-0.76z)| norm 0.2643 (+0.17z)| lr 5.33e-04 | 4194.78 ms | 
32.2% bf16 MFU | 125098 tok/s step 8280/19560 | loss 3.451055 (-0.31z)| norm 0.2706 (+0.51z)| lr 5.33e-04 | 4168.86 ms | 32.4% bf16 MFU | 125132 tok/s step 8281/19560 | loss 3.390865 (-1.67z)| norm 0.2550 (-0.27z)| lr 5.33e-04 | 4179.58 ms | 32.3% bf16 MFU | 125147 tok/s step 8282/19560 | loss 3.464290 (+0.01z)| norm 0.2746 (+0.72z)| lr 5.33e-04 | 4173.26 ms | 32.4% bf16 MFU | 125171 tok/s step 8283/19560 | loss 3.410167 (-1.20z)| norm 0.2456 (-0.73z)| lr 5.33e-04 | 4177.84 ms | 32.3% bf16 MFU | 125187 tok/s step 8284/19560 | loss 3.410525 (-1.18z)| norm 0.2820 (+1.11z)| lr 5.33e-04 | 4168.86 ms | 32.4% bf16 MFU | 125216 tok/s step 8285/19560 | loss 3.414842 (-1.06z)| norm 0.2328 (-1.36z)| lr 5.33e-04 | 4176.91 ms | 32.3% bf16 MFU | 125231 tok/s step 8286/19560 | loss 3.426261 (-0.79z)| norm 0.2403 (-0.97z)| lr 5.33e-04 | 4171.57 ms | 32.4% bf16 MFU | 125254 tok/s step 8287/19560 | loss 3.467514 (+0.13z)| norm 0.2401 (-0.96z)| lr 5.32e-04 | 4184.59 ms | 32.3% bf16 MFU | 125256 tok/s step 8288/19560 | loss 3.511349 (+1.12z)| norm 0.2646 (+0.30z)| lr 5.32e-04 | 4182.09 ms | 32.3% bf16 MFU | 125261 tok/s step 8289/19560 | loss 3.433652 (-0.63z)| norm 0.2360 (-1.17z)| lr 5.32e-04 | 4179.47 ms | 32.3% bf16 MFU | 125270 tok/s step 8290/19560 | loss 3.525475 (+1.45z)| norm 0.2574 (-0.05z)| lr 5.32e-04 | 4204.24 ms | 32.1% bf16 MFU | 125242 tok/s step 8291/19560 | loss 3.555825 (+2.10z)| norm 0.2712 (+0.69z)| lr 5.32e-04 | 4169.33 ms | 32.4% bf16 MFU | 125267 tok/s step 8292/19560 | loss 3.437030 (-0.55z)| norm 0.2688 (+0.58z)| lr 5.32e-04 | 4165.64 ms | 32.4% bf16 MFU | 125297 tok/s step 8293/19560 | loss 3.451472 (-0.23z)| norm 0.2448 (-0.71z)| lr 5.32e-04 | 4172.79 ms | 32.4% bf16 MFU | 125314 tok/s step 8294/19560 | loss 3.401577 (-1.32z)| norm 0.2612 (+0.18z)| lr 5.32e-04 | 4202.43 ms | 32.1% bf16 MFU | 125286 tok/s step 8295/19560 | loss 3.473324 (+0.27z)| norm 0.2650 (+0.38z)| lr 5.32e-04 | 4181.70 ms | 32.3% bf16 MFU | 125291 tok/s step 8296/19560 | loss 3.495360 
(+0.75z)| norm 0.2669 (+0.47z)| lr 5.32e-04 | 4177.35 ms | 32.3% bf16 MFU | 125302 tok/s step 8297/19560 | loss 3.435606 (-0.58z)| norm 0.2987 (+2.12z)| lr 5.32e-04 | 4174.23 ms | 32.3% bf16 MFU | 125317 tok/s step 8298/19560 | loss 3.439884 (-0.48z)| norm 0.2696 (+0.60z)| lr 5.32e-04 | 4166.17 ms | 32.4% bf16 MFU | 125343 tok/s step 8299/19560 | loss 3.375794 (-1.86z)| norm 0.2748 (+0.86z)| lr 5.32e-04 | 4191.93 ms | 32.2% bf16 MFU | 125330 tok/s step 8300/19560 | loss 3.435770 (-0.53z)| norm 0.2678 (+0.54z)| lr 5.32e-04 | 4177.32 ms | 32.3% bf16 MFU | 125338 tok/s step 8301/19560 | loss 3.436451 (-0.51z)| norm 0.2571 (-0.02z)| lr 5.32e-04 | 4168.83 ms | 32.4% bf16 MFU | 125360 tok/s step 8302/19560 | loss 3.396215 (-1.38z)| norm 0.2524 (-0.30z)| lr 5.31e-04 | 4195.29 ms | 32.2% bf16 MFU | 125340 tok/s step 8303/19560 | loss 3.465429 (+0.15z)| norm 0.2573 (+0.01z)| lr 5.31e-04 | 4169.24 ms | 32.4% bf16 MFU | 125361 tok/s step 8304/19560 | loss 3.504915 (+1.03z)| norm 0.3025 (+2.63z)| lr 5.31e-04 | 4214.22 ms | 32.0% bf16 MFU | 125313 tok/s step 8305/19560 | loss 3.467773 (+0.20z)| norm 0.2526 (-0.28z)| lr 5.31e-04 | 4213.21 ms | 32.0% bf16 MFU | 125270 tok/s step 8306/19560 | loss 3.461780 (+0.07z)| norm 0.2600 (+0.14z)| lr 5.31e-04 | 5196.59 ms | 26.0% bf16 MFU | 124051 tok/s step 8307/19560 | loss 3.445415 (-0.30z)| norm 0.2732 (+0.93z)| lr 5.31e-04 | 4924.56 ms | 27.4% bf16 MFU | 123171 tok/s step 8308/19560 | loss 3.461685 (+0.07z)| norm 0.2657 (+0.48z)| lr 5.31e-04 | 4994.88 ms | 27.0% bf16 MFU | 122261 tok/s step 8309/19560 | loss 3.467592 (+0.20z)| norm 0.2458 (-0.69z)| lr 5.31e-04 | 4825.24 ms | 28.0% bf16 MFU | 121581 tok/s step 8310/19560 | loss 3.407684 (-1.16z)| norm 0.2517 (-0.35z)| lr 5.31e-04 | 4371.08 ms | 30.9% bf16 MFU | 121499 tok/s step 8311/19560 | loss 3.405858 (-1.18z)| norm 0.2454 (-0.73z)| lr 5.31e-04 | 4494.04 ms | 30.0% bf16 MFU | 121257 tok/s step 8312/19560 | loss 3.445564 (-0.28z)| norm 0.2456 (-0.71z)| lr 5.31e-04 | 4305.89 ms | 
31.4% bf16 MFU | 121282 tok/s step 8313/19560 | loss 3.413664 (-0.99z)| norm 0.2548 (-0.16z)| lr 5.31e-04 | 4349.30 ms | 31.0% bf16 MFU | 121245 tok/s step 8314/19560 | loss 3.484969 (+0.62z)| norm 0.2327 (-1.46z)| lr 5.31e-04 | 4211.66 ms | 32.1% bf16 MFU | 121407 tok/s step 8315/19560 | loss 3.485770 (+0.62z)| norm 0.2605 (+0.19z)| lr 5.31e-04 | 4174.78 ms | 32.3% bf16 MFU | 121616 tok/s step 8316/19560 | loss 3.456929 (-0.02z)| norm 0.2389 (-1.10z)| lr 5.31e-04 | 4234.17 ms | 31.9% bf16 MFU | 121727 tok/s step 8317/19560 | loss 3.493066 (+0.81z)| norm 0.2495 (-0.48z)| lr 5.31e-04 | 4195.48 ms | 32.2% bf16 MFU | 121889 tok/s step 8318/19560 | loss 3.425395 (-0.73z)| norm 0.2457 (-0.72z)| lr 5.30e-04 | 4216.49 ms | 32.0% bf16 MFU | 122011 tok/s step 8319/19560 | loss 3.464490 (+0.17z)| norm 0.2154 (-2.49z)| lr 5.30e-04 | 4189.42 ms | 32.2% bf16 MFU | 122168 tok/s step 8320/19560 | loss 3.512966 (+1.28z)| norm 0.2387 (-1.13z)| lr 5.30e-04 | 4776.46 ms | 28.3% bf16 MFU | 121548 tok/s step 8321/19560 | loss 3.467261 (+0.22z)| norm 0.2410 (-0.98z)| lr 5.30e-04 | 4240.65 ms | 31.8% bf16 MFU | 121652 tok/s step 8322/19560 | loss 3.455754 (-0.05z)| norm 0.2468 (-0.64z)| lr 5.30e-04 | 4178.61 ms | 32.3% bf16 MFU | 121843 tok/s step 8323/19560 | loss 3.436519 (-0.50z)| norm 0.2607 (+0.21z)| lr 5.30e-04 | 4266.01 ms | 31.6% bf16 MFU | 121896 tok/s step 8324/19560 | loss 3.533608 (+1.72z)| norm 0.2477 (-0.58z)| lr 5.30e-04 | 4178.59 ms | 32.3% bf16 MFU | 122074 tok/s step 8325/19560 | loss 3.496022 (+0.84z)| norm 0.2389 (-1.10z)| lr 5.30e-04 | 4270.92 ms | 31.6% bf16 MFU | 122109 tok/s step 8326/19560 | loss 3.431922 (-0.65z)| norm 0.2487 (-0.50z)| lr 5.30e-04 | 4193.12 ms | 32.2% bf16 MFU | 122255 tok/s step 8327/19560 | loss 3.494614 (+0.80z)| norm 0.2383 (-1.13z)| lr 5.30e-04 | 4180.67 ms | 32.3% bf16 MFU | 122413 tok/s step 8328/19560 | loss 3.432987 (-0.62z)| norm 0.2582 (+0.06z)| lr 5.30e-04 | 4254.50 ms | 31.7% bf16 MFU | 122454 tok/s step 8329/19560 | loss 3.470504 
(+0.27z)| norm 0.2620 (+0.28z)| lr 5.30e-04 | 4198.19 ms | 32.2% bf16 MFU | 122575 tok/s step 8330/19560 | loss 3.421133 (-0.91z)| norm 0.2611 (+0.22z)| lr 5.30e-04 | 4193.04 ms | 32.2% bf16 MFU | 122698 tok/s step 8331/19560 | loss 3.516331 (+1.33z)| norm 0.2531 (-0.27z)| lr 5.30e-04 | 4243.88 ms | 31.8% bf16 MFU | 122740 tok/s step 8332/19560 | loss 3.437205 (-0.52z)| norm 0.2510 (-0.41z)| lr 5.30e-04 | 4179.77 ms | 32.3% bf16 MFU | 122875 tok/s step 8333/19560 | loss 3.481392 (+0.60z)| norm 0.2486 (-0.56z)| lr 5.30e-04 | 4181.95 ms | 32.3% bf16 MFU | 123000 tok/s step 8334/19560 | loss 3.455807 (-0.05z)| norm 0.2509 (-0.42z)| lr 5.29e-04 | 4199.17 ms | 32.2% bf16 MFU | 123092 tok/s step 8335/19560 | loss 3.499725 (+1.05z)| norm 0.2593 (+0.08z)| lr 5.29e-04 | 4566.26 ms | 29.6% bf16 MFU | 122679 tok/s step 8336/19560 | loss 3.611828 (+3.67z)| norm 0.2630 (+0.31z)| lr 5.29e-04 | 4186.34 ms | 32.3% bf16 MFU | 122807 tok/s step 8337/19560 | loss 3.522442 (+1.51z)| norm 0.2654 (+0.45z)| lr 5.29e-04 | 4180.39 ms | 32.3% bf16 MFU | 122937 tok/s step 8338/19560 | loss 3.483028 (+0.56z)| norm 0.2559 (-0.14z)| lr 5.29e-04 | 4178.47 ms | 32.3% bf16 MFU | 123064 tok/s step 8339/19560 | loss 3.461001 (+0.03z)| norm 0.2658 (+0.50z)| lr 5.29e-04 | 4223.47 ms | 32.0% bf16 MFU | 123118 tok/s step 8340/19560 | loss 3.533239 (+1.75z)| norm 0.2577 (-0.00z)| lr 5.29e-04 | 4249.87 ms | 31.8% bf16 MFU | 123130 tok/s step 8341/19560 | loss 3.458713 (-0.04z)| norm 0.2767 (+1.21z)| lr 5.29e-04 | 4237.85 ms | 31.9% bf16 MFU | 123159 tok/s step 8342/19560 | loss 3.383026 (-1.81z)| norm 0.2406 (-1.10z)| lr 5.29e-04 | 4311.55 ms | 31.3% bf16 MFU | 123081 tok/s step 8343/19560 | loss 3.449365 (-0.25z)| norm 0.2829 (+1.58z)| lr 5.29e-04 | 4573.53 ms | 29.5% bf16 MFU | 122659 tok/s step 8344/19560 | loss 3.457358 (-0.05z)| norm 0.2563 (-0.10z)| lr 5.29e-04 | 4175.29 ms | 32.3% bf16 MFU | 122805 tok/s step 8345/19560 | loss 3.528958 (+1.62z)| norm 0.2613 (+0.21z)| lr 5.29e-04 | 4172.67 ms | 
32.4% bf16 MFU | 122947 tok/s step 8346/19560 | loss 3.517707 (+1.34z)| norm 0.2867 (+1.78z)| lr 5.29e-04 | 4178.05 ms | 32.3% bf16 MFU | 123074 tok/s step 8347/19560 | loss 3.459512 (-0.03z)| norm 0.2726 (+0.88z)| lr 5.29e-04 | 4181.62 ms | 32.3% bf16 MFU | 123189 tok/s step 8348/19560 | loss 3.440016 (-0.50z)| norm 0.3013 (+2.58z)| lr 5.29e-04 | 4181.44 ms | 32.3% bf16 MFU | 123299 tok/s step 8349/19560 | loss 3.474997 (+0.32z)| norm 0.2873 (+1.70z)| lr 5.28e-04 | 4185.58 ms | 32.3% bf16 MFU | 123397 tok/s step 8350/19560 | loss 3.520338 (+1.38z)| norm 0.2558 (-0.20z)| lr 5.28e-04 | 4213.32 ms | 32.0% bf16 MFU | 123449 tok/s step 8351/19560 | loss 3.464981 (+0.06z)| norm 0.2886 (+1.76z)| lr 5.28e-04 | 4182.19 ms | 32.3% bf16 MFU | 123544 tok/s step 8352/19560 | loss 3.412269 (-1.18z)| norm 0.2541 (-0.30z)| lr 5.28e-04 | 4200.70 ms | 32.1% bf16 MFU | 123608 tok/s step 8353/19560 | loss 3.475499 (+0.32z)| norm 0.2730 (+0.85z)| lr 5.28e-04 | 4188.03 ms | 32.2% bf16 MFU | 123687 tok/s step 8354/19560 | loss 3.431045 (-0.73z)| norm 0.2420 (-1.03z)| lr 5.28e-04 | 4228.50 ms | 31.9% bf16 MFU | 123702 tok/s step 8355/19560 | loss 3.401017 (-1.42z)| norm 0.2393 (-1.18z)| lr 5.28e-04 | 4214.28 ms | 32.0% bf16 MFU | 123737 tok/s step 8356/19560 | loss 3.436880 (-0.56z)| norm 0.2388 (-1.21z)| lr 5.28e-04 | 4177.67 ms | 32.3% bf16 MFU | 123825 tok/s step 8357/19560 | loss 3.461356 (+0.02z)| norm 0.2336 (-1.51z)| lr 5.28e-04 | 4211.23 ms | 32.1% bf16 MFU | 123859 tok/s step 8358/19560 | loss 3.445462 (-0.35z)| norm 0.2441 (-0.84z)| lr 5.28e-04 | 4199.26 ms | 32.2% bf16 MFU | 123908 tok/s step 8359/19560 | loss 3.499418 (+0.92z)| norm 0.2550 (-0.13z)| lr 5.28e-04 | 4179.65 ms | 32.3% bf16 MFU | 123985 tok/s step 8360/19560 | loss 3.481809 (+0.49z)| norm 0.2525 (-0.27z)| lr 5.28e-04 | 4201.93 ms | 32.1% bf16 MFU | 124024 tok/s step 8361/19560 | loss 3.432630 (-0.70z)| norm 0.2433 (-0.86z)| lr 5.28e-04 | 4188.63 ms | 32.2% bf16 MFU | 124082 tok/s step 8362/19560 | loss 3.492503 
(+0.76z)| norm 0.2368 (-1.28z)| lr 5.28e-04 | 4428.55 ms | 30.5% bf16 MFU | 123797 tok/s step 8363/19560 | loss 3.418818 (-1.02z)| norm 0.2658 (+0.62z)| lr 5.28e-04 | 4327.20 ms | 31.2% bf16 MFU | 123665 tok/s step 8364/19560 | loss 3.420705 (-0.97z)| norm 0.2564 (+0.01z)| lr 5.28e-04 | 4237.92 ms | 31.9% bf16 MFU | 123668 tok/s step 8365/19560 | loss 3.492478 (+0.76z)| norm 0.2539 (-0.16z)| lr 5.27e-04 | 4185.28 ms | 32.3% bf16 MFU | 123748 tok/s step 8366/19560 | loss 3.442570 (-0.44z)| norm 0.2572 (+0.06z)| lr 5.27e-04 | 4212.91 ms | 32.0% bf16 MFU | 123783 tok/s step 8367/19560 | loss 3.423643 (-0.89z)| norm 0.2911 (+2.22z)| lr 5.27e-04 | 4292.97 ms | 31.5% bf16 MFU | 123700 tok/s step 8368/19560 | loss 3.448298 (-0.31z)| norm 0.2419 (-0.94z)| lr 5.27e-04 | 4213.40 ms | 32.0% bf16 MFU | 123737 tok/s step 8369/19560 | loss 3.441607 (-0.46z)| norm 0.2714 (+0.94z)| lr 5.27e-04 | 4256.21 ms | 31.7% bf16 MFU | 123709 tok/s step 8370/19560 | loss 3.436153 (-0.59z)| norm 0.2516 (-0.33z)| lr 5.27e-04 | 4201.85 ms | 32.1% bf16 MFU | 123762 tok/s step 8371/19560 | loss 3.427490 (-0.79z)| norm 0.2734 (+1.05z)| lr 5.27e-04 | 4173.36 ms | 32.4% bf16 MFU | 123855 tok/s step 8372/19560 | loss 3.427214 (-0.79z)| norm 0.2587 (+0.12z)| lr 5.27e-04 | 4173.19 ms | 32.4% bf16 MFU | 123944 tok/s step 8373/19560 | loss 3.421831 (-0.92z)| norm 0.2943 (+2.32z)| lr 5.27e-04 | 4265.09 ms | 31.7% bf16 MFU | 123893 tok/s step 8374/19560 | loss 3.425346 (-0.82z)| norm 0.2352 (-1.37z)| lr 5.27e-04 | 4299.36 ms | 31.4% bf16 MFU | 123796 tok/s step 8375/19560 | loss 3.452592 (-0.15z)| norm 0.2750 (+1.10z)| lr 5.27e-04 | 4184.33 ms | 32.3% bf16 MFU | 123871 tok/s step 8376/19560 | loss 3.531298 (+1.72z)| norm 0.2522 (-0.32z)| lr 5.27e-04 | 4198.44 ms | 32.2% bf16 MFU | 123921 tok/s step 8377/19560 | loss 3.460546 (+0.01z)| norm 0.2848 (+1.68z)| lr 5.27e-04 | 4183.91 ms | 32.3% bf16 MFU | 123991 tok/s step 8378/19560 | loss 3.475878 (+0.38z)| norm 0.2527 (-0.29z)| lr 5.27e-04 | 4269.21 ms | 
31.6% bf16 MFU | 123932 tok/s step 8379/19560 | loss 3.463480 (+0.08z)| norm 0.2656 (+0.50z)| lr 5.27e-04 | 4193.95 ms | 32.2% bf16 MFU | 123986 tok/s step 8380/19560 | loss 3.454158 (-0.14z)| norm 0.2661 (+0.52z)| lr 5.27e-04 | 4264.91 ms | 31.7% bf16 MFU | 123933 tok/s step 8381/19560 | loss 3.434922 (-0.59z)| norm 0.2593 (+0.09z)| lr 5.26e-04 | 4198.09 ms | 32.2% bf16 MFU | 123981 tok/s step 8382/19560 | loss 3.390595 (-1.64z)| norm 0.2637 (+0.37z)| lr 5.26e-04 | 4188.29 ms | 32.2% bf16 MFU | 124040 tok/s step 8383/19560 | loss 3.394461 (-1.52z)| norm 0.2450 (-0.79z)| lr 5.26e-04 | 4198.40 ms | 32.2% bf16 MFU | 124082 tok/s step 8384/19560 | loss 3.414454 (-1.03z)| norm 0.2381 (-1.20z)| lr 5.26e-04 | 4202.15 ms | 32.1% bf16 MFU | 124117 tok/s step 8385/19560 | loss 3.477092 (+0.47z)| norm 0.2353 (-1.35z)| lr 5.26e-04 | 4182.14 ms | 32.3% bf16 MFU | 124179 tok/s step 8386/19560 | loss 3.360482 (-2.27z)| norm 0.2381 (-1.16z)| lr 5.26e-04 | 4191.96 ms | 32.2% bf16 MFU | 124223 tok/s step 8387/19560 | loss 3.475879 (+0.43z)| norm 0.2501 (-0.41z)| lr 5.26e-04 | 4214.39 ms | 32.0% bf16 MFU | 124232 tok/s step 8388/19560 | loss 3.411965 (-1.08z)| norm 0.2624 (+0.36z)| lr 5.26e-04 | 4182.49 ms | 32.3% bf16 MFU | 124289 tok/s step 8389/19560 | loss 3.536126 (+1.92z)| norm 0.2598 (+0.20z)| lr 5.26e-04 | 4187.27 ms | 32.2% bf16 MFU | 124335 tok/s step 8390/19560 | loss 3.550570 (+2.21z)| norm 0.2610 (+0.28z)| lr 5.26e-04 | 4251.67 ms | 31.8% bf16 MFU | 124284 tok/s step 8391/19560 | loss 3.398005 (-1.41z)| norm 0.2556 (-0.06z)| lr 5.26e-04 | 4181.02 ms | 32.3% bf16 MFU | 124339 tok/s step 8392/19560 | loss 3.449761 (-0.17z)| norm 0.2518 (-0.29z)| lr 5.26e-04 | 4188.90 ms | 32.2% bf16 MFU | 124380 tok/s step 8393/19560 | loss 3.478061 (+0.50z)| norm 0.2533 (-0.20z)| lr 5.26e-04 | 4190.46 ms | 32.2% bf16 MFU | 124417 tok/s step 8394/19560 | loss 3.459336 (+0.05z)| norm 0.2593 (+0.17z)| lr 5.26e-04 | 4183.48 ms | 32.3% bf16 MFU | 124462 tok/s step 8395/19560 | loss 3.437334 
(-0.48z)| norm 0.2488 (-0.49z)| lr 5.26e-04 | 4186.99 ms | 32.2% bf16 MFU | 124500 tok/s step 8396/19560 | loss 3.495464 (+0.93z)| norm 0.2460 (-0.68z)| lr 5.25e-04 | 4177.49 ms | 32.3% bf16 MFU | 124550 tok/s step 8397/19560 | loss 3.499493 (+1.01z)| norm 0.2555 (-0.08z)| lr 5.25e-04 | 4187.72 ms | 32.2% bf16 MFU | 124583 tok/s step 8398/19560 | loss 3.495594 (+0.90z)| norm 0.2708 (+0.87z)| lr 5.25e-04 | 4268.49 ms | 31.6% bf16 MFU | 124495 tok/s step 8399/19560 | loss 3.473200 (+0.36z)| norm 0.2342 (-1.47z)| lr 5.25e-04 | 4178.81 ms | 32.3% bf16 MFU | 124543 tok/s step 8400/19560 | loss 3.457854 (-0.01z)| norm 0.2899 (+2.05z)| lr 5.25e-04 | 4202.07 ms | 32.1% bf16 MFU | 124555 tok/s step 8401/19560 | loss 3.508003 (+1.19z)| norm 0.2727 (+0.96z)| lr 5.25e-04 | 4179.09 ms | 32.3% bf16 MFU | 124600 tok/s step 8402/19560 | loss 3.482336 (+0.56z)| norm 0.2464 (-0.71z)| lr 5.25e-04 | 4192.02 ms | 32.2% bf16 MFU | 124623 tok/s step 8403/19560 | loss 3.441216 (-0.43z)| norm 0.2601 (+0.16z)| lr 5.25e-04 | 4175.55 ms | 32.3% bf16 MFU | 124670 tok/s step 8404/19560 | loss 3.459351 (+0.02z)| norm 0.2348 (-1.42z)| lr 5.25e-04 | 4179.98 ms | 32.3% bf16 MFU | 124708 tok/s step 8405/19560 | loss 3.463813 (+0.13z)| norm 0.2462 (-0.69z)| lr 5.25e-04 | 4189.58 ms | 32.2% bf16 MFU | 124730 tok/s step 8406/19560 | loss 3.481571 (+0.56z)| norm 0.2345 (-1.42z)| lr 5.25e-04 | 4180.95 ms | 32.3% bf16 MFU | 124763 tok/s step 8407/19560 | loss 3.495090 (+0.88z)| norm 0.2469 (-0.63z)| lr 5.25e-04 | 4176.48 ms | 32.3% bf16 MFU | 124802 tok/s step 8408/19560 | loss 3.444322 (-0.36z)| norm 0.2664 (+0.59z)| lr 5.25e-04 | 4172.86 ms | 32.4% bf16 MFU | 124844 tok/s step 8409/19560 | loss 3.427207 (-0.79z)| norm 0.2582 (+0.08z)| lr 5.25e-04 | 4179.94 ms | 32.3% bf16 MFU | 124873 tok/s step 8410/19560 | loss 3.481487 (+0.54z)| norm 0.2473 (-0.60z)| lr 5.25e-04 | 4189.91 ms | 32.2% bf16 MFU | 124886 tok/s step 8411/19560 | loss 3.477213 (+0.43z)| norm 0.2630 (+0.39z)| lr 5.25e-04 | 4184.68 ms | 
32.3% bf16 MFU | 124906 tok/s step 8412/19560 | loss 3.425280 (-0.86z)| norm 0.2555 (-0.08z)| lr 5.24e-04 | 4229.85 ms | 31.9% bf16 MFU | 124858 tok/s step 8413/19560 | loss 3.370898 (-2.17z)| norm 0.2598 (+0.18z)| lr 5.24e-04 | 4178.78 ms | 32.3% bf16 MFU | 124888 tok/s step 8414/19560 | loss 3.441264 (-0.46z)| norm 0.2361 (-1.33z)| lr 5.24e-04 | 4217.32 ms | 32.0% bf16 MFU | 124860 tok/s step 8415/19560 | loss 3.424576 (-0.85z)| norm 0.2482 (-0.56z)| lr 5.24e-04 | 4177.69 ms | 32.3% bf16 MFU | 124892 tok/s step 8416/19560 | loss 3.506490 (+1.15z)| norm 0.2374 (-1.24z)| lr 5.24e-04 | 4185.01 ms | 32.3% bf16 MFU | 124911 tok/s step 8417/19560 | loss 3.421015 (-0.94z)| norm 0.2348 (-1.40z)| lr 5.24e-04 | 4307.26 ms | 31.3% bf16 MFU | 124752 tok/s step 8418/19560 | loss 3.494478 (+0.87z)| norm 0.2606 (+0.25z)| lr 5.24e-04 | 4283.06 ms | 31.5% bf16 MFU | 124634 tok/s step 8419/19560 | loss 3.482234 (+0.59z)| norm 0.2332 (-1.48z)| lr 5.24e-04 | 4185.64 ms | 32.3% bf16 MFU | 124666 tok/s step 8420/19560 | loss 3.477919 (+0.48z)| norm 0.2418 (-0.92z)| lr 5.24e-04 | 4183.72 ms | 32.3% bf16 MFU | 124698 tok/s step 8421/19560 | loss 3.461804 (+0.07z)| norm 0.2806 (+1.52z)| lr 5.24e-04 | 4257.79 ms | 31.7% bf16 MFU | 124620 tok/s step 8422/19560 | loss 3.491127 (+0.79z)| norm 0.2487 (-0.48z)| lr 5.24e-04 | 4387.95 ms | 30.8% bf16 MFU | 124363 tok/s step 8423/19560 | loss 3.444722 (-0.37z)| norm 0.2730 (+1.04z)| lr 5.24e-04 | 4185.12 ms | 32.3% bf16 MFU | 124409 tok/s step 8424/19560 | loss 3.499624 (+1.01z)| norm 0.2471 (-0.58z)| lr 5.24e-04 | 4195.70 ms | 32.2% bf16 MFU | 124436 tok/s step 8425/19560 | loss 3.576724 (+2.84z)| norm 0.2573 (+0.08z)| lr 5.24e-04 | 4188.11 ms | 32.2% bf16 MFU | 124474 tok/s step 8426/19560 | loss 3.402320 (-1.41z)| norm 0.2553 (-0.04z)| lr 5.24e-04 | 4179.88 ms | 32.3% bf16 MFU | 124522 tok/s step 8427/19560 | loss 3.516603 (+1.35z)| norm 0.2787 (+1.48z)| lr 5.23e-04 | 4189.19 ms | 32.2% bf16 MFU | 124553 tok/s step 8428/19560 | loss 3.467171 
(+0.14z)| norm 0.2661 (+0.66z)| lr 5.23e-04 | 4196.10 ms | 32.2% bf16 MFU | 124573 tok/s step 8429/19560 | loss 3.459674 (-0.05z)| norm 0.2498 (-0.39z)| lr 5.23e-04 | 4212.42 ms | 32.1% bf16 MFU | 124567 tok/s step 8430/19560 | loss 3.406883 (-1.36z)| norm 0.2820 (+1.66z)| lr 5.23e-04 | 4222.28 ms | 32.0% bf16 MFU | 124548 tok/s step 8431/19560 | loss 3.512832 (+1.24z)| norm 0.2296 (-1.66z)| lr 5.23e-04 | 4207.30 ms | 32.1% bf16 MFU | 124551 tok/s step 8432/19560 | loss 3.417043 (-1.09z)| norm 0.2680 (+0.81z)| lr 5.23e-04 | 4186.57 ms | 32.3% bf16 MFU | 124585 tok/s step 8433/19560 | loss 3.483476 (+0.53z)| norm 0.2464 (-0.60z)| lr 5.23e-04 | 4196.50 ms | 32.2% bf16 MFU | 124602 tok/s step 8434/19560 | loss 3.510522 (+1.18z)| norm 0.2773 (+1.40z)| lr 5.23e-04 | 4178.26 ms | 32.3% bf16 MFU | 124646 tok/s step 8435/19560 | loss 3.413386 (-1.17z)| norm 0.2553 (-0.02z)| lr 5.23e-04 | 4194.58 ms | 32.2% bf16 MFU | 124663 tok/s step 8436/19560 | loss 3.445451 (-0.39z)| norm 0.3098 (+3.36z)| lr 5.23e-04 | 4173.21 ms | 32.4% bf16 MFU | 124712 tok/s step 8437/19560 | loss 3.420715 (-0.98z)| norm 0.2533 (-0.17z)| lr 5.23e-04 | 4208.94 ms | 32.1% bf16 MFU | 124705 tok/s step 8438/19560 | loss 3.407374 (-1.30z)| norm 0.2522 (-0.23z)| lr 5.23e-04 | 4171.25 ms | 32.4% bf16 MFU | 124754 tok/s step 8439/19560 | loss 3.455028 (-0.16z)| norm 0.2504 (-0.35z)| lr 5.23e-04 | 4182.15 ms | 32.3% bf16 MFU | 124784 tok/s step 8440/19560 | loss 3.405507 (-1.35z)| norm 0.2438 (-0.76z)| lr 5.23e-04 | 4236.87 ms | 31.9% bf16 MFU | 124732 tok/s step 8441/19560 | loss 3.488358 (+0.64z)| norm 0.2605 (+0.28z)| lr 5.23e-04 | 4196.77 ms | 32.2% bf16 MFU | 124742 tok/s step 8442/19560 | loss 3.537619 (+1.80z)| norm 0.2608 (+0.28z)| lr 5.23e-04 | 4246.03 ms | 31.8% bf16 MFU | 124679 tok/s step 8443/19560 | loss 3.418655 (-1.03z)| norm 0.2655 (+0.58z)| lr 5.22e-04 | 4184.21 ms | 32.3% bf16 MFU | 124710 tok/s step 8444/19560 | loss 3.496984 (+0.83z)| norm 0.2645 (+0.51z)| lr 5.22e-04 | 4193.68 ms | 
32.2% bf16 MFU | 124725 tok/s step 8445/19560 | loss 3.526565 (+1.51z)| norm 0.2429 (-0.85z)| lr 5.22e-04 | 4172.67 ms | 32.4% bf16 MFU | 124772 tok/s step 8446/19560 | loss 3.555357 (+2.14z)| norm 0.2679 (+0.71z)| lr 5.22e-04 | 4199.06 ms | 32.2% bf16 MFU | 124776 tok/s step 8447/19560 | loss 3.421644 (-0.96z)| norm 0.2900 (+2.09z)| lr 5.22e-04 | 4185.18 ms | 32.3% bf16 MFU | 124801 tok/s step 8448/19560 | loss 3.426255 (-0.84z)| norm 0.2708 (+0.85z)| lr 5.22e-04 | 4186.49 ms | 32.3% bf16 MFU | 124822 tok/s step 8449/19560 | loss 3.461170 (-0.03z)| norm 0.2608 (+0.20z)| lr 5.22e-04 | 4195.52 ms | 32.2% bf16 MFU | 124829 tok/s step 8450/19560 | loss 3.458913 (-0.08z)| norm 0.2481 (-0.61z)| lr 5.22e-04 | 4197.29 ms | 32.2% bf16 MFU | 124833 tok/s step 8451/19560 | loss 3.442208 (-0.47z)| norm 0.2737 (+1.02z)| lr 5.22e-04 | 4208.08 ms | 32.1% bf16 MFU | 124821 tok/s step 8452/19560 | loss 3.480357 (+0.43z)| norm 0.2745 (+1.05z)| lr 5.22e-04 | 4174.69 ms | 32.3% bf16 MFU | 124860 tok/s step 8453/19560 | loss 3.488605 (+0.63z)| norm 0.2585 (+0.02z)| lr 5.22e-04 | 4279.58 ms | 31.5% bf16 MFU | 124742 tok/s step 8454/19560 | loss 3.450741 (-0.27z)| norm 0.2592 (+0.06z)| lr 5.22e-04 | 4198.53 ms | 32.2% bf16 MFU | 124749 tok/s step 8455/19560 | loss 3.443465 (-0.43z)| norm 0.2505 (-0.50z)| lr 5.22e-04 | 4235.31 ms | 31.9% bf16 MFU | 124701 tok/s step 8456/19560 | loss 3.518412 (+1.31z)| norm 0.2469 (-0.73z)| lr 5.22e-04 | 4188.19 ms | 32.2% bf16 MFU | 124725 tok/s step 8457/19560 | loss 3.405317 (-1.32z)| norm 0.2604 (+0.14z)| lr 5.22e-04 | 4186.19 ms | 32.3% bf16 MFU | 124751 tok/s step 8458/19560 | loss 3.389991 (-1.66z)| norm 0.2671 (+0.57z)| lr 5.21e-04 | 4178.36 ms | 32.3% bf16 MFU | 124787 tok/s step 8459/19560 | loss 3.460439 (-0.02z)| norm 0.2341 (-1.52z)| lr 5.21e-04 | 4176.03 ms | 32.3% bf16 MFU | 124825 tok/s step 8460/19560 | loss 3.456301 (-0.12z)| norm 0.2713 (+0.83z)| lr 5.21e-04 | 4214.32 ms | 32.0% bf16 MFU | 124804 tok/s step 8461/19560 | loss 3.456170 
(-0.12z)| norm 0.2405 (-1.12z)| lr 5.21e-04 | 4223.77 ms | 32.0% bf16 MFU | 124770 tok/s step 8462/19560 | loss 3.484618 (+0.54z)| norm 0.2342 (-1.49z)| lr 5.21e-04 | 4185.88 ms | 32.3% bf16 MFU | 124794 tok/s step 8463/19560 | loss 3.473390 (+0.28z)| norm 0.2526 (-0.34z)| lr 5.21e-04 | 4172.22 ms | 32.4% bf16 MFU | 124838 tok/s step 8464/19560 | loss 3.444979 (-0.37z)| norm 0.2489 (-0.56z)| lr 5.21e-04 | 4182.10 ms | 32.3% bf16 MFU | 124864 tok/s step 8465/19560 | loss 3.484799 (+0.62z)| norm 0.2673 (+0.59z)| lr 5.21e-04 | 4197.27 ms | 32.2% bf16 MFU | 124866 tok/s step 8466/19560 | loss 3.467030 (+0.19z)| norm 0.2683 (+0.65z)| lr 5.21e-04 | 4204.44 ms | 32.1% bf16 MFU | 124858 tok/s step 8467/19560 | loss 3.421179 (-0.94z)| norm 0.2454 (-0.78z)| lr 5.21e-04 | 4187.21 ms | 32.2% bf16 MFU | 124876 tok/s step 8468/19560 | loss 3.413497 (-1.12z)| norm 0.2584 (+0.04z)| lr 5.21e-04 | 4210.00 ms | 32.1% bf16 MFU | 124859 tok/s step 8469/19560 | loss 3.439445 (-0.47z)| norm 0.2487 (-0.56z)| lr 5.21e-04 | 4194.73 ms | 32.2% bf16 MFU | 124865 tok/s step 8470/19560 | loss 3.480599 (+0.55z)| norm 0.2424 (-0.95z)| lr 5.21e-04 | 4193.86 ms | 32.2% bf16 MFU | 124872 tok/s step 8471/19560 | loss 3.532926 (+1.83z)| norm 0.2628 (+0.34z)| lr 5.21e-04 | 4230.22 ms | 31.9% bf16 MFU | 124826 tok/s step 8472/19560 | loss 3.445054 (-0.36z)| norm 0.2200 (-2.31z)| lr 5.21e-04 | 4189.49 ms | 32.2% bf16 MFU | 124842 tok/s step 8473/19560 | loss 3.454496 (-0.11z)| norm 0.2556 (-0.09z)| lr 5.21e-04 | 4181.45 ms | 32.3% bf16 MFU | 124869 tok/s step 8474/19560 | loss 3.412143 (-1.16z)| norm 0.2267 (-1.86z)| lr 5.20e-04 | 4176.55 ms | 32.3% bf16 MFU | 124902 tok/s step 8475/19560 | loss 3.484027 (+0.65z)| norm 0.2420 (-0.90z)| lr 5.20e-04 | 4192.70 ms | 32.2% bf16 MFU | 124909 tok/s step 8476/19560 | loss 3.421517 (-0.92z)| norm 0.2606 (+0.29z)| lr 5.20e-04 | 4186.74 ms | 32.2% bf16 MFU | 124925 tok/s step 8477/19560 | loss 3.431347 (-0.66z)| norm 0.2244 (-2.00z)| lr 5.20e-04 | 4193.79 ms | 
32.2% bf16 MFU | 124930 tok/s step 8478/19560 | loss 3.485652 (+0.72z)| norm 0.2821 (+1.67z)| lr 5.20e-04 | 4252.16 ms | 31.8% bf16 MFU | 124848 tok/s step 8479/19560 | loss 3.512172 (+1.37z)| norm 0.2559 (+0.02z)| lr 5.20e-04 | 4180.27 ms | 32.3% bf16 MFU | 124877 tok/s step 8480/19560 | loss 3.502665 (+1.11z)| norm 0.2335 (-1.40z)| lr 5.20e-04 | 4230.46 ms | 31.9% bf16 MFU | 124829 tok/s step 8481/19560 | loss 3.414446 (-1.09z)| norm 0.2661 (+0.69z)| lr 5.20e-04 | 4187.87 ms | 32.2% bf16 MFU | 124848 tok/s step 8482/19560 | loss 3.448420 (-0.25z)| norm 0.2458 (-0.62z)| lr 5.20e-04 | 4185.30 ms | 32.3% bf16 MFU | 124869 tok/s step 8483/19560 | loss 3.447006 (-0.29z)| norm 0.2342 (-1.35z)| lr 5.20e-04 | 4199.63 ms | 32.1% bf16 MFU | 124867 tok/s step 8484/19560 | loss 3.429857 (-0.72z)| norm 0.2427 (-0.81z)| lr 5.20e-04 | 4183.98 ms | 32.3% bf16 MFU | 124889 tok/s step 8485/19560 | loss 3.475466 (+0.43z)| norm 0.2402 (-0.98z)| lr 5.20e-04 | 4189.29 ms | 32.2% bf16 MFU | 124902 tok/s step 8486/19560 | loss 3.467481 (+0.22z)| norm 0.2359 (-1.25z)| lr 5.20e-04 | 4169.88 ms | 32.4% bf16 MFU | 124944 tok/s step 8487/19560 | loss 3.472719 (+0.36z)| norm 0.2482 (-0.45z)| lr 5.20e-04 | 4239.27 ms | 31.8% bf16 MFU | 124880 tok/s step 8488/19560 | loss 3.438910 (-0.49z)| norm 0.2559 (+0.04z)| lr 5.20e-04 | 4189.06 ms | 32.2% bf16 MFU | 124894 tok/s step 8489/19560 | loss 3.487422 (+0.73z)| norm 0.2357 (-1.25z)| lr 5.19e-04 | 4176.59 ms | 32.3% bf16 MFU | 124926 tok/s step 8490/19560 | loss 3.457009 (-0.03z)| norm 0.2492 (-0.39z)| lr 5.19e-04 | 4232.94 ms | 31.9% bf16 MFU | 124873 tok/s step 8491/19560 | loss 3.479062 (+0.52z)| norm 0.2477 (-0.49z)| lr 5.19e-04 | 4174.06 ms | 32.3% bf16 MFU | 124909 tok/s step 8492/19560 | loss 3.468348 (+0.23z)| norm 0.2608 (+0.36z)| lr 5.19e-04 | 4167.31 ms | 32.4% bf16 MFU | 124954 tok/s step 8493/19560 | loss 3.397541 (-1.55z)| norm 0.2332 (-1.40z)| lr 5.19e-04 | 4186.23 ms | 32.3% bf16 MFU | 124969 tok/s step 8494/19560 | loss 3.445385 
(-0.33z)| norm 0.2770 (+1.37z)| lr 5.19e-04 | 4235.31 ms | 31.9% bf16 MFU | 124910 tok/s step 8495/19560 | loss 3.489620 (+0.78z)| norm 0.2675 (+0.80z)| lr 5.19e-04 | 4373.63 ms | 30.9% bf16 MFU | 124658 tok/s step 8496/19560 | loss 3.492233 (+0.84z)| norm 0.2420 (-0.84z)| lr 5.19e-04 | 4401.75 ms | 30.7% bf16 MFU | 124380 tok/s step 8497/19560 | loss 3.445424 (-0.36z)| norm 0.2848 (+1.89z)| lr 5.19e-04 | 4975.66 ms | 27.1% bf16 MFU | 123430 tok/s step 8498/19560 | loss 3.409477 (-1.26z)| norm 0.2681 (+0.82z)| lr 5.19e-04 | 4512.12 ms | 29.9% bf16 MFU | 123068 tok/s step 8499/19560 | loss 3.428265 (-0.78z)| norm 0.2220 (-2.07z)| lr 5.19e-04 | 4289.21 ms | 31.5% bf16 MFU | 123027 tok/s step 8500/19560 | loss 3.465673 (+0.16z)| norm 0.2658 (+0.68z)| lr 5.19e-04 | 4227.45 ms | 31.9% bf16 MFU | 123076 tok/s val loss 3.418805 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 
evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating 
HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2913/10042 = 0.290082 step 8501/19560 | loss 3.417589 (-1.06z)| norm 0.2487 (-0.38z)| lr 5.19e-04 | 4469.59 ms | 30.2% bf16 MFU | 122787 tok/s step 8502/19560 | loss 3.466159 (+0.16z)| norm 0.2478 (-0.45z)| lr 5.19e-04 | 4353.80 ms | 31.0% bf16 MFU | 122669 tok/s step 8503/19560 | loss 3.431692 (-0.71z)| norm 0.2722 (+1.13z)| lr 5.19e-04 | 4254.31 ms | 31.7% bf16 MFU | 122697 tok/s step 8504/19560 | loss 3.436277 (-0.58z)| norm 0.2398 (-0.95z)| lr 5.19e-04 | 4484.53 ms | 30.1% bf16 MFU | 122408 tok/s step 8505/19560 | loss 3.442698 (-0.41z)| norm 0.2459 (-0.55z)| lr 5.18e-04 | 4252.76 ms | 31.7% bf16 MFU | 122452 tok/s step 8506/19560 | loss 3.380955 (-1.95z)| norm 0.2484 (-0.38z)| lr 5.18e-04 | 4277.87 ms | 31.6% bf16 MFU | 122457 tok/s step 8507/19560 | loss 3.482076 (+0.60z)| norm 0.2393 (-0.97z)| lr 5.18e-04 | 4181.11 ms | 32.3% bf16 MFU | 122604 tok/s step 8508/19560 | loss 3.476051 (+0.45z)| norm 0.2428 (-0.72z)| lr 5.18e-04 | 4180.09 ms | 32.3% bf16 MFU | 122745 tok/s step 8509/19560 | loss 3.381232 (-1.91z)| norm 0.2584 (+0.30z)| lr 5.18e-04 | 4284.56 ms | 31.5% bf16 MFU | 122726 tok/s step 8510/19560 | loss 3.446773 (-0.29z)| norm 0.2653 (+0.75z)| lr 5.18e-04 | 4178.83 ms | 32.3% bf16 MFU | 122863 tok/s 
step 8511/19560 | loss 3.389607 (-1.73z)| norm 0.2453 (-0.56z)| lr 5.18e-04 | 4181.24 ms | 32.3% bf16 MFU | 122989 tok/s step 8512/19560 | loss 3.407158 (-1.28z)| norm 0.2497 (-0.28z)| lr 5.18e-04 | 4250.28 ms | 31.8% bf16 MFU | 123008 tok/s step 8513/19560 | loss 3.420183 (-0.94z)| norm 0.2676 (+0.88z)| lr 5.18e-04 | 4253.84 ms | 31.7% bf16 MFU | 123020 tok/s step 8514/19560 | loss 3.480865 (+0.57z)| norm 0.2556 (+0.08z)| lr 5.18e-04 | 4167.77 ms | 32.4% bf16 MFU | 123159 tok/s step 8515/19560 | loss 3.443188 (-0.39z)| norm 0.2576 (+0.21z)| lr 5.18e-04 | 4236.26 ms | 31.9% bf16 MFU | 123189 tok/s step 8516/19560 | loss 3.440588 (-0.47z)| norm 0.2396 (-0.96z)| lr 5.18e-04 | 4324.14 ms | 31.2% bf16 MFU | 123092 tok/s step 8517/19560 | loss 3.460714 (+0.07z)| norm 0.2490 (-0.34z)| lr 5.18e-04 | 4267.26 ms | 31.6% bf16 MFU | 123080 tok/s step 8518/19560 | loss 3.419837 (-0.99z)| norm 0.2276 (-1.72z)| lr 5.18e-04 | 4169.68 ms | 32.4% bf16 MFU | 123213 tok/s step 8519/19560 | loss 3.417517 (-1.06z)| norm 0.2503 (-0.23z)| lr 5.18e-04 | 4165.65 ms | 32.4% bf16 MFU | 123345 tok/s step 8520/19560 | loss 3.446865 (-0.28z)| norm 0.2525 (-0.09z)| lr 5.17e-04 | 4196.06 ms | 32.2% bf16 MFU | 123425 tok/s step 8521/19560 | loss 3.530158 (+1.92z)| norm 0.2655 (+0.75z)| lr 5.17e-04 | 4169.61 ms | 32.4% bf16 MFU | 123541 tok/s step 8522/19560 | loss 3.422235 (-0.92z)| norm 0.2406 (-0.86z)| lr 5.17e-04 | 4185.09 ms | 32.3% bf16 MFU | 123628 tok/s step 8523/19560 | loss 3.461793 (+0.11z)| norm 0.2740 (+1.29z)| lr 5.17e-04 | 4180.84 ms | 32.3% bf16 MFU | 123717 tok/s step 8524/19560 | loss 3.434210 (-0.60z)| norm 0.2312 (-1.45z)| lr 5.17e-04 | 4170.51 ms | 32.4% bf16 MFU | 123816 tok/s step 8525/19560 | loss 3.449135 (-0.20z)| norm 0.2626 (+0.55z)| lr 5.17e-04 | 4178.79 ms | 32.3% bf16 MFU | 123899 tok/s step 8526/19560 | loss 3.450511 (-0.15z)| norm 0.2508 (-0.19z)| lr 5.17e-04 | 4184.50 ms | 32.3% bf16 MFU | 123969 tok/s step 8527/19560 | loss 3.476626 (+0.54z)| norm 0.2665 (+0.81z)| 
lr 5.17e-04 | 4179.99 ms | 32.3% bf16 MFU | 124042 tok/s step 8528/19560 | loss 3.482722 (+0.70z)| norm 0.2606 (+0.45z)| lr 5.17e-04 | 4219.85 ms | 32.0% bf16 MFU | 124052 tok/s step 8529/19560 | loss 3.466637 (+0.28z)| norm 0.2463 (-0.48z)| lr 5.17e-04 | 4170.16 ms | 32.4% bf16 MFU | 124135 tok/s step 8530/19560 | loss 3.386705 (-1.82z)| norm 0.2555 (+0.12z)| lr 5.17e-04 | 4188.85 ms | 32.2% bf16 MFU | 124187 tok/s step 8531/19560 | loss 3.424408 (-0.82z)| norm 0.2269 (-1.74z)| lr 5.17e-04 | 4188.32 ms | 32.2% bf16 MFU | 124236 tok/s step 8532/19560 | loss 3.499555 (+1.16z)| norm 0.2725 (+1.23z)| lr 5.17e-04 | 4174.82 ms | 32.3% bf16 MFU | 124304 tok/s step 8533/19560 | loss 3.420324 (-0.92z)| norm 0.2667 (+0.84z)| lr 5.17e-04 | 4184.33 ms | 32.3% bf16 MFU | 124353 tok/s step 8534/19560 | loss 3.439775 (-0.40z)| norm 0.2486 (-0.35z)| lr 5.17e-04 | 4183.84 ms | 32.3% bf16 MFU | 124401 tok/s step 8535/19560 | loss 3.416614 (-0.99z)| norm 0.2557 (+0.11z)| lr 5.17e-04 | 4190.51 ms | 32.2% bf16 MFU | 124437 tok/s step 8536/19560 | loss 3.579942 (+3.14z)| norm 0.2959 (+2.67z)| lr 5.16e-04 | 4186.38 ms | 32.3% bf16 MFU | 124477 tok/s step 8537/19560 | loss 3.437202 (-0.46z)| norm 0.2745 (+1.28z)| lr 5.16e-04 | 4203.08 ms | 32.1% bf16 MFU | 124490 tok/s step 8538/19560 | loss 3.532133 (+1.90z)| norm 0.2778 (+1.46z)| lr 5.16e-04 | 4179.58 ms | 32.3% bf16 MFU | 124537 tok/s step 8539/19560 | loss 3.410793 (-1.11z)| norm 0.2487 (-0.37z)| lr 5.16e-04 | 4218.78 ms | 32.0% bf16 MFU | 124524 tok/s step 8540/19560 | loss 3.429575 (-0.64z)| norm 0.3054 (+3.08z)| lr 5.16e-04 | 4181.61 ms | 32.3% bf16 MFU | 124567 tok/s step 8541/19560 | loss 3.514102 (+1.45z)| norm 0.2587 (+0.23z)| lr 5.16e-04 | 4167.56 ms | 32.4% bf16 MFU | 124629 tok/s step 8542/19560 | loss 3.512763 (+1.39z)| norm 0.2658 (+0.65z)| lr 5.16e-04 | 4188.24 ms | 32.2% bf16 MFU | 124656 tok/s step 8543/19560 | loss 3.414369 (-1.06z)| norm 0.2449 (-0.62z)| lr 5.16e-04 | 4187.58 ms | 32.2% bf16 MFU | 124684 tok/s step 
8544/19560 | loss 3.453556 (-0.08z)| norm 0.2521 (-0.19z)| lr 5.16e-04 | 4174.87 ms | 32.3% bf16 MFU | 124729 tok/s step 8545/19560 | loss 3.486969 (+0.75z)| norm 0.2504 (-0.30z)| lr 5.16e-04 | 4174.81 ms | 32.3% bf16 MFU | 124771 tok/s step 8546/19560 | loss 3.474565 (+0.44z)| norm 0.2670 (+0.72z)| lr 5.16e-04 | 4188.36 ms | 32.2% bf16 MFU | 124792 tok/s step 8547/19560 | loss 3.512403 (+1.38z)| norm 0.2548 (-0.05z)| lr 5.16e-04 | 4165.21 ms | 32.4% bf16 MFU | 124846 tok/s step 8548/19560 | loss 3.507332 (+1.24z)| norm 0.2593 (+0.23z)| lr 5.16e-04 | 4170.36 ms | 32.4% bf16 MFU | 124889 tok/s step 8549/19560 | loss 3.486051 (+0.71z)| norm 0.2536 (-0.12z)| lr 5.16e-04 | 4196.24 ms | 32.2% bf16 MFU | 124892 tok/s step 8550/19560 | loss 3.436379 (-0.52z)| norm 0.2720 (+1.02z)| lr 5.16e-04 | 4166.04 ms | 32.4% bf16 MFU | 124940 tok/s step 8551/19560 | loss 3.381295 (-1.85z)| norm 0.2508 (-0.30z)| lr 5.15e-04 | 4169.76 ms | 32.4% bf16 MFU | 124980 tok/s step 8552/19560 | loss 3.458336 (+0.05z)| norm 0.2382 (-1.08z)| lr 5.15e-04 | 4177.10 ms | 32.3% bf16 MFU | 125006 tok/s step 8553/19560 | loss 3.500947 (+1.15z)| norm 0.2671 (+0.73z)| lr 5.15e-04 | 4182.49 ms | 32.3% bf16 MFU | 125024 tok/s step 8554/19560 | loss 3.509043 (+1.33z)| norm 0.2494 (-0.38z)| lr 5.15e-04 | 4182.51 ms | 32.3% bf16 MFU | 125040 tok/s step 8555/19560 | loss 3.444156 (-0.30z)| norm 0.2679 (+0.79z)| lr 5.15e-04 | 4188.75 ms | 32.2% bf16 MFU | 125046 tok/s step 8556/19560 | loss 3.478318 (+0.57z)| norm 0.2607 (+0.34z)| lr 5.15e-04 | 4214.31 ms | 32.0% bf16 MFU | 125014 tok/s step 8557/19560 | loss 3.440561 (-0.39z)| norm 0.2659 (+0.65z)| lr 5.15e-04 | 4164.53 ms | 32.4% bf16 MFU | 125058 tok/s step 8558/19560 | loss 3.374586 (-2.05z)| norm 0.2574 (+0.13z)| lr 5.15e-04 | 4180.81 ms | 32.3% bf16 MFU | 125076 tok/s step 8559/19560 | loss 3.416560 (-0.98z)| norm 0.2712 (+1.00z)| lr 5.15e-04 | 4167.39 ms | 32.4% bf16 MFU | 125112 tok/s step 8560/19560 | loss 3.409738 (-1.15z)| norm 0.2624 (+0.44z)| lr 
5.15e-04 | 4189.05 ms | 32.2% bf16 MFU | 125114 tok/s step 8561/19560 | loss 3.447579 (-0.18z)| norm 0.2408 (-0.94z)| lr 5.15e-04 | 4184.92 ms | 32.3% bf16 MFU | 125123 tok/s step 8562/19560 | loss 3.418830 (-0.90z)| norm 0.2396 (-1.00z)| lr 5.15e-04 | 4191.48 ms | 32.2% bf16 MFU | 125121 tok/s step 8563/19560 | loss 3.469061 (+0.38z)| norm 0.2641 (+0.57z)| lr 5.15e-04 | 4178.32 ms | 32.3% bf16 MFU | 125139 tok/s step 8564/19560 | loss 3.414189 (-1.02z)| norm 0.2703 (+1.03z)| lr 5.15e-04 | 4172.31 ms | 32.4% bf16 MFU | 125165 tok/s step 8565/19560 | loss 3.465503 (+0.28z)| norm 0.2629 (+0.53z)| lr 5.15e-04 | 4286.45 ms | 31.5% bf16 MFU | 125022 tok/s step 8566/19560 | loss 3.426607 (-0.72z)| norm 0.2398 (-1.01z)| lr 5.14e-04 | 4171.91 ms | 32.4% bf16 MFU | 125055 tok/s step 8567/19560 | loss 3.445797 (-0.22z)| norm 0.2374 (-1.16z)| lr 5.14e-04 | 4288.27 ms | 31.5% bf16 MFU | 124915 tok/s step 8568/19560 | loss 3.418109 (-0.94z)| norm 0.2326 (-1.46z)| lr 5.14e-04 | 4182.58 ms | 32.3% bf16 MFU | 124937 tok/s step 8569/19560 | loss 3.425028 (-0.75z)| norm 0.2509 (-0.25z)| lr 5.14e-04 | 4178.89 ms | 32.3% bf16 MFU | 124963 tok/s step 8570/19560 | loss 3.490371 (+0.96z)| norm 0.2328 (-1.42z)| lr 5.14e-04 | 4247.73 ms | 31.8% bf16 MFU | 124886 tok/s step 8571/19560 | loss 3.455747 (+0.04z)| norm 0.2612 (+0.44z)| lr 5.14e-04 | 4228.07 ms | 31.9% bf16 MFU | 124842 tok/s step 8572/19560 | loss 3.387563 (-1.71z)| norm 0.2536 (-0.05z)| lr 5.14e-04 | 4171.40 ms | 32.4% bf16 MFU | 124884 tok/s step 8573/19560 | loss 3.463074 (+0.27z)| norm 0.2422 (-0.80z)| lr 5.14e-04 | 4291.93 ms | 31.5% bf16 MFU | 124748 tok/s step 8574/19560 | loss 3.403408 (-1.30z)| norm 0.2466 (-0.50z)| lr 5.14e-04 | 4366.75 ms | 30.9% bf16 MFU | 124514 tok/s step 8575/19560 | loss 3.404395 (-1.27z)| norm 0.2378 (-1.07z)| lr 5.14e-04 | 4168.03 ms | 32.4% bf16 MFU | 124577 tok/s step 8576/19560 | loss 3.443341 (-0.22z)| norm 0.2497 (-0.26z)| lr 5.14e-04 | 4182.01 ms | 32.3% bf16 MFU | 124617 tok/s step 
8577/19560 | loss 3.425978 (-0.68z)| norm 0.2478 (-0.39z)| lr 5.14e-04 | 4178.65 ms | 32.3% bf16 MFU | 124659 tok/s step 8578/19560 | loss 3.421474 (-0.79z)| norm 0.2390 (-0.97z)| lr 5.14e-04 | 4198.60 ms | 32.2% bf16 MFU | 124670 tok/s step 8579/19560 | loss 3.481130 (+0.80z)| norm 0.2633 (+0.67z)| lr 5.14e-04 | 4168.04 ms | 32.4% bf16 MFU | 124726 tok/s step 8580/19560 | loss 3.436470 (-0.39z)| norm 0.2401 (-0.88z)| lr 5.14e-04 | 4196.38 ms | 32.2% bf16 MFU | 124736 tok/s step 8581/19560 | loss 3.466319 (+0.42z)| norm 0.2489 (-0.28z)| lr 5.14e-04 | 4184.62 ms | 32.3% bf16 MFU | 124764 tok/s step 8582/19560 | loss 3.401994 (-1.30z)| norm 0.2143 (-2.54z)| lr 5.13e-04 | 4177.48 ms | 32.3% bf16 MFU | 124801 tok/s step 8583/19560 | loss 3.434824 (-0.41z)| norm 0.2623 (+0.63z)| lr 5.13e-04 | 4171.77 ms | 32.4% bf16 MFU | 124845 tok/s step 8584/19560 | loss 3.399096 (-1.36z)| norm 0.2515 (-0.08z)| lr 5.13e-04 | 4184.17 ms | 32.3% bf16 MFU | 124868 tok/s step 8585/19560 | loss 3.477025 (+0.73z)| norm 0.2412 (-0.76z)| lr 5.13e-04 | 4333.47 ms | 31.2% bf16 MFU | 124674 tok/s step 8586/19560 | loss 3.386545 (-1.71z)| norm 0.2406 (-0.78z)| lr 5.13e-04 | 4172.24 ms | 32.4% bf16 MFU | 124723 tok/s step 8587/19560 | loss 3.450357 (+0.02z)| norm 0.2600 (+0.49z)| lr 5.13e-04 | 4174.53 ms | 32.3% bf16 MFU | 124766 tok/s step 8588/19560 | loss 3.428372 (-0.57z)| norm 0.2447 (-0.52z)| lr 5.13e-04 | 4190.63 ms | 32.2% bf16 MFU | 124784 tok/s step 8589/19560 | loss 3.425363 (-0.65z)| norm 0.2476 (-0.33z)| lr 5.13e-04 | 4169.16 ms | 32.4% bf16 MFU | 124832 tok/s step 8590/19560 | loss 3.409710 (-1.05z)| norm 0.2453 (-0.49z)| lr 5.13e-04 | 4161.11 ms | 32.4% bf16 MFU | 124890 tok/s step 8591/19560 | loss 3.419810 (-0.77z)| norm 0.2374 (-1.01z)| lr 5.13e-04 | 4192.98 ms | 32.2% bf16 MFU | 124898 tok/s step 8592/19560 | loss 3.405455 (-1.14z)| norm 0.2287 (-1.56z)| lr 5.13e-04 | 4165.51 ms | 32.4% bf16 MFU | 124946 tok/s step 8593/19560 | loss 3.459154 (+0.30z)| norm 0.2527 (+0.04z)| lr 
5.13e-04 | 4175.20 ms | 32.3% bf16 MFU | 124977 tok/s step 8594/19560 | loss 3.441731 (-0.16z)| norm 0.2401 (-0.79z)| lr 5.13e-04 | 4192.38 ms | 32.2% bf16 MFU | 124981 tok/s step 8595/19560 | loss 3.441468 (-0.17z)| norm 0.2470 (-0.33z)| lr 5.13e-04 | 4175.14 ms | 32.3% bf16 MFU | 125011 tok/s step 8596/19560 | loss 3.392096 (-1.49z)| norm 0.2261 (-1.69z)| lr 5.13e-04 | 4190.50 ms | 32.2% bf16 MFU | 125016 tok/s step 8597/19560 | loss 3.484354 (+0.97z)| norm 0.2547 (+0.20z)| lr 5.12e-04 | 4182.90 ms | 32.3% bf16 MFU | 125032 tok/s step 8598/19560 | loss 3.386673 (-1.61z)| norm 0.2575 (+0.38z)| lr 5.12e-04 | 4184.39 ms | 32.3% bf16 MFU | 125046 tok/s step 8599/19560 | loss 3.454032 (+0.20z)| norm 0.2504 (-0.09z)| lr 5.12e-04 | 4172.80 ms | 32.4% bf16 MFU | 125075 tok/s step 8600/19560 | loss 3.405205 (-1.11z)| norm 0.2603 (+0.56z)| lr 5.12e-04 | 4169.16 ms | 32.4% bf16 MFU | 125109 tok/s step 8601/19560 | loss 3.427126 (-0.51z)| norm 0.2898 (+2.46z)| lr 5.12e-04 | 4177.62 ms | 32.3% bf16 MFU | 125129 tok/s step 8602/19560 | loss 3.402267 (-1.18z)| norm 0.2361 (-1.08z)| lr 5.12e-04 | 4177.15 ms | 32.3% bf16 MFU | 125148 tok/s step 8603/19560 | loss 3.439420 (-0.17z)| norm 0.2715 (+1.24z)| lr 5.12e-04 | 4223.16 ms | 32.0% bf16 MFU | 125098 tok/s step 8604/19560 | loss 3.392894 (-1.41z)| norm 0.2705 (+1.17z)| lr 5.12e-04 | 4199.64 ms | 32.1% bf16 MFU | 125085 tok/s step 8605/19560 | loss 3.452640 (+0.19z)| norm 0.2512 (-0.11z)| lr 5.12e-04 | 4176.48 ms | 32.3% bf16 MFU | 125108 tok/s step 8606/19560 | loss 3.382421 (-1.66z)| norm 0.2413 (-0.77z)| lr 5.12e-04 | 4187.35 ms | 32.2% bf16 MFU | 125113 tok/s step 8607/19560 | loss 3.449876 (+0.15z)| norm 0.2555 (+0.19z)| lr 5.12e-04 | 4198.18 ms | 32.2% bf16 MFU | 125101 tok/s step 8608/19560 | loss 3.454999 (+0.30z)| norm 0.2272 (-1.70z)| lr 5.12e-04 | 4183.70 ms | 32.3% bf16 MFU | 125112 tok/s step 8609/19560 | loss 3.421183 (-0.62z)| norm 0.2567 (+0.28z)| lr 5.12e-04 | 4182.02 ms | 32.3% bf16 MFU | 125125 tok/s step 
8610/19560 | loss 3.415105 (-0.78z)| norm 0.2438 (-0.59z)| lr 5.12e-04 | 4167.28 ms | 32.4% bf16 MFU | 125159 tok/s step 8611/19560 | loss 3.456541 (+0.35z)| norm 0.2555 (+0.19z)| lr 5.12e-04 | 4177.12 ms | 32.3% bf16 MFU | 125177 tok/s step 8612/19560 | loss 3.402844 (-1.10z)| norm 0.2619 (+0.62z)| lr 5.12e-04 | 4201.19 ms | 32.1% bf16 MFU | 125158 tok/s step 8613/19560 | loss 3.473063 (+0.80z)| norm 0.2268 (-1.74z)| lr 5.11e-04 | 4173.73 ms | 32.3% bf16 MFU | 125181 tok/s step 8614/19560 | loss 3.447787 (+0.12z)| norm 0.2610 (+0.55z)| lr 5.11e-04 | 4185.88 ms | 32.3% bf16 MFU | 125184 tok/s step 8615/19560 | loss 3.421764 (-0.58z)| norm 0.2786 (+1.70z)| lr 5.11e-04 | 4188.09 ms | 32.2% bf16 MFU | 125184 tok/s step 8616/19560 | loss 3.417160 (-0.70z)| norm 0.2391 (-0.92z)| lr 5.11e-04 | 4188.34 ms | 32.2% bf16 MFU | 125184 tok/s step 8617/19560 | loss 3.413225 (-0.79z)| norm 0.2474 (-0.38z)| lr 5.11e-04 | 4157.89 ms | 32.5% bf16 MFU | 125229 tok/s step 8618/19560 | loss 3.420880 (-0.57z)| norm 0.2446 (-0.56z)| lr 5.11e-04 | 4191.34 ms | 32.2% bf16 MFU | 125222 tok/s step 8619/19560 | loss 3.385273 (-1.52z)| norm 0.2367 (-1.08z)| lr 5.11e-04 | 4173.46 ms | 32.4% bf16 MFU | 125243 tok/s step 8620/19560 | loss 3.402345 (-1.04z)| norm 0.2765 (+1.54z)| lr 5.11e-04 | 4186.55 ms | 32.3% bf16 MFU | 125242 tok/s step 8621/19560 | loss 3.404707 (-0.98z)| norm 0.2414 (-0.78z)| lr 5.11e-04 | 4268.31 ms | 31.6% bf16 MFU | 125121 tok/s step 8622/19560 | loss 3.419010 (-0.58z)| norm 0.2740 (+1.38z)| lr 5.11e-04 | 4193.13 ms | 32.2% bf16 MFU | 125117 tok/s step 8623/19560 | loss 3.447526 (+0.20z)| norm 0.2944 (+2.66z)| lr 5.11e-04 | 4187.61 ms | 32.2% bf16 MFU | 125121 tok/s step 8624/19560 | loss 3.455655 (+0.43z)| norm 0.2471 (-0.40z)| lr 5.11e-04 | 4184.24 ms | 32.3% bf16 MFU | 125130 tok/s step 8625/19560 | loss 3.524085 (+2.24z)| norm 0.2635 (+0.68z)| lr 5.11e-04 | 4164.08 ms | 32.4% bf16 MFU | 125169 tok/s step 8626/19560 | loss 3.510477 (+1.84z)| norm 0.2636 (+0.69z)| lr 
5.11e-04 | 4184.83 ms | 32.3% bf16 MFU | 125175 tok/s step 8627/19560 | loss 3.427660 (-0.36z)| norm 0.2531 (-0.02z)| lr 5.11e-04 | 4233.17 ms | 31.9% bf16 MFU | 125109 tok/s step 8628/19560 | loss 3.419888 (-0.56z)| norm 0.2398 (-0.90z)| lr 5.10e-04 | 4185.94 ms | 32.3% bf16 MFU | 125116 tok/s step 8629/19560 | loss 3.397643 (-1.15z)| norm 0.2357 (-1.16z)| lr 5.10e-04 | 4179.45 ms | 32.3% bf16 MFU | 125132 tok/s step 8630/19560 | loss 3.434400 (-0.17z)| norm 0.2448 (-0.55z)| lr 5.10e-04 | 4177.69 ms | 32.3% bf16 MFU | 125150 tok/s step 8631/19560 | loss 3.453389 (+0.33z)| norm 0.2304 (-1.48z)| lr 5.10e-04 | 4188.31 ms | 32.2% bf16 MFU | 125152 tok/s step 8632/19560 | loss 3.405949 (-0.92z)| norm 0.2291 (-1.55z)| lr 5.10e-04 | 4207.47 ms | 32.1% bf16 MFU | 125125 tok/s step 8633/19560 | loss 3.443554 (+0.08z)| norm 0.2307 (-1.43z)| lr 5.10e-04 | 4174.46 ms | 32.3% bf16 MFU | 125148 tok/s step 8634/19560 | loss 3.462152 (+0.56z)| norm 0.2287 (-1.54z)| lr 5.10e-04 | 4171.22 ms | 32.4% bf16 MFU | 125175 tok/s step 8635/19560 | loss 3.400466 (-1.07z)| norm 0.2272 (-1.62z)| lr 5.10e-04 | 4198.98 ms | 32.2% bf16 MFU | 125160 tok/s step 8636/19560 | loss 3.455908 (+0.41z)| norm 0.2412 (-0.71z)| lr 5.10e-04 | 4179.04 ms | 32.3% bf16 MFU | 125174 tok/s step 8637/19560 | loss 3.400947 (-1.07z)| norm 0.2384 (-0.88z)| lr 5.10e-04 | 4210.35 ms | 32.1% bf16 MFU | 125142 tok/s step 8638/19560 | loss 3.387389 (-1.41z)| norm 0.2270 (-1.58z)| lr 5.10e-04 | 4195.76 ms | 32.2% bf16 MFU | 125133 tok/s step 8639/19560 | loss 3.426107 (-0.38z)| norm 0.2489 (-0.19z)| lr 5.10e-04 | 4179.41 ms | 32.3% bf16 MFU | 125148 tok/s step 8640/19560 | loss 3.467676 (+0.72z)| norm 0.2420 (-0.62z)| lr 5.10e-04 | 4168.21 ms | 32.4% bf16 MFU | 125180 tok/s step 8641/19560 | loss 3.431371 (-0.26z)| norm 0.2477 (-0.26z)| lr 5.10e-04 | 4183.08 ms | 32.3% bf16 MFU | 125188 tok/s step 8642/19560 | loss 3.399750 (-1.09z)| norm 0.2408 (-0.68z)| lr 5.10e-04 | 4166.11 ms | 32.4% bf16 MFU | 125221 tok/s step 
8643/19560 | loss 3.444166 (+0.10z)| norm 0.2535 (+0.13z)| lr 5.09e-04 | 4181.70 ms | 32.3% bf16 MFU | 125228 tok/s step 8644/19560 | loss 3.432618 (-0.21z)| norm 0.2572 (+0.36z)| lr 5.09e-04 | 4194.78 ms | 32.2% bf16 MFU | 125216 tok/s step 8645/19560 | loss 3.504255 (+1.70z)| norm 0.2393 (-0.78z)| lr 5.09e-04 | 4213.97 ms | 32.0% bf16 MFU | 125176 tok/s step 8646/19560 | loss 3.452247 (+0.30z)| norm 0.2569 (+0.33z)| lr 5.09e-04 | 4179.08 ms | 32.3% bf16 MFU | 125190 tok/s step 8647/19560 | loss 3.379803 (-1.61z)| norm 0.2649 (+0.84z)| lr 5.09e-04 | 4547.63 ms | 29.7% bf16 MFU | 124695 tok/s step 8648/19560 | loss 3.421331 (-0.50z)| norm 0.2570 (+0.33z)| lr 5.09e-04 | 4226.47 ms | 31.9% bf16 MFU | 124663 tok/s step 8649/19560 | loss 3.482278 (+1.14z)| norm 0.2618 (+0.64z)| lr 5.09e-04 | 4214.74 ms | 32.0% bf16 MFU | 124649 tok/s step 8650/19560 | loss 3.463180 (+0.61z)| norm 0.2309 (-1.34z)| lr 5.09e-04 | 4325.68 ms | 31.2% bf16 MFU | 124477 tok/s step 8651/19560 | loss 3.374389 (-1.73z)| norm 0.2551 (+0.22z)| lr 5.09e-04 | 4173.70 ms | 32.3% bf16 MFU | 124534 tok/s step 8652/19560 | loss 3.406930 (-0.86z)| norm 0.2511 (-0.05z)| lr 5.09e-04 | 4170.28 ms | 32.4% bf16 MFU | 124593 tok/s step 8653/19560 | loss 3.483764 (+1.16z)| norm 0.2564 (+0.30z)| lr 5.09e-04 | 4190.00 ms | 32.2% bf16 MFU | 124620 tok/s step 8654/19560 | loss 3.433395 (-0.16z)| norm 0.2407 (-0.71z)| lr 5.09e-04 | 4220.42 ms | 32.0% bf16 MFU | 124601 tok/s step 8655/19560 | loss 3.436922 (-0.06z)| norm 0.2320 (-1.26z)| lr 5.09e-04 | 4168.66 ms | 32.4% bf16 MFU | 124659 tok/s step 8656/19560 | loss 3.440586 (+0.04z)| norm 0.2584 (+0.45z)| lr 5.09e-04 | 4170.38 ms | 32.4% bf16 MFU | 124712 tok/s step 8657/19560 | loss 3.508028 (+1.81z)| norm 0.2558 (+0.28z)| lr 5.09e-04 | 4208.05 ms | 32.1% bf16 MFU | 124706 tok/s step 8658/19560 | loss 3.428539 (-0.29z)| norm 0.2484 (-0.20z)| lr 5.09e-04 | 4186.48 ms | 32.3% bf16 MFU | 124732 tok/s step 8659/19560 | loss 3.440232 (+0.01z)| norm 0.2762 (+1.58z)| lr 
5.08e-04 | 4164.33 ms | 32.4% bf16 MFU | 124791 tok/s step 8660/19560 | loss 3.471883 (+0.87z)| norm 0.2577 (+0.39z)| lr 5.08e-04 | 4202.32 ms | 32.1% bf16 MFU | 124789 tok/s step 8661/19560 | loss 3.408268 (-0.83z)| norm 0.2711 (+1.26z)| lr 5.08e-04 | 4177.89 ms | 32.3% bf16 MFU | 124824 tok/s step 8662/19560 | loss 3.425970 (-0.35z)| norm 0.2446 (-0.47z)| lr 5.08e-04 | 4176.35 ms | 32.3% bf16 MFU | 124860 tok/s step 8663/19560 | loss 3.487832 (+1.27z)| norm 0.2782 (+1.69z)| lr 5.08e-04 | 4165.38 ms | 32.4% bf16 MFU | 124910 tok/s step 8664/19560 | loss 3.414558 (-0.67z)| norm 0.2653 (+0.90z)| lr 5.08e-04 | 4171.55 ms | 32.4% bf16 MFU | 124949 tok/s step 8665/19560 | loss 3.447299 (+0.24z)| norm 0.2463 (-0.34z)| lr 5.08e-04 | 4179.04 ms | 32.3% bf16 MFU | 124974 tok/s step 8666/19560 | loss 3.379926 (-1.64z)| norm 0.2495 (-0.11z)| lr 5.08e-04 | 4235.68 ms | 31.9% bf16 MFU | 124914 tok/s step 8667/19560 | loss 3.457505 (+0.56z)| norm 0.2418 (-0.63z)| lr 5.08e-04 | 4177.19 ms | 32.3% bf16 MFU | 124944 tok/s step 8668/19560 | loss 3.395937 (-1.18z)| norm 0.2574 (+0.47z)| lr 5.08e-04 | 4249.73 ms | 31.8% bf16 MFU | 124866 tok/s step 8669/19560 | loss 3.466152 (+0.84z)| norm 0.2404 (-0.73z)| lr 5.08e-04 | 4171.70 ms | 32.4% bf16 MFU | 124906 tok/s step 8670/19560 | loss 3.422700 (-0.40z)| norm 0.2418 (-0.62z)| lr 5.08e-04 | 4177.95 ms | 32.3% bf16 MFU | 124935 tok/s step 8671/19560 | loss 3.483347 (+1.36z)| norm 0.2875 (+2.56z)| lr 5.08e-04 | 4187.35 ms | 32.2% bf16 MFU | 124949 tok/s step 8672/19560 | loss 3.404154 (-0.95z)| norm 0.2540 (+0.22z)| lr 5.08e-04 | 4230.68 ms | 31.9% bf16 MFU | 124898 tok/s step 8673/19560 | loss 3.430727 (-0.16z)| norm 0.2422 (-0.59z)| lr 5.08e-04 | 4174.07 ms | 32.3% bf16 MFU | 124933 tok/s step 8674/19560 | loss 3.412367 (-0.69z)| norm 0.2301 (-1.41z)| lr 5.07e-04 | 4165.35 ms | 32.4% bf16 MFU | 124980 tok/s step 8675/19560 | loss 3.463397 (+0.84z)| norm 0.2590 (+0.59z)| lr 5.07e-04 | 4182.83 ms | 32.3% bf16 MFU | 124998 tok/s step 
8676/19560 | loss 3.424946 (-0.30z)| norm 0.2481 (-0.16z)| lr 5.07e-04 | 4170.09 ms | 32.4% bf16 MFU | 125035 tok/s step 8677/19560 | loss 3.425451 (-0.27z)| norm 0.2541 (+0.25z)| lr 5.07e-04 | 4169.69 ms | 32.4% bf16 MFU | 125070 tok/s step 8678/19560 | loss 3.426530 (-0.23z)| norm 0.2683 (+1.25z)| lr 5.07e-04 | 4302.31 ms | 31.4% bf16 MFU | 124909 tok/s step 8679/19560 | loss 3.447789 (+0.41z)| norm 0.2513 (+0.06z)| lr 5.07e-04 | 4226.59 ms | 31.9% bf16 MFU | 124866 tok/s step 8680/19560 | loss 3.451299 (+0.52z)| norm 0.2580 (+0.52z)| lr 5.07e-04 | 4178.72 ms | 32.3% bf16 MFU | 124896 tok/s step 8681/19560 | loss 3.453042 (+0.60z)| norm 0.2803 (+2.04z)| lr 5.07e-04 | 4173.40 ms | 32.4% bf16 MFU | 124933 tok/s step 8682/19560 | loss 3.425692 (-0.25z)| norm 0.2608 (+0.69z)| lr 5.07e-04 | 4176.38 ms | 32.3% bf16 MFU | 124963 tok/s step 8683/19560 | loss 3.380740 (-1.68z)| norm 0.2888 (+2.55z)| lr 5.07e-04 | 4175.24 ms | 32.3% bf16 MFU | 124993 tok/s step 8684/19560 | loss 3.378789 (-1.71z)| norm 0.2672 (+1.09z)| lr 5.07e-04 | 4166.43 ms | 32.4% bf16 MFU | 125035 tok/s step 8685/19560 | loss 3.422528 (-0.31z)| norm 0.2548 (+0.26z)| lr 5.07e-04 | 4187.01 ms | 32.2% bf16 MFU | 125044 tok/s step 8686/19560 | loss 3.409923 (-0.73z)| norm 0.2462 (-0.31z)| lr 5.07e-04 | 4217.25 ms | 32.0% bf16 MFU | 125008 tok/s step 8687/19560 | loss 3.475297 (+1.36z)| norm 0.2664 (+1.06z)| lr 5.07e-04 | 4614.52 ms | 29.3% bf16 MFU | 124439 tok/s step 8688/19560 | loss 3.452253 (+0.61z)| norm 0.2402 (-0.70z)| lr 5.07e-04 | 4745.71 ms | 28.5% bf16 MFU | 123741 tok/s step 8689/19560 | loss 3.448675 (+0.50z)| norm 0.2493 (-0.09z)| lr 5.06e-04 | 4440.84 ms | 30.4% bf16 MFU | 123457 tok/s step 8690/19560 | loss 3.436780 (+0.11z)| norm 0.2264 (-1.62z)| lr 5.06e-04 | 4965.43 ms | 27.2% bf16 MFU | 122563 tok/s step 8691/19560 | loss 3.446460 (+0.43z)| norm 0.2743 (+1.58z)| lr 5.06e-04 | 4288.69 ms | 31.5% bf16 MFU | 122547 tok/s step 8692/19560 | loss 3.448841 (+0.50z)| norm 0.2349 (-1.03z)| lr 
5.06e-04 | 4249.61 ms | 31.8% bf16 MFU | 122589 tok/s step 8693/19560 | loss 3.436722 (+0.11z)| norm 0.2585 (+0.55z)| lr 5.06e-04 | 4303.49 ms | 31.4% bf16 MFU | 122551 tok/s step 8694/19560 | loss 3.432228 (-0.03z)| norm 0.2550 (+0.31z)| lr 5.06e-04 | 4225.04 ms | 32.0% bf16 MFU | 122628 tok/s step 8695/19560 | loss 3.545200 (+3.44z)| norm 0.2598 (+0.62z)| lr 5.06e-04 | 4275.61 ms | 31.6% bf16 MFU | 122627 tok/s step 8696/19560 | loss 3.492518 (+1.77z)| norm 0.2496 (-0.08z)| lr 5.06e-04 | 4292.36 ms | 31.5% bf16 MFU | 122603 tok/s step 8697/19560 | loss 3.395840 (-1.17z)| norm 0.2429 (-0.53z)| lr 5.06e-04 | 4167.83 ms | 32.4% bf16 MFU | 122763 tok/s step 8698/19560 | loss 3.439274 (+0.16z)| norm 0.2576 (+0.46z)| lr 5.06e-04 | 4424.54 ms | 30.5% bf16 MFU | 122549 tok/s step 8699/19560 | loss 3.433782 (-0.00z)| norm 0.2466 (-0.28z)| lr 5.06e-04 | 4173.79 ms | 32.3% bf16 MFU | 122703 tok/s step 8700/19560 | loss 3.431710 (-0.08z)| norm 0.2822 (+2.09z)| lr 5.06e-04 | 4241.84 ms | 31.8% bf16 MFU | 122748 tok/s step 8701/19560 | loss 3.467936 (+1.05z)| norm 0.2829 (+2.08z)| lr 5.06e-04 | 4169.90 ms | 32.4% bf16 MFU | 122897 tok/s step 8702/19560 | loss 3.485912 (+1.58z)| norm 0.2504 (-0.06z)| lr 5.06e-04 | 4162.33 ms | 32.4% bf16 MFU | 123050 tok/s step 8703/19560 | loss 3.468687 (+1.03z)| norm 0.2412 (-0.67z)| lr 5.06e-04 | 4172.04 ms | 32.4% bf16 MFU | 123181 tok/s step 8704/19560 | loss 3.400814 (-1.05z)| norm 0.2593 (+0.52z)| lr 5.06e-04 | 4163.78 ms | 32.4% bf16 MFU | 123318 tok/s step 8705/19560 | loss 3.460478 (+0.77z)| norm 0.2563 (+0.32z)| lr 5.05e-04 | 4170.24 ms | 32.4% bf16 MFU | 123438 tok/s step 8706/19560 | loss 3.510163 (+2.23z)| norm 0.2770 (+1.65z)| lr 5.05e-04 | 4176.11 ms | 32.3% bf16 MFU | 123543 tok/s step 8707/19560 | loss 3.405481 (-0.90z)| norm 0.2348 (-1.09z)| lr 5.05e-04 | 4240.29 ms | 31.8% bf16 MFU | 123548 tok/s step 8708/19560 | loss 3.449265 (+0.42z)| norm 0.2361 (-1.00z)| lr 5.05e-04 | 4170.94 ms | 32.4% bf16 MFU | 123656 tok/s step 
8709/19560 | loss 3.365036 (-2.07z)| norm 0.2265 (-1.60z)| lr 5.05e-04 | 4199.59 ms | 32.2% bf16 MFU | 123715 tok/s step 8710/19560 | loss 3.465114 (+0.89z)| norm 0.2714 (+1.28z)| lr 5.05e-04 | 4211.52 ms | 32.1% bf16 MFU | 123754 tok/s step 8711/19560 | loss 3.536881 (+2.90z)| norm 0.2862 (+2.20z)| lr 5.05e-04 | 4173.95 ms | 32.3% bf16 MFU | 123847 tok/s step 8712/19560 | loss 3.421948 (-0.41z)| norm 0.2604 (+0.54z)| lr 5.05e-04 | 4270.60 ms | 31.6% bf16 MFU | 123793 tok/s step 8713/19560 | loss 3.439395 (+0.10z)| norm 0.2468 (-0.34z)| lr 5.05e-04 | 4178.15 ms | 32.3% bf16 MFU | 123877 tok/s step 8714/19560 | loss 3.485777 (+1.43z)| norm 0.2553 (+0.20z)| lr 5.05e-04 | 4204.37 ms | 32.1% bf16 MFU | 123918 tok/s step 8715/19560 | loss 3.443587 (+0.20z)| norm 0.2762 (+1.52z)| lr 5.05e-04 | 4182.79 ms | 32.3% bf16 MFU | 123990 tok/s step 8716/19560 | loss 3.508042 (+2.03z)| norm 0.2320 (-1.28z)| lr 5.05e-04 | 4235.25 ms | 31.9% bf16 MFU | 123980 tok/s step 8717/19560 | loss 3.449288 (+0.34z)| norm 0.2867 (+2.13z)| lr 5.05e-04 | 4194.59 ms | 32.2% bf16 MFU | 124030 tok/s step 8718/19560 | loss 3.390516 (-1.33z)| norm 0.2543 (+0.11z)| lr 5.05e-04 | 4169.81 ms | 32.4% bf16 MFU | 124115 tok/s step 8719/19560 | loss 3.443146 (+0.16z)| norm 0.2370 (-0.98z)| lr 5.05e-04 | 4182.19 ms | 32.3% bf16 MFU | 124178 tok/s step 8720/19560 | loss 3.407509 (-0.85z)| norm 0.2545 (+0.11z)| lr 5.04e-04 | 4173.98 ms | 32.3% bf16 MFU | 124249 tok/s step 8721/19560 | loss 3.471147 (+0.96z)| norm 0.2454 (-0.46z)| lr 5.04e-04 | 4177.82 ms | 32.3% bf16 MFU | 124312 tok/s step 8722/19560 | loss 3.411507 (-0.73z)| norm 0.2501 (-0.17z)| lr 5.04e-04 | 4178.61 ms | 32.3% bf16 MFU | 124369 tok/s step 8723/19560 | loss 3.479843 (+1.19z)| norm 0.2411 (-0.74z)| lr 5.04e-04 | 4182.08 ms | 32.3% bf16 MFU | 124419 tok/s step 8724/19560 | loss 3.395830 (-1.18z)| norm 0.2429 (-0.64z)| lr 5.04e-04 | 4174.46 ms | 32.3% bf16 MFU | 124478 tok/s step 8725/19560 | loss 3.401624 (-1.00z)| norm 0.2366 (-1.02z)| lr 
5.04e-04 | 4292.63 ms | 31.5% bf16 MFU | 124361 tok/s step 8726/19560 | loss 3.448941 (+0.33z)| norm 0.2450 (-0.49z)| lr 5.04e-04 | 4170.36 ms | 32.4% bf16 MFU | 124429 tok/s step 8727/19560 | loss 3.425037 (-0.35z)| norm 0.2380 (-0.92z)| lr 5.04e-04 | 4194.75 ms | 32.2% bf16 MFU | 124457 tok/s step 8728/19560 | loss 3.430722 (-0.19z)| norm 0.2186 (-2.09z)| lr 5.04e-04 | 4177.42 ms | 32.3% bf16 MFU | 124509 tok/s step 8729/19560 | loss 3.485929 (+1.37z)| norm 0.2716 (+1.23z)| lr 5.04e-04 | 4216.13 ms | 32.0% bf16 MFU | 124501 tok/s step 8730/19560 | loss 3.401284 (-1.04z)| norm 0.2786 (+1.64z)| lr 5.04e-04 | 4181.94 ms | 32.3% bf16 MFU | 124545 tok/s step 8731/19560 | loss 3.530953 (+2.56z)| norm 0.2611 (+0.55z)| lr 5.04e-04 | 4205.25 ms | 32.1% bf16 MFU | 124551 tok/s step 8732/19560 | loss 3.428980 (-0.28z)| norm 0.2955 (+2.64z)| lr 5.04e-04 | 4199.85 ms | 32.1% bf16 MFU | 124565 tok/s step 8733/19560 | loss 3.448063 (+0.26z)| norm 0.2918 (+2.34z)| lr 5.04e-04 | 4172.43 ms | 32.4% bf16 MFU | 124620 tok/s step 8734/19560 | loss 3.403474 (-1.00z)| norm 0.2841 (+1.83z)| lr 5.04e-04 | 4165.42 ms | 32.4% bf16 MFU | 124682 tok/s step 8735/19560 | loss 3.384514 (-1.50z)| norm 0.2872 (+1.97z)| lr 5.03e-04 | 4176.59 ms | 32.3% bf16 MFU | 124725 tok/s step 8736/19560 | loss 3.420146 (-0.50z)| norm 0.2542 (+0.03z)| lr 5.03e-04 | 4172.85 ms | 32.4% bf16 MFU | 124770 tok/s step 8737/19560 | loss 3.360118 (-2.12z)| norm 0.2802 (+1.54z)| lr 5.03e-04 | 4196.45 ms | 32.2% bf16 MFU | 124779 tok/s step 8738/19560 | loss 3.474233 (+0.98z)| norm 0.2504 (-0.21z)| lr 5.03e-04 | 4184.91 ms | 32.3% bf16 MFU | 124804 tok/s step 8739/19560 | loss 3.435799 (-0.06z)| norm 0.2554 (+0.09z)| lr 5.03e-04 | 4171.14 ms | 32.4% bf16 MFU | 124848 tok/s step 8740/19560 | loss 3.436620 (-0.05z)| norm 0.2633 (+0.55z)| lr 5.03e-04 | 4180.15 ms | 32.3% bf16 MFU | 124877 tok/s step 8741/19560 | loss 3.360807 (-2.07z)| norm 0.2433 (-0.64z)| lr 5.03e-04 | 4180.82 ms | 32.3% bf16 MFU | 124903 tok/s step 
8742/19560 | loss 3.438799 (+0.04z)| norm 0.2719 (+1.05z)| lr 5.03e-04 | 4179.37 ms | 32.3% bf16 MFU | 124931 tok/s step 8743/19560 | loss 3.481166 (+1.17z)| norm 0.2321 (-1.28z)| lr 5.03e-04 | 4196.04 ms | 32.2% bf16 MFU | 124931 tok/s step 8744/19560 | loss 3.381747 (-1.49z)| norm 0.2752 (+1.25z)| lr 5.03e-04 | 4168.48 ms | 32.4% bf16 MFU | 124974 tok/s step 8745/19560 | loss 3.404656 (-0.88z)| norm 0.2315 (-1.31z)| lr 5.03e-04 | 4185.23 ms | 32.3% bf16 MFU | 124988 tok/s step 8746/19560 | loss 3.530600 (+2.41z)| norm 0.2715 (+1.01z)| lr 5.03e-04 | 4178.64 ms | 32.3% bf16 MFU | 125012 tok/s step 8747/19560 | loss 3.405542 (-0.86z)| norm 0.2634 (+0.53z)| lr 5.03e-04 | 4173.37 ms | 32.4% bf16 MFU | 125043 tok/s step 8748/19560 | loss 3.442035 (+0.08z)| norm 0.2853 (+1.79z)| lr 5.03e-04 | 4173.40 ms | 32.4% bf16 MFU | 125072 tok/s step 8749/19560 | loss 3.450654 (+0.30z)| norm 0.2564 (+0.11z)| lr 5.03e-04 | 4171.84 ms | 32.4% bf16 MFU | 125102 tok/s step 8750/19560 | loss 3.453733 (+0.38z)| norm 0.2540 (-0.02z)| lr 5.03e-04 | 4165.37 ms | 32.4% bf16 MFU | 125141 tok/s val loss 3.412533
HellaSwag: 2916/10042 = 0.290380 step 8751/19560 | loss 3.409938 (-0.77z)| norm 0.2526 (-0.09z)| lr 5.02e-04 | 4305.50 ms | 31.4% bf16 MFU | 124972 tok/s step 8752/19560 | loss 3.450240 (+0.29z)| norm 0.2744 (+1.20z)| lr 5.02e-04 | 4253.83 ms | 31.7% bf16 MFU | 124886 tok/s step 8753/19560 | loss 3.429935 (-0.23z)| norm 0.2451 (-0.53z)| lr 5.02e-04 | 4329.73 ms | 31.2% bf16 MFU | 124696 tok/s step 8754/19560 | loss 3.418420 (-0.52z)| norm 0.2507 (-0.20z)| lr 5.02e-04 | 4280.01 ms | 31.5% bf16 MFU | 124586 tok/s step 8755/19560 | loss 3.366371 (-1.90z)| norm 0.2398 (-0.84z)| lr 5.02e-04 | 4169.40 ms | 32.4% bf16 MFU | 124644 tok/s step 8756/19560 | loss 3.373175 (-1.69z)| norm 0.2706 (+0.98z)| lr 5.02e-04 | 4279.93 ms | 31.5% bf16 MFU | 124537 tok/s step 8757/19560 | loss 3.436845 (-0.01z)| norm 0.2573 (+0.18z)| lr 5.02e-04 | 4345.11 ms | 31.1% bf16 MFU | 124343 tok/s step 8758/19560 | loss 3.432378 (-0.13z)| norm 0.2512 (-0.19z)| lr 5.02e-04 |
4199.32 ms | 32.2% bf16 MFU | 124369 tok/s step 8759/19560 | loss 3.432542 (-0.12z)| norm 0.2742 (+1.17z)| lr 5.02e-04 | 4168.81 ms | 32.4% bf16 MFU | 124439 tok/s step 8760/19560 | loss 3.394428 (-1.13z)| norm 0.2528 (-0.13z)| lr 5.02e-04 | 4158.20 ms | 32.5% bf16 MFU | 124521 tok/s step 8761/19560 | loss 3.435985 (-0.02z)| norm 0.2648 (+0.58z)| lr 5.02e-04 | 4159.32 ms | 32.5% bf16 MFU | 124597 tok/s step 8762/19560 | loss 3.405312 (-0.83z)| norm 0.2777 (+1.35z)| lr 5.02e-04 | 4176.26 ms | 32.3% bf16 MFU | 124645 tok/s step 8763/19560 | loss 3.429033 (-0.20z)| norm 0.2740 (+1.11z)| lr 5.02e-04 | 4158.96 ms | 32.5% bf16 MFU | 124715 tok/s step 8764/19560 | loss 3.434806 (-0.04z)| norm 0.2725 (+1.00z)| lr 5.02e-04 | 4162.87 ms | 32.4% bf16 MFU | 124777 tok/s step 8765/19560 | loss 3.502834 (+1.74z)| norm 0.2897 (+2.01z)| lr 5.02e-04 | 4161.99 ms | 32.4% bf16 MFU | 124837 tok/s step 8766/19560 | loss 3.467233 (+0.78z)| norm 0.2511 (-0.35z)| lr 5.01e-04 | 4279.78 ms | 31.5% bf16 MFU | 124720 tok/s step 8767/19560 | loss 3.471554 (+0.89z)| norm 0.3220 (+3.75z)| lr 5.01e-04 | 4238.91 ms | 31.9% bf16 MFU | 124668 tok/s step 8768/19560 | loss 3.485796 (+1.26z)| norm 0.2559 (-0.09z)| lr 5.01e-04 | 4174.13 ms | 32.3% bf16 MFU | 124715 tok/s step 8769/19560 | loss 3.445935 (+0.20z)| norm 0.2722 (+0.85z)| lr 5.01e-04 | 4164.76 ms | 32.4% bf16 MFU | 124773 tok/s step 8770/19560 | loss 3.373610 (-1.70z)| norm 0.2481 (-0.56z)| lr 5.01e-04 | 4161.20 ms | 32.4% bf16 MFU | 124835 tok/s step 8771/19560 | loss 3.467446 (+0.76z)| norm 0.2616 (+0.22z)| lr 5.01e-04 | 4165.78 ms | 32.4% bf16 MFU | 124886 tok/s step 8772/19560 | loss 3.439283 (+0.02z)| norm 0.2758 (+1.04z)| lr 5.01e-04 | 4150.59 ms | 32.5% bf16 MFU | 124957 tok/s step 8773/19560 | loss 3.408772 (-0.77z)| norm 0.2518 (-0.36z)| lr 5.01e-04 | 4166.44 ms | 32.4% bf16 MFU | 125001 tok/s step 8774/19560 | loss 3.417570 (-0.53z)| norm 0.2834 (+1.45z)| lr 5.01e-04 | 4165.10 ms | 32.4% bf16 MFU | 125045 tok/s step 8775/19560 | 
loss 3.506727 (+1.80z)| norm 0.2857 (+1.56z)| lr 5.01e-04 | 4162.06 ms | 32.4% bf16 MFU | 125091 tok/s step 8776/19560 | loss 3.440187 (+0.04z)| norm 0.2519 (-0.37z)| lr 5.01e-04 | 4216.54 ms | 32.0% bf16 MFU | 125054 tok/s step 8777/19560 | loss 3.367301 (-1.84z)| norm 0.2678 (+0.54z)| lr 5.01e-04 | 4173.02 ms | 32.4% bf16 MFU | 125083 tok/s step 8778/19560 | loss 3.424698 (-0.33z)| norm 0.2548 (-0.22z)| lr 5.01e-04 | 4160.77 ms | 32.5% bf16 MFU | 125129 tok/s step 8779/19560 | loss 3.445137 (+0.19z)| norm 0.2661 (+0.43z)| lr 5.01e-04 | 4168.93 ms | 32.4% bf16 MFU | 125161 tok/s step 8780/19560 | loss 3.458064 (+0.52z)| norm 0.2864 (+1.57z)| lr 5.01e-04 | 4164.02 ms | 32.4% bf16 MFU | 125198 tok/s step 8781/19560 | loss 3.454813 (+0.45z)| norm 0.2626 (+0.21z)| lr 5.00e-04 | 4164.59 ms | 32.4% bf16 MFU | 125233 tok/s step 8782/19560 | loss 3.452039 (+0.37z)| norm 0.2919 (+1.84z)| lr 5.00e-04 | 4169.04 ms | 32.4% bf16 MFU | 125259 tok/s step 8783/19560 | loss 3.448001 (+0.26z)| norm 0.2900 (+1.71z)| lr 5.00e-04 | 4167.87 ms | 32.4% bf16 MFU | 125286 tok/s step 8784/19560 | loss 3.368524 (-1.82z)| norm 0.2418 (-1.00z)| lr 5.00e-04 | 4162.48 ms | 32.4% bf16 MFU | 125319 tok/s step 8785/19560 | loss 3.437806 (+0.02z)| norm 0.2755 (+0.88z)| lr 5.00e-04 | 4196.01 ms | 32.2% bf16 MFU | 125301 tok/s step 8786/19560 | loss 3.484539 (+1.24z)| norm 0.2396 (-1.12z)| lr 5.00e-04 | 4158.58 ms | 32.5% bf16 MFU | 125339 tok/s step 8787/19560 | loss 3.478436 (+1.07z)| norm 0.2589 (-0.04z)| lr 5.00e-04 | 4160.69 ms | 32.5% bf16 MFU | 125373 tok/s step 8788/19560 | loss 3.519204 (+2.10z)| norm 0.2514 (-0.46z)| lr 5.00e-04 | 4170.30 ms | 32.4% bf16 MFU | 125390 tok/s step 8789/19560 | loss 3.466642 (+0.72z)| norm 0.2387 (-1.15z)| lr 5.00e-04 | 4166.27 ms | 32.4% bf16 MFU | 125413 tok/s step 8790/19560 | loss 3.425950 (-0.33z)| norm 0.2546 (-0.27z)| lr 5.00e-04 | 4162.38 ms | 32.4% bf16 MFU | 125440 tok/s step 8791/19560 | loss 3.424684 (-0.36z)| norm 0.2301 (-1.61z)| lr 5.00e-04 | 
4169.59 ms | 32.4% bf16 MFU | 125455 tok/s step 8792/19560 | loss 3.445432 (+0.18z)| norm 0.2485 (-0.58z)| lr 5.00e-04 | 4166.79 ms | 32.4% bf16 MFU | 125474 tok/s step 8793/19560 | loss 3.428894 (-0.25z)| norm 0.2531 (-0.32z)| lr 5.00e-04 | 4162.02 ms | 32.4% bf16 MFU | 125498 tok/s step 8794/19560 | loss 3.439311 (+0.01z)| norm 0.2536 (-0.30z)| lr 5.00e-04 | 4168.87 ms | 32.4% bf16 MFU | 125512 tok/s step 8795/19560 | loss 3.409972 (-0.75z)| norm 0.2641 (+0.28z)| lr 5.00e-04 | 4170.06 ms | 32.4% bf16 MFU | 125522 tok/s step 8796/19560 | loss 3.437442 (-0.04z)| norm 0.2526 (-0.36z)| lr 4.99e-04 | 4165.40 ms | 32.4% bf16 MFU | 125540 tok/s step 8797/19560 | loss 3.429979 (-0.23z)| norm 0.2519 (-0.41z)| lr 4.99e-04 | 4164.74 ms | 32.4% bf16 MFU | 125557 tok/s step 8798/19560 | loss 3.504882 (+1.73z)| norm 0.2599 (+0.03z)| lr 4.99e-04 | 4166.92 ms | 32.4% bf16 MFU | 125570 tok/s step 8799/19560 | loss 3.439001 (+0.00z)| norm 0.2357 (-1.31z)| lr 4.99e-04 | 4174.77 ms | 32.3% bf16 MFU | 125571 tok/s step 8800/19560 | loss 3.440760 (+0.04z)| norm 0.2365 (-1.25z)| lr 4.99e-04 | 4183.72 ms | 32.3% bf16 MFU | 125558 tok/s step 8801/19560 | loss 3.498203 (+1.54z)| norm 0.2652 (+0.35z)| lr 4.99e-04 | 4159.46 ms | 32.5% bf16 MFU | 125583 tok/s step 8802/19560 | loss 3.446616 (+0.18z)| norm 0.2549 (-0.24z)| lr 4.99e-04 | 4165.47 ms | 32.4% bf16 MFU | 125597 tok/s step 8803/19560 | loss 3.433083 (-0.17z)| norm 0.2479 (-0.63z)| lr 4.99e-04 | 4170.02 ms | 32.4% bf16 MFU | 125603 tok/s step 8804/19560 | loss 3.379344 (-1.57z)| norm 0.2583 (-0.05z)| lr 4.99e-04 | 4161.03 ms | 32.4% bf16 MFU | 125623 tok/s step 8805/19560 | loss 3.451747 (+0.32z)| norm 0.2940 (+1.94z)| lr 4.99e-04 | 4172.38 ms | 32.4% bf16 MFU | 125625 tok/s step 8806/19560 | loss 3.442042 (+0.06z)| norm 0.2433 (-0.90z)| lr 4.99e-04 | 4168.90 ms | 32.4% bf16 MFU | 125632 tok/s step 8807/19560 | loss 3.394470 (-1.16z)| norm 0.2919 (+1.79z)| lr 4.99e-04 | 4161.14 ms | 32.4% bf16 MFU | 125650 tok/s step 8808/19560 | 
loss 3.436041 (-0.08z)| norm 0.2514 (-0.45z)| lr 4.99e-04 | 4174.20 ms | 32.3% bf16 MFU | 125648 tok/s step 8809/19560 | loss 3.463190 (+0.62z)| norm 0.2674 (+0.44z)| lr 4.99e-04 | 4170.01 ms | 32.4% bf16 MFU | 125652 tok/s step 8810/19560 | loss 3.415041 (-0.62z)| norm 0.2851 (+1.40z)| lr 4.99e-04 | 4163.33 ms | 32.4% bf16 MFU | 125665 tok/s step 8811/19560 | loss 3.462233 (+0.59z)| norm 0.2664 (+0.38z)| lr 4.99e-04 | 4177.46 ms | 32.3% bf16 MFU | 125657 tok/s step 8812/19560 | loss 3.413857 (-0.69z)| norm 0.2734 (+0.77z)| lr 4.98e-04 | 4168.17 ms | 32.4% bf16 MFU | 125664 tok/s step 8813/19560 | loss 3.435670 (-0.12z)| norm 0.2552 (-0.24z)| lr 4.98e-04 | 4173.42 ms | 32.4% bf16 MFU | 125662 tok/s step 8814/19560 | loss 3.463634 (+0.61z)| norm 0.2665 (+0.38z)| lr 4.98e-04 | 4161.10 ms | 32.4% bf16 MFU | 125679 tok/s step 8815/19560 | loss 3.425214 (-0.40z)| norm 0.2585 (-0.06z)| lr 4.98e-04 | 4172.30 ms | 32.4% bf16 MFU | 125678 tok/s step 8816/19560 | loss 3.449647 (+0.25z)| norm 0.2624 (+0.15z)| lr 4.98e-04 | 4162.03 ms | 32.4% bf16 MFU | 125692 tok/s step 8817/19560 | loss 3.359827 (-2.07z)| norm 0.2444 (-0.86z)| lr 4.98e-04 | 4161.85 ms | 32.4% bf16 MFU | 125706 tok/s step 8818/19560 | loss 3.408712 (-0.79z)| norm 0.2574 (-0.15z)| lr 4.98e-04 | 4167.61 ms | 32.4% bf16 MFU | 125711 tok/s step 8819/19560 | loss 3.470283 (+0.80z)| norm 0.2561 (-0.21z)| lr 4.98e-04 | 4161.06 ms | 32.4% bf16 MFU | 125725 tok/s step 8820/19560 | loss 3.439100 (-0.01z)| norm 0.2680 (+0.45z)| lr 4.98e-04 | 4162.41 ms | 32.4% bf16 MFU | 125737 tok/s step 8821/19560 | loss 3.487569 (+1.23z)| norm 0.2402 (-1.12z)| lr 4.98e-04 | 4180.75 ms | 32.3% bf16 MFU | 125720 tok/s step 8822/19560 | loss 3.439496 (-0.01z)| norm 0.2603 (+0.02z)| lr 4.98e-04 | 4165.73 ms | 32.4% bf16 MFU | 125727 tok/s step 8823/19560 | loss 3.456541 (+0.46z)| norm 0.2409 (-1.08z)| lr 4.98e-04 | 4171.00 ms | 32.4% bf16 MFU | 125726 tok/s step 8824/19560 | loss 3.428720 (-0.26z)| norm 0.2478 (-0.68z)| lr 4.98e-04 | 
4159.84 ms | 32.5% bf16 MFU | 125741 tok/s step 8825/19560 | loss 3.383401 (-1.47z)| norm 0.2300 (-1.67z)| lr 4.98e-04 | 4159.44 ms | 32.5% bf16 MFU | 125757 tok/s step 8826/19560 | loss 3.402665 (-0.94z)| norm 0.2419 (-0.99z)| lr 4.98e-04 | 4172.46 ms | 32.4% bf16 MFU | 125752 tok/s step 8827/19560 | loss 3.469501 (+0.82z)| norm 0.2482 (-0.64z)| lr 4.97e-04 | 4166.04 ms | 32.4% bf16 MFU | 125756 tok/s step 8828/19560 | loss 3.403410 (-0.92z)| norm 0.2565 (-0.17z)| lr 4.97e-04 | 4160.26 ms | 32.5% bf16 MFU | 125770 tok/s step 8829/19560 | loss 3.471065 (+0.86z)| norm 0.2635 (+0.24z)| lr 4.97e-04 | 4152.44 ms | 32.5% bf16 MFU | 125794 tok/s step 8830/19560 | loss 3.411432 (-0.70z)| norm 0.2338 (-1.43z)| lr 4.97e-04 | 4165.35 ms | 32.4% bf16 MFU | 125798 tok/s step 8831/19560 | loss 3.388237 (-1.29z)| norm 0.2434 (-0.89z)| lr 4.97e-04 | 4163.62 ms | 32.4% bf16 MFU | 125804 tok/s step 8832/19560 | loss 3.400159 (-0.97z)| norm 0.2374 (-1.21z)| lr 4.97e-04 | 4166.47 ms | 32.4% bf16 MFU | 125806 tok/s step 8833/19560 | loss 3.490318 (+1.39z)| norm 0.2300 (-1.60z)| lr 4.97e-04 | 4167.79 ms | 32.4% bf16 MFU | 125805 tok/s step 8834/19560 | loss 3.441646 (+0.13z)| norm 0.2418 (-0.93z)| lr 4.97e-04 | 4186.10 ms | 32.3% bf16 MFU | 125777 tok/s step 8835/19560 | loss 3.461643 (+0.65z)| norm 0.2503 (-0.46z)| lr 4.97e-04 | 4174.83 ms | 32.3% bf16 MFU | 125767 tok/s step 8836/19560 | loss 3.450520 (+0.35z)| norm 0.2433 (-0.86z)| lr 4.97e-04 | 4163.54 ms | 32.4% bf16 MFU | 125775 tok/s step 8837/19560 | loss 3.476150 (+1.02z)| norm 0.2375 (-1.20z)| lr 4.97e-04 | 4156.59 ms | 32.5% bf16 MFU | 125793 tok/s step 8838/19560 | loss 3.482765 (+1.19z)| norm 0.2783 (+1.09z)| lr 4.97e-04 | 4163.47 ms | 32.4% bf16 MFU | 125800 tok/s step 8839/19560 | loss 3.480591 (+1.18z)| norm 0.2434 (-0.86z)| lr 4.97e-04 | 4158.62 ms | 32.5% bf16 MFU | 125813 tok/s step 8840/19560 | loss 3.419470 (-0.50z)| norm 0.2505 (-0.45z)| lr 4.97e-04 | 4179.57 ms | 32.3% bf16 MFU | 125795 tok/s step 8841/19560 | 
loss 3.480505 (+1.16z)| norm 0.2532 (-0.30z)| lr 4.97e-04 | 4178.48 ms | 32.3% bf16 MFU | 125779 tok/s step 8842/19560 | loss 3.434379 (-0.09z)| norm 0.2331 (-1.42z)| lr 4.96e-04 | 4171.59 ms | 32.4% bf16 MFU | 125774 tok/s step 8843/19560 | loss 3.387449 (-1.36z)| norm 0.2521 (-0.34z)| lr 4.96e-04 | 4162.88 ms | 32.4% bf16 MFU | 125782 tok/s step 8844/19560 | loss 3.402089 (-0.95z)| norm 0.2471 (-0.63z)| lr 4.96e-04 | 4172.75 ms | 32.4% bf16 MFU | 125775 tok/s step 8845/19560 | loss 3.436979 (+0.02z)| norm 0.2381 (-1.13z)| lr 4.96e-04 | 4169.54 ms | 32.4% bf16 MFU | 125774 tok/s step 8846/19560 | loss 3.429186 (-0.21z)| norm 0.2378 (-1.13z)| lr 4.96e-04 | 4152.67 ms | 32.5% bf16 MFU | 125798 tok/s step 8847/19560 | loss 3.386949 (-1.36z)| norm 0.2561 (-0.10z)| lr 4.96e-04 | 4168.78 ms | 32.4% bf16 MFU | 125796 tok/s step 8848/19560 | loss 3.438510 (+0.06z)| norm 0.2525 (-0.30z)| lr 4.96e-04 | 4159.54 ms | 32.5% bf16 MFU | 125809 tok/s step 8849/19560 | loss 3.438666 (+0.07z)| norm 0.2875 (+1.66z)| lr 4.96e-04 | 4164.77 ms | 32.4% bf16 MFU | 125813 tok/s step 8850/19560 | loss 3.370223 (-1.80z)| norm 0.3103 (+2.83z)| lr 4.96e-04 | 4171.43 ms | 32.4% bf16 MFU | 125806 tok/s step 8851/19560 | loss 3.411947 (-0.64z)| norm 0.2830 (+1.31z)| lr 4.96e-04 | 4168.27 ms | 32.4% bf16 MFU | 125805 tok/s step 8852/19560 | loss 3.481180 (+1.24z)| norm 0.2499 (-0.50z)| lr 4.96e-04 | 4169.65 ms | 32.4% bf16 MFU | 125802 tok/s step 8853/19560 | loss 3.462364 (+0.71z)| norm 0.2953 (+1.94z)| lr 4.96e-04 | 4176.03 ms | 32.3% bf16 MFU | 125789 tok/s step 8854/19560 | loss 3.560262 (+3.25z)| norm 0.2861 (+1.42z)| lr 4.96e-04 | 4178.92 ms | 32.3% bf16 MFU | 125772 tok/s step 8855/19560 | loss 3.474513 (+0.97z)| norm 0.2711 (+0.60z)| lr 4.96e-04 | 4165.18 ms | 32.4% bf16 MFU | 125778 tok/s step 8856/19560 | loss 3.402591 (-0.92z)| norm 0.2493 (-0.61z)| lr 4.96e-04 | 4172.29 ms | 32.4% bf16 MFU | 125772 tok/s step 8857/19560 | loss 3.483471 (+1.21z)| norm 0.2897 (+1.60z)| lr 4.95e-04 | 
4160.73 ms | 32.5% bf16 MFU | 125783 tok/s step 8858/19560 | loss 3.434721 (-0.08z)| norm 0.2807 (+1.11z)| lr 4.95e-04 | 4163.13 ms | 32.4% bf16 MFU | 125791 tok/s step 8859/19560 | loss 3.500098 (+1.67z)| norm 0.2625 (+0.11z)| lr 4.95e-04 | 4163.13 ms | 32.4% bf16 MFU | 125798 tok/s step 8860/19560 | loss 3.503423 (+1.72z)| norm 0.2864 (+1.43z)| lr 4.95e-04 | 4161.58 ms | 32.4% bf16 MFU | 125808 tok/s step 8861/19560 | loss 3.434864 (-0.08z)| norm 0.2650 (+0.27z)| lr 4.95e-04 | 4191.75 ms | 32.2% bf16 MFU | 125771 tok/s step 8862/19560 | loss 3.462224 (+0.63z)| norm 0.2672 (+0.40z)| lr 4.95e-04 | 4164.31 ms | 32.4% bf16 MFU | 125777 tok/s step 8863/19560 | loss 3.414967 (-0.63z)| norm 0.2453 (-0.81z)| lr 4.95e-04 | 4197.42 ms | 32.2% bf16 MFU | 125734 tok/s step 8864/19560 | loss 3.467102 (+0.75z)| norm 0.2650 (+0.29z)| lr 4.95e-04 | 4169.75 ms | 32.4% bf16 MFU | 125734 tok/s step 8865/19560 | loss 3.424743 (-0.40z)| norm 0.2353 (-1.36z)| lr 4.95e-04 | 4237.16 ms | 31.9% bf16 MFU | 125634 tok/s step 8866/19560 | loss 3.394701 (-1.19z)| norm 0.2435 (-0.90z)| lr 4.95e-04 | 4167.26 ms | 32.4% bf16 MFU | 125643 tok/s step 8867/19560 | loss 3.434855 (-0.11z)| norm 0.2265 (-1.81z)| lr 4.95e-04 | 4226.66 ms | 31.9% bf16 MFU | 125563 tok/s step 8868/19560 | loss 3.431411 (-0.20z)| norm 0.2480 (-0.61z)| lr 4.95e-04 | 4157.22 ms | 32.5% bf16 MFU | 125591 tok/s step 8869/19560 | loss 3.428551 (-0.30z)| norm 0.2514 (-0.43z)| lr 4.95e-04 | 4163.76 ms | 32.4% bf16 MFU | 125607 tok/s step 8870/19560 | loss 3.411595 (-0.76z)| norm 0.2327 (-1.44z)| lr 4.95e-04 | 4163.61 ms | 32.4% bf16 MFU | 125623 tok/s step 8871/19560 | loss 3.472492 (+0.92z)| norm 0.2493 (-0.54z)| lr 4.95e-04 | 4170.30 ms | 32.4% bf16 MFU | 125627 tok/s step 8872/19560 | loss 3.422581 (-0.47z)| norm 0.2485 (-0.57z)| lr 4.94e-04 | 4175.44 ms | 32.3% bf16 MFU | 125624 tok/s step 8873/19560 | loss 3.414437 (-0.70z)| norm 0.2671 (+0.45z)| lr 4.94e-04 | 4167.83 ms | 32.4% bf16 MFU | 125633 tok/s step 8874/19560 | 
loss 3.410536 (-0.80z)| norm 0.2284 (-1.69z)| lr 4.94e-04 | 4164.34 ms | 32.4% bf16 MFU | 125646 tok/s step 8875/19560 | loss 3.454700 (+0.45z)| norm 0.2647 (+0.33z)| lr 4.94e-04 | 4174.21 ms | 32.3% bf16 MFU | 125644 tok/s step 8876/19560 | loss 3.445656 (+0.19z)| norm 0.2557 (-0.16z)| lr 4.94e-04 | 4204.57 ms | 32.1% bf16 MFU | 125596 tok/s step 8877/19560 | loss 3.431749 (-0.20z)| norm 0.2458 (-0.70z)| lr 4.94e-04 | 4157.08 ms | 32.5% bf16 MFU | 125623 tok/s step 8878/19560 | loss 3.415743 (-0.65z)| norm 0.2585 (+0.01z)| lr 4.94e-04 | 5053.49 ms | 26.7% bf16 MFU | 124529 tok/s step 8879/19560 | loss 3.384567 (-1.52z)| norm 0.2635 (+0.28z)| lr 4.94e-04 | 4668.46 ms | 28.9% bf16 MFU | 123918 tok/s step 8880/19560 | loss 3.471045 (+0.92z)| norm 0.2716 (+0.74z)| lr 4.94e-04 | 4779.55 ms | 28.2% bf16 MFU | 123206 tok/s step 8881/19560 | loss 3.518699 (+2.20z)| norm 0.2513 (-0.40z)| lr 4.94e-04 | 4525.62 ms | 29.8% bf16 MFU | 122839 tok/s step 8882/19560 | loss 3.402263 (-1.01z)| norm 0.2550 (-0.20z)| lr 4.94e-04 | 4368.53 ms | 30.9% bf16 MFU | 122697 tok/s step 8883/19560 | loss 3.467910 (+0.78z)| norm 0.2406 (-1.01z)| lr 4.94e-04 | 4259.12 ms | 31.7% bf16 MFU | 122717 tok/s step 8884/19560 | loss 3.393154 (-1.32z)| norm 0.2405 (-1.00z)| lr 4.94e-04 | 4437.98 ms | 30.4% bf16 MFU | 122488 tok/s step 8885/19560 | loss 3.427573 (-0.35z)| norm 0.2510 (-0.41z)| lr 4.94e-04 | 4245.77 ms | 31.8% bf16 MFU | 122538 tok/s step 8886/19560 | loss 3.424680 (-0.43z)| norm 0.2445 (-0.77z)| lr 4.94e-04 | 4277.49 ms | 31.6% bf16 MFU | 122540 tok/s step 8887/19560 | loss 3.479456 (+1.10z)| norm 0.2746 (+0.91z)| lr 4.94e-04 | 4376.70 ms | 30.8% bf16 MFU | 122402 tok/s step 8888/19560 | loss 3.513851 (+2.01z)| norm 0.2373 (-1.16z)| lr 4.93e-04 | 4286.51 ms | 31.5% bf16 MFU | 122398 tok/s step 8889/19560 | loss 3.475816 (+0.95z)| norm 0.2769 (+1.04z)| lr 4.93e-04 | 4207.24 ms | 32.1% bf16 MFU | 122509 tok/s step 8890/19560 | loss 3.583814 (+3.69z)| norm 0.2680 (+0.55z)| lr 4.93e-04 | 
4181.93 ms | 32.3% bf16 MFU | 122652 tok/s step 8891/19560 | loss 3.403130 (-1.03z)| norm 0.2677 (+0.54z)| lr 4.93e-04 | 4225.69 ms | 32.0% bf16 MFU | 122723 tok/s step 8892/19560 | loss 3.394086 (-1.25z)| norm 0.2708 (+0.71z)| lr 4.93e-04 | 4211.72 ms | 32.1% bf16 MFU | 122811 tok/s step 8893/19560 | loss 3.420712 (-0.55z)| norm 0.2398 (-1.01z)| lr 4.93e-04 | 4206.75 ms | 32.1% bf16 MFU | 122902 tok/s step 8894/19560 | loss 3.517205 (+1.94z)| norm 0.2806 (+1.27z)| lr 4.93e-04 | 4173.02 ms | 32.4% bf16 MFU | 123038 tok/s step 8895/19560 | loss 3.455510 (+0.35z)| norm 0.2551 (-0.14z)| lr 4.93e-04 | 4215.95 ms | 32.0% bf16 MFU | 123104 tok/s step 8896/19560 | loss 3.409523 (-0.83z)| norm 0.2719 (+0.85z)| lr 4.93e-04 | 4172.64 ms | 32.4% bf16 MFU | 123232 tok/s step 8897/19560 | loss 3.404537 (-0.94z)| norm 0.2380 (-1.13z)| lr 4.93e-04 | 4157.55 ms | 32.5% bf16 MFU | 123375 tok/s step 8898/19560 | loss 3.412264 (-0.76z)| norm 0.2391 (-1.06z)| lr 4.93e-04 | 4261.69 ms | 31.7% bf16 MFU | 123358 tok/s step 8899/19560 | loss 3.456338 (+0.39z)| norm 0.2564 (-0.05z)| lr 4.93e-04 | 4185.54 ms | 32.3% bf16 MFU | 123453 tok/s step 8900/19560 | loss 3.405128 (-0.93z)| norm 0.2574 (+0.02z)| lr 4.93e-04 | 4206.71 ms | 32.1% bf16 MFU | 123512 tok/s step 8901/19560 | loss 3.495553 (+1.39z)| norm 0.2719 (+0.87z)| lr 4.93e-04 | 4173.79 ms | 32.3% bf16 MFU | 123617 tok/s step 8902/19560 | loss 3.565646 (+3.06z)| norm 0.2739 (+0.99z)| lr 4.93e-04 | 4181.23 ms | 32.3% bf16 MFU | 123706 tok/s step 8903/19560 | loss 3.439237 (-0.08z)| norm 0.2452 (-0.69z)| lr 4.92e-04 | 4164.63 ms | 32.4% bf16 MFU | 123815 tok/s step 8904/19560 | loss 3.402068 (-1.00z)| norm 0.2478 (-0.53z)| lr 4.92e-04 | 4163.56 ms | 32.4% bf16 MFU | 123920 tok/s step 8905/19560 | loss 3.575250 (+3.21z)| norm 0.2791 (+1.32z)| lr 4.92e-04 | 4180.08 ms | 32.3% bf16 MFU | 123996 tok/s step 8906/19560 | loss 3.420713 (-0.56z)| norm 0.2911 (+1.98z)| lr 4.92e-04 | 4223.30 ms | 32.0% bf16 MFU | 124003 tok/s step 8907/19560 | 
loss 3.420918 (-0.55z)| norm 0.2671 (+0.58z)| lr 4.92e-04 | 4168.57 ms | 32.4% bf16 MFU | 124091 tok/s step 8908/19560 | loss 3.449028 (+0.14z)| norm 0.2586 (+0.10z)| lr 4.92e-04 | 4174.20 ms | 32.3% bf16 MFU | 124167 tok/s step 8909/19560 | loss 3.382584 (-1.46z)| norm 0.2607 (+0.22z)| lr 4.92e-04 | 4166.23 ms | 32.4% bf16 MFU | 124251 tok/s step 8910/19560 | loss 3.417262 (-0.61z)| norm 0.2381 (-1.10z)| lr 4.92e-04 | 4218.58 ms | 32.0% bf16 MFU | 124252 tok/s step 8911/19560 | loss 3.450561 (+0.19z)| norm 0.2506 (-0.34z)| lr 4.92e-04 | 4208.44 ms | 32.1% bf16 MFU | 124269 tok/s step 8912/19560 | loss 3.420876 (-0.54z)| norm 0.2949 (+2.28z)| lr 4.92e-04 | 4182.05 ms | 32.3% bf16 MFU | 124323 tok/s step 8913/19560 | loss 3.402990 (-0.97z)| norm 0.2453 (-0.66z)| lr 4.92e-04 | 4189.82 ms | 32.2% bf16 MFU | 124364 tok/s step 8914/19560 | loss 3.464017 (+0.52z)| norm 0.2827 (+1.55z)| lr 4.92e-04 | 4174.23 ms | 32.3% bf16 MFU | 124426 tok/s step 8915/19560 | loss 3.411868 (-0.74z)| norm 0.2884 (+1.85z)| lr 4.92e-04 | 4174.84 ms | 32.3% bf16 MFU | 124484 tok/s step 8916/19560 | loss 3.402375 (-0.96z)| norm 0.2421 (-0.86z)| lr 4.92e-04 | 4173.39 ms | 32.4% bf16 MFU | 124541 tok/s step 8917/19560 | loss 3.456011 (+0.37z)| norm 0.2703 (+0.77z)| lr 4.92e-04 | 4165.74 ms | 32.4% bf16 MFU | 124607 tok/s step 8918/19560 | loss 3.425181 (-0.39z)| norm 0.2668 (+0.56z)| lr 4.91e-04 | 4231.58 ms | 31.9% bf16 MFU | 124571 tok/s step 8919/19560 | loss 3.437082 (-0.10z)| norm 0.2589 (+0.09z)| lr 4.91e-04 | 4174.35 ms | 32.3% bf16 MFU | 124623 tok/s step 8920/19560 | loss 3.388593 (-1.28z)| norm 0.2737 (+0.95z)| lr 4.91e-04 | 4182.30 ms | 32.3% bf16 MFU | 124659 tok/s step 8921/19560 | loss 3.474281 (+0.82z)| norm 0.2353 (-1.30z)| lr 4.91e-04 | 4166.20 ms | 32.4% bf16 MFU | 124719 tok/s step 8922/19560 | loss 3.406717 (-0.83z)| norm 0.2647 (+0.42z)| lr 4.91e-04 | 4169.48 ms | 32.4% bf16 MFU | 124770 tok/s step 8923/19560 | loss 3.411921 (-0.71z)| norm 0.2238 (-1.93z)| lr 4.91e-04 | 
4182.70 ms | 32.3% bf16 MFU | 124799 tok/s step 8924/19560 | loss 3.461237 (+0.50z)| norm 0.2429 (-0.82z)| lr 4.91e-04 | 4171.88 ms | 32.4% bf16 MFU | 124842 tok/s step 8925/19560 | loss 3.411279 (-0.72z)| norm 0.2544 (-0.16z)| lr 4.91e-04 | 4179.90 ms | 32.3% bf16 MFU | 124872 tok/s step 8926/19560 | loss 3.439528 (-0.02z)| norm 0.2412 (-0.91z)| lr 4.91e-04 | 4175.65 ms | 32.3% bf16 MFU | 124906 tok/s step 8927/19560 | loss 3.390263 (-1.22z)| norm 0.2458 (-0.65z)| lr 4.91e-04 | 4175.46 ms | 32.3% bf16 MFU | 124939 tok/s step 8928/19560 | loss 3.451897 (+0.29z)| norm 0.2573 (+0.00z)| lr 4.91e-04 | 4166.89 ms | 32.4% bf16 MFU | 124983 tok/s step 8929/19560 | loss 3.443242 (+0.09z)| norm 0.2891 (+1.81z)| lr 4.91e-04 | 4173.50 ms | 32.4% bf16 MFU | 125015 tok/s step 8930/19560 | loss 3.454431 (+0.36z)| norm 0.2617 (+0.24z)| lr 4.91e-04 | 4223.96 ms | 32.0% bf16 MFU | 124970 tok/s step 8931/19560 | loss 3.436062 (-0.09z)| norm 0.2603 (+0.16z)| lr 4.91e-04 | 4179.80 ms | 32.3% bf16 MFU | 124994 tok/s step 8932/19560 | loss 3.395596 (-1.10z)| norm 0.2595 (+0.11z)| lr 4.91e-04 | 4170.78 ms | 32.4% bf16 MFU | 125029 tok/s step 8933/19560 | loss 3.442572 (+0.07z)| norm 0.2500 (-0.43z)| lr 4.90e-04 | 4169.72 ms | 32.4% bf16 MFU | 125065 tok/s step 8934/19560 | loss 3.392850 (-1.15z)| norm 0.2393 (-1.05z)| lr 4.90e-04 | 4204.51 ms | 32.1% bf16 MFU | 125046 tok/s step 8935/19560 | loss 3.471971 (+0.79z)| norm 0.2453 (-0.68z)| lr 4.90e-04 | 4201.52 ms | 32.1% bf16 MFU | 125033 tok/s step 8936/19560 | loss 3.425512 (-0.35z)| norm 0.2858 (+1.68z)| lr 4.90e-04 | 4178.92 ms | 32.3% bf16 MFU | 125054 tok/s step 8937/19560 | loss 3.419239 (-0.50z)| norm 0.2911 (+1.95z)| lr 4.90e-04 | 4173.91 ms | 32.3% bf16 MFU | 125082 tok/s step 8938/19560 | loss 3.419054 (-0.51z)| norm 0.2515 (-0.33z)| lr 4.90e-04 | 4179.85 ms | 32.3% bf16 MFU | 125100 tok/s step 8939/19560 | loss 3.396583 (-1.05z)| norm 0.2497 (-0.42z)| lr 4.90e-04 | 4189.65 ms | 32.2% bf16 MFU | 125102 tok/s step 8940/19560 | 
loss 3.532050 (+2.23z)| norm 0.2942 (+2.13z)| lr 4.90e-04 | 4183.76 ms | 32.3% bf16 MFU | 125112 tok/s step 8941/19560 | loss 3.491540 (+1.23z)| norm 0.2538 (-0.19z)| lr 4.90e-04 | 4168.03 ms | 32.4% bf16 MFU | 125146 tok/s step 8942/19560 | loss 3.426647 (-0.32z)| norm 0.2419 (-0.86z)| lr 4.90e-04 | 4170.02 ms | 32.4% bf16 MFU | 125175 tok/s step 8943/19560 | loss 3.426034 (-0.34z)| norm 0.2496 (-0.41z)| lr 4.90e-04 | 4183.07 ms | 32.3% bf16 MFU | 125183 tok/s step 8944/19560 | loss 3.410233 (-0.71z)| norm 0.2567 (-0.00z)| lr 4.90e-04 | 4182.73 ms | 32.3% bf16 MFU | 125191 tok/s step 8945/19560 | loss 3.452049 (+0.28z)| norm 0.2593 (+0.14z)| lr 4.90e-04 | 4169.04 ms | 32.4% bf16 MFU | 125220 tok/s step 8946/19560 | loss 3.408238 (-0.79z)| norm 0.2656 (+0.49z)| lr 4.90e-04 | 4175.03 ms | 32.3% bf16 MFU | 125238 tok/s step 8947/19560 | loss 3.441087 (+0.02z)| norm 0.2820 (+1.41z)| lr 4.90e-04 | 4177.84 ms | 32.3% bf16 MFU | 125250 tok/s step 8948/19560 | loss 3.440348 (+0.00z)| norm 0.2583 (+0.07z)| lr 4.89e-04 | 4165.14 ms | 32.4% bf16 MFU | 125282 tok/s step 8949/19560 | loss 3.579628 (+3.26z)| norm 0.2816 (+1.37z)| lr 4.89e-04 | 4188.13 ms | 32.2% bf16 MFU | 125277 tok/s step 8950/19560 | loss 3.421194 (-0.46z)| norm 0.2522 (-0.29z)| lr 4.89e-04 | 4177.75 ms | 32.3% bf16 MFU | 125288 tok/s step 8951/19560 | loss 3.433563 (-0.17z)| norm 0.2700 (+0.71z)| lr 4.89e-04 | 4173.38 ms | 32.4% bf16 MFU | 125305 tok/s step 8952/19560 | loss 3.456970 (+0.38z)| norm 0.2604 (+0.15z)| lr 4.89e-04 | 4190.46 ms | 32.2% bf16 MFU | 125295 tok/s step 8953/19560 | loss 3.466315 (+0.58z)| norm 0.2347 (-1.31z)| lr 4.89e-04 | 4182.76 ms | 32.3% bf16 MFU | 125298 tok/s step 8954/19560 | loss 3.447842 (+0.14z)| norm 0.2549 (-0.17z)| lr 4.89e-04 | 4165.64 ms | 32.4% bf16 MFU | 125326 tok/s step 8955/19560 | loss 3.412870 (-0.68z)| norm 0.2225 (-1.98z)| lr 4.89e-04 | 4181.89 ms | 32.3% bf16 MFU | 125328 tok/s step 8956/19560 | loss 3.376244 (-1.53z)| norm 0.2407 (-0.95z)| lr 4.89e-04 | 
4170.06 ms | 32.4% bf16 MFU | 125348 tok/s step 8957/19560 | loss 3.398811 (-0.99z)| norm 0.2342 (-1.29z)| lr 4.89e-04 | 4165.94 ms | 32.4% bf16 MFU | 125373 tok/s step 8958/19560 | loss 3.477895 (+0.86z)| norm 0.2571 (-0.02z)| lr 4.89e-04 | 4167.59 ms | 32.4% bf16 MFU | 125394 tok/s step 8959/19560 | loss 3.424730 (-0.40z)| norm 0.2446 (-0.73z)| lr 4.89e-04 | 4176.86 ms | 32.3% bf16 MFU | 125401 tok/s step 8960/19560 | loss 3.461519 (+0.46z)| norm 0.2343 (-1.30z)| lr 4.89e-04 | 4165.67 ms | 32.4% bf16 MFU | 125424 tok/s step 8961/19560 | loss 3.586011 (+3.27z)| norm 0.2342 (-1.31z)| lr 4.89e-04 | 4165.13 ms | 32.4% bf16 MFU | 125446 tok/s step 8962/19560 | loss 3.415821 (-0.61z)| norm 0.2611 (+0.20z)| lr 4.89e-04 | 4175.48 ms | 32.3% bf16 MFU | 125452 tok/s step 8963/19560 | loss 3.415752 (-0.60z)| norm 0.2578 (+0.01z)| lr 4.88e-04 | 4181.65 ms | 32.3% bf16 MFU | 125449 tok/s step 8964/19560 | loss 3.391107 (-1.15z)| norm 0.2429 (-0.84z)| lr 4.88e-04 | 4174.47 ms | 32.3% bf16 MFU | 125456 tok/s step 8965/19560 | loss 3.450886 (+0.21z)| norm 0.3012 (+2.39z)| lr 4.88e-04 | 4171.51 ms | 32.4% bf16 MFU | 125467 tok/s step 8966/19560 | loss 3.535283 (+2.09z)| norm 0.2533 (-0.26z)| lr 4.88e-04 | 4180.30 ms | 32.3% bf16 MFU | 125465 tok/s step 8967/19560 | loss 3.429549 (-0.27z)| norm 0.2851 (+1.49z)| lr 4.88e-04 | 4176.64 ms | 32.3% bf16 MFU | 125468 tok/s step 8968/19560 | loss 3.484138 (+0.94z)| norm 0.2581 (-0.02z)| lr 4.88e-04 | 4175.05 ms | 32.3% bf16 MFU | 125473 tok/s step 8969/19560 | loss 3.443014 (+0.03z)| norm 0.3412 (+4.23z)| lr 4.88e-04 | 4174.14 ms | 32.3% bf16 MFU | 125480 tok/s step 8970/19560 | loss 3.395194 (-1.03z)| norm 0.2675 (+0.43z)| lr 4.88e-04 | 4179.18 ms | 32.3% bf16 MFU | 125478 tok/s step 8971/19560 | loss 3.393744 (-1.07z)| norm 0.2582 (-0.06z)| lr 4.88e-04 | 4179.97 ms | 32.3% bf16 MFU | 125476 tok/s step 8972/19560 | loss 3.460749 (+0.42z)| norm 0.2764 (+0.87z)| lr 4.88e-04 | 4181.78 ms | 32.3% bf16 MFU | 125471 tok/s step 8973/19560 | 
loss 3.477012 (+0.78z)| norm 0.2552 (-0.23z)| lr 4.88e-04 | 4164.95 ms | 32.4% bf16 MFU | 125491 tok/s step 8974/19560 | loss 3.486032 (+0.97z)| norm 0.2758 (+0.82z)| lr 4.88e-04 | 4196.56 ms | 32.2% bf16 MFU | 125463 tok/s step 8975/19560 | loss 3.439052 (-0.09z)| norm 0.2451 (-0.77z)| lr 4.88e-04 | 4182.87 ms | 32.3% bf16 MFU | 125457 tok/s step 8976/19560 | loss 3.381078 (-1.37z)| norm 0.2344 (-1.31z)| lr 4.88e-04 | 4171.12 ms | 32.4% bf16 MFU | 125469 tok/s step 8977/19560 | loss 3.419787 (-0.51z)| norm 0.2600 (+0.02z)| lr 4.88e-04 | 4180.15 ms | 32.3% bf16 MFU | 125467 tok/s step 8978/19560 | loss 3.406644 (-0.81z)| norm 0.2556 (-0.19z)| lr 4.88e-04 | 4236.52 ms | 31.9% bf16 MFU | 125381 tok/s step 8979/19560 | loss 3.483915 (+0.91z)| norm 0.2232 (-1.89z)| lr 4.87e-04 | 4175.72 ms | 32.3% bf16 MFU | 125390 tok/s step 8980/19560 | loss 3.433343 (-0.22z)| norm 0.2655 (+0.36z)| lr 4.87e-04 | 4167.31 ms | 32.4% bf16 MFU | 125411 tok/s step 8981/19560 | loss 3.416309 (-0.59z)| norm 0.2394 (-1.02z)| lr 4.87e-04 | 4172.87 ms | 32.4% bf16 MFU | 125423 tok/s step 8982/19560 | loss 3.464540 (+0.52z)| norm 0.2712 (+0.70z)| lr 4.87e-04 | 4214.86 ms | 32.0% bf16 MFU | 125371 tok/s step 8983/19560 | loss 3.514253 (+1.65z)| norm 0.2905 (+1.72z)| lr 4.87e-04 | 4168.20 ms | 32.4% bf16 MFU | 125392 tok/s step 8984/19560 | loss 3.462222 (+0.45z)| norm 0.2863 (+1.47z)| lr 4.87e-04 | 4183.50 ms | 32.3% bf16 MFU | 125388 tok/s step 8985/19560 | loss 3.406200 (-0.82z)| norm 0.2574 (-0.06z)| lr 4.87e-04 | 4178.92 ms | 32.3% bf16 MFU | 125392 tok/s step 8986/19560 | loss 3.495868 (+1.22z)| norm 0.2560 (-0.12z)| lr 4.87e-04 | 4175.90 ms | 32.3% bf16 MFU | 125400 tok/s step 8987/19560 | loss 3.370450 (-1.61z)| norm 0.2751 (+0.90z)| lr 4.87e-04 | 4170.60 ms | 32.4% bf16 MFU | 125415 tok/s step 8988/19560 | loss 3.525725 (+1.90z)| norm 0.2538 (-0.23z)| lr 4.87e-04 | 4182.31 ms | 32.3% bf16 MFU | 125412 tok/s step 8989/19560 | loss 3.473488 (+0.71z)| norm 0.6909 (+10.15z)| lr 4.87e-04 | 
4169.66 ms | 32.4% bf16 MFU | 125429 tok/s step 8990/19560 | loss 3.453013 (+0.25z)| norm 0.3734 (+2.56z)| lr 4.87e-04 | 4172.59 ms | 32.4% bf16 MFU | 125440 tok/s step 8991/19560 | loss 3.428776 (-0.30z)| norm 0.3141 (+1.17z)| lr 4.87e-04 | 4183.87 ms | 32.3% bf16 MFU | 125433 tok/s step 8992/19560 | loss 3.477380 (+0.79z)| norm 0.3013 (+0.87z)| lr 4.87e-04 | 4187.33 ms | 32.2% bf16 MFU | 125422 tok/s step 8993/19560 | loss 3.491651 (+1.10z)| norm 0.2684 (+0.12z)| lr 4.87e-04 | 4174.75 ms | 32.3% bf16 MFU | 125430 tok/s step 8994/19560 | loss 3.425563 (-0.39z)| norm 0.2907 (+0.62z)| lr 4.86e-04 | 4205.01 ms | 32.1% bf16 MFU | 125393 tok/s step 8995/19560 | loss 3.398550 (-0.99z)| norm 0.2606 (-0.08z)| lr 4.86e-04 | 4168.50 ms | 32.4% bf16 MFU | 125412 tok/s step 8996/19560 | loss 3.439322 (-0.07z)| norm 0.2889 (+0.56z)| lr 4.86e-04 | 4186.39 ms | 32.3% bf16 MFU | 125403 tok/s step 8997/19560 | loss 3.456870 (+0.31z)| norm 0.2572 (-0.16z)| lr 4.86e-04 | 4177.62 ms | 32.3% bf16 MFU | 125408 tok/s step 8998/19560 | loss 3.396847 (-1.03z)| norm 0.2708 (+0.14z)| lr 4.86e-04 | 4172.25 ms | 32.4% bf16 MFU | 125421 tok/s step 8999/19560 | loss 3.460069 (+0.39z)| norm 0.2691 (+0.10z)| lr 4.86e-04 | 4182.55 ms | 32.3% bf16 MFU | 125417 tok/s step 9000/19560 | loss 3.445046 (+0.05z)| norm 0.2636 (-0.03z)| lr 4.86e-04 | 4177.59 ms | 32.3% bf16 MFU | 125421 tok/s val loss 3.408902 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 
190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 
evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2914/10042 = 0.290181 step 9001/19560 | loss 3.445759 (+0.06z)| norm 0.2533 (-0.26z)| lr 4.86e-04 | 4197.77 ms | 32.2% bf16 MFU | 125395 tok/s step 9002/19560 | loss 3.435377 (-0.18z)| norm 0.2880 (+0.52z)| lr 4.86e-04 | 4193.01 ms | 32.2% bf16 MFU | 125377 tok/s step 9003/19560 | loss 3.410630 (-0.73z)| norm 0.2561 (-0.21z)| lr 4.86e-04 | 4186.48 ms | 32.3% bf16 MFU | 125370 tok/s step 9004/19560 | loss 3.416371 (-0.59z)| norm 0.2477 (-0.40z)| lr 4.86e-04 | 4387.78 ms | 30.8% bf16 MFU | 125076 tok/s step 9005/19560 | loss 3.398658 (-0.98z)| norm 0.2573 (-0.18z)| lr 4.86e-04 | 4196.41 ms | 32.2% bf16 MFU | 125069 tok/s step 9006/19560 | loss 3.389651 
(-1.17z)| norm 0.2478 (-0.40z)| lr 4.86e-04 | 4188.12 ms | 32.2% bf16 MFU | 125075 tok/s step 9007/19560 | loss 3.437559 (-0.11z)| norm 0.2670 (+0.04z)| lr 4.86e-04 | 4167.97 ms | 32.4% bf16 MFU | 125111 tok/s step 9008/19560 | loss 3.445796 (+0.08z)| norm 0.2687 (+0.08z)| lr 4.86e-04 | 4180.27 ms | 32.3% bf16 MFU | 125126 tok/s step 9009/19560 | loss 3.444528 (+0.06z)| norm 0.2465 (-0.43z)| lr 4.85e-04 | 4197.69 ms | 32.2% bf16 MFU | 125115 tok/s step 9010/19560 | loss 3.381787 (-1.36z)| norm 0.4080 (+3.13z)| lr 4.85e-04 | 4170.11 ms | 32.4% bf16 MFU | 125145 tok/s step 9011/19560 | loss 3.476774 (+0.79z)| norm 0.2838 (+0.38z)| lr 4.85e-04 | 4167.68 ms | 32.4% bf16 MFU | 125178 tok/s step 9012/19560 | loss 3.463972 (+0.49z)| norm 0.2783 (+0.25z)| lr 4.85e-04 | 4184.07 ms | 32.3% bf16 MFU | 125184 tok/s step 9013/19560 | loss 3.575778 (+2.90z)| norm 0.3076 (+0.89z)| lr 4.85e-04 | 4158.09 ms | 32.5% bf16 MFU | 125229 tok/s step 9014/19560 | loss 3.385256 (-1.26z)| norm 0.2770 (+0.21z)| lr 4.85e-04 | 4172.36 ms | 32.4% bf16 MFU | 125251 tok/s step 9015/19560 | loss 3.457835 (+0.32z)| norm 0.2920 (+0.54z)| lr 4.85e-04 | 4175.13 ms | 32.3% bf16 MFU | 125267 tok/s step 9016/19560 | loss 3.442216 (-0.01z)| norm 0.2747 (+0.15z)| lr 4.85e-04 | 4282.48 ms | 31.5% bf16 MFU | 125125 tok/s step 9017/19560 | loss 3.448306 (+0.13z)| norm 0.2766 (+0.19z)| lr 4.85e-04 | 4163.36 ms | 32.4% bf16 MFU | 125165 tok/s step 9018/19560 | loss 3.519237 (+1.76z)| norm 0.2521 (-0.35z)| lr 4.85e-04 | 4171.75 ms | 32.4% bf16 MFU | 125191 tok/s step 9019/19560 | loss 3.444943 (+0.07z)| norm 0.2678 (-0.00z)| lr 4.85e-04 | 4174.43 ms | 32.3% bf16 MFU | 125211 tok/s step 9020/19560 | loss 3.569896 (+2.81z)| norm 0.2511 (-0.37z)| lr 4.85e-04 | 4173.80 ms | 32.3% bf16 MFU | 125231 tok/s step 9021/19560 | loss 3.441505 (-0.05z)| norm 0.2714 (+0.07z)| lr 4.85e-04 | 4181.87 ms | 32.3% bf16 MFU | 125238 tok/s step 9022/19560 | loss 3.430726 (-0.27z)| norm 0.2494 (-0.41z)| lr 4.85e-04 | 4181.68 ms | 
32.3% bf16 MFU | 125245 tok/s step 9023/19560 | loss 3.405645 (-0.83z)| norm 0.2442 (-0.52z)| lr 4.85e-04 | 4174.30 ms | 32.3% bf16 MFU | 125263 tok/s step 9024/19560 | loss 3.497774 (+1.22z)| norm 0.2541 (-0.30z)| lr 4.84e-04 | 4164.72 ms | 32.4% bf16 MFU | 125294 tok/s step 9025/19560 | loss 3.426729 (-0.37z)| norm 0.2458 (-0.48z)| lr 4.84e-04 | 4180.35 ms | 32.3% bf16 MFU | 125300 tok/s step 9026/19560 | loss 3.461516 (+0.40z)| norm 0.2580 (-0.22z)| lr 4.84e-04 | 4173.96 ms | 32.3% bf16 MFU | 125316 tok/s step 9027/19560 | loss 3.419619 (-0.53z)| norm 0.2695 (+0.04z)| lr 4.84e-04 | 4165.53 ms | 32.4% bf16 MFU | 125343 tok/s step 9028/19560 | loss 3.396961 (-1.04z)| norm 0.2639 (-0.09z)| lr 4.84e-04 | 4171.63 ms | 32.4% bf16 MFU | 125360 tok/s step 9029/19560 | loss 3.367109 (-1.67z)| norm 0.2488 (-0.42z)| lr 4.84e-04 | 4168.88 ms | 32.4% bf16 MFU | 125380 tok/s step 9030/19560 | loss 3.354863 (-1.94z)| norm 0.2693 (+0.04z)| lr 4.84e-04 | 4179.46 ms | 32.3% bf16 MFU | 125383 tok/s step 9031/19560 | loss 3.548475 (+2.36z)| norm 0.2404 (-0.60z)| lr 4.84e-04 | 4166.32 ms | 32.4% bf16 MFU | 125406 tok/s step 9032/19560 | loss 3.448641 (+0.15z)| norm 0.2598 (-0.18z)| lr 4.84e-04 | 4164.05 ms | 32.4% bf16 MFU | 125431 tok/s step 9033/19560 | loss 3.463372 (+0.51z)| norm 0.2384 (-0.64z)| lr 4.84e-04 | 4167.19 ms | 32.4% bf16 MFU | 125450 tok/s step 9034/19560 | loss 3.452765 (+0.26z)| norm 0.2443 (-0.50z)| lr 4.84e-04 | 4177.72 ms | 32.3% bf16 MFU | 125453 tok/s step 9035/19560 | loss 3.448870 (+0.17z)| norm 0.2451 (-0.48z)| lr 4.84e-04 | 4172.40 ms | 32.4% bf16 MFU | 125463 tok/s step 9036/19560 | loss 3.472993 (+0.71z)| norm 0.2424 (-0.54z)| lr 4.84e-04 | 4174.67 ms | 32.3% bf16 MFU | 125469 tok/s step 9037/19560 | loss 3.416376 (-0.59z)| norm 0.2802 (+0.29z)| lr 4.84e-04 | 4170.61 ms | 32.4% bf16 MFU | 125481 tok/s step 9038/19560 | loss 3.457665 (+0.35z)| norm 0.2486 (-0.41z)| lr 4.84e-04 | 4196.36 ms | 32.2% bf16 MFU | 125454 tok/s step 9039/19560 | loss 3.428666 
(-0.31z)| norm 0.2416 (-0.56z)| lr 4.83e-04 | 4179.58 ms | 32.3% bf16 MFU | 125453 tok/s step 9040/19560 | loss 3.444192 (+0.04z)| norm 0.2372 (-0.65z)| lr 4.83e-04 | 4180.22 ms | 32.3% bf16 MFU | 125452 tok/s step 9041/19560 | loss 3.423790 (-0.43z)| norm 0.2489 (-0.39z)| lr 4.83e-04 | 4174.31 ms | 32.3% bf16 MFU | 125459 tok/s step 9042/19560 | loss 3.425372 (-0.39z)| norm 0.2387 (-0.60z)| lr 4.83e-04 | 4164.92 ms | 32.4% bf16 MFU | 125480 tok/s step 9043/19560 | loss 3.402360 (-0.92z)| norm 0.2490 (-0.37z)| lr 4.83e-04 | 4172.65 ms | 32.4% bf16 MFU | 125489 tok/s step 9044/19560 | loss 3.389032 (-1.22z)| norm 0.2337 (-0.71z)| lr 4.83e-04 | 4164.29 ms | 32.4% bf16 MFU | 125509 tok/s step 9045/19560 | loss 3.416317 (-0.59z)| norm 0.2468 (-0.42z)| lr 4.83e-04 | 4168.62 ms | 32.4% bf16 MFU | 125522 tok/s step 9046/19560 | loss 3.423339 (-0.42z)| norm 0.2395 (-0.57z)| lr 4.83e-04 | 4177.14 ms | 32.3% bf16 MFU | 125522 tok/s step 9047/19560 | loss 3.438651 (-0.07z)| norm 0.2423 (-0.50z)| lr 4.83e-04 | 4176.54 ms | 32.3% bf16 MFU | 125522 tok/s step 9048/19560 | loss 3.408451 (-0.77z)| norm 0.2444 (-0.45z)| lr 4.83e-04 | 4162.04 ms | 32.4% bf16 MFU | 125545 tok/s step 9049/19560 | loss 3.376420 (-1.48z)| norm 0.2463 (-0.41z)| lr 4.83e-04 | 4169.99 ms | 32.4% bf16 MFU | 125554 tok/s step 9050/19560 | loss 3.432564 (-0.20z)| norm 0.2545 (-0.23z)| lr 4.83e-04 | 4180.65 ms | 32.3% bf16 MFU | 125547 tok/s step 9051/19560 | loss 3.392745 (-1.11z)| norm 0.2432 (-0.49z)| lr 4.83e-04 | 4170.83 ms | 32.4% bf16 MFU | 125554 tok/s step 9052/19560 | loss 3.451230 (+0.23z)| norm 0.2405 (-0.54z)| lr 4.83e-04 | 4174.88 ms | 32.3% bf16 MFU | 125556 tok/s step 9053/19560 | loss 3.423007 (-0.42z)| norm 0.2547 (-0.23z)| lr 4.83e-04 | 4258.70 ms | 31.7% bf16 MFU | 125433 tok/s step 9054/19560 | loss 3.396926 (-1.00z)| norm 0.2716 (+0.14z)| lr 4.82e-04 | 4176.10 ms | 32.3% bf16 MFU | 125439 tok/s step 9055/19560 | loss 3.439280 (-0.05z)| norm 0.2319 (-0.74z)| lr 4.82e-04 | 4170.36 ms | 
32.4% bf16 MFU | 125453 tok/s step 9056/19560 | loss 3.493077 (+1.18z)| norm 0.2549 (-0.23z)| lr 4.82e-04 | 4162.36 ms | 32.4% bf16 MFU | 125478 tok/s step 9057/19560 | loss 3.493159 (+1.16z)| norm 0.2273 (-0.83z)| lr 4.82e-04 | 4211.64 ms | 32.1% bf16 MFU | 125429 tok/s step 9058/19560 | loss 3.410539 (-0.71z)| norm 0.2493 (-0.34z)| lr 4.82e-04 | 4164.27 ms | 32.4% bf16 MFU | 125452 tok/s step 9059/19560 | loss 3.425296 (-0.37z)| norm 0.2516 (-0.29z)| lr 4.82e-04 | 4180.15 ms | 32.3% bf16 MFU | 125451 tok/s step 9060/19560 | loss 3.450294 (+0.19z)| norm 0.2433 (-0.47z)| lr 4.82e-04 | 4167.20 ms | 32.4% bf16 MFU | 125469 tok/s step 9061/19560 | loss 3.474378 (+0.73z)| norm 0.2720 (+0.16z)| lr 4.82e-04 | 4167.63 ms | 32.4% bf16 MFU | 125485 tok/s step 9062/19560 | loss 3.495635 (+1.20z)| norm 0.2344 (-0.66z)| lr 4.82e-04 | 4153.59 ms | 32.5% bf16 MFU | 125522 tok/s step 9063/19560 | loss 3.436805 (-0.13z)| norm 0.2546 (-0.22z)| lr 4.82e-04 | 4254.11 ms | 31.7% bf16 MFU | 125408 tok/s step 9064/19560 | loss 3.302232 (-3.06z)| norm 0.2809 (+0.36z)| lr 4.82e-04 | 4171.92 ms | 32.4% bf16 MFU | 125422 tok/s step 9065/19560 | loss 3.449350 (+0.16z)| norm 0.2425 (-0.48z)| lr 4.82e-04 | 4161.86 ms | 32.4% bf16 MFU | 125449 tok/s step 9066/19560 | loss 3.421306 (-0.45z)| norm 0.3092 (+0.97z)| lr 4.82e-04 | 4186.68 ms | 32.2% bf16 MFU | 125438 tok/s step 9067/19560 | loss 3.369310 (-1.58z)| norm 0.2561 (-0.19z)| lr 4.82e-04 | 4179.45 ms | 32.3% bf16 MFU | 125438 tok/s step 9068/19560 | loss 3.527207 (+1.87z)| norm 0.2687 (+0.09z)| lr 4.82e-04 | 4380.83 ms | 30.8% bf16 MFU | 125150 tok/s step 9069/19560 | loss 3.468963 (+0.60z)| norm 0.2799 (+0.33z)| lr 4.81e-04 | 4897.75 ms | 27.6% bf16 MFU | 124245 tok/s step 9070/19560 | loss 3.412360 (-0.64z)| norm 0.2462 (-0.41z)| lr 4.81e-04 | 4469.65 ms | 30.2% bf16 MFU | 123898 tok/s step 9071/19560 | loss 3.469568 (+0.61z)| norm 0.2595 (-0.12z)| lr 4.81e-04 | 4784.91 ms | 28.2% bf16 MFU | 123182 tok/s step 9072/19560 | loss 3.422195 
(-0.43z)| norm 0.2379 (-0.59z)| lr 4.81e-04 | 4163.26 ms | 32.4% bf16 MFU | 123319 tok/s step 9073/19560 | loss 3.494583 (+1.14z)| norm 0.2812 (+0.36z)| lr 4.81e-04 | 4243.34 ms | 31.8% bf16 MFU | 123331 tok/s step 9074/19560 | loss 3.436912 (-0.12z)| norm 0.2283 (-0.79z)| lr 4.81e-04 | 4388.00 ms | 30.8% bf16 MFU | 123139 tok/s step 9075/19560 | loss 3.447328 (+0.10z)| norm 0.2609 (-0.08z)| lr 4.81e-04 | 4329.03 ms | 31.2% bf16 MFU | 123037 tok/s step 9076/19560 | loss 3.396534 (-0.99z)| norm 0.2707 (+0.13z)| lr 4.81e-04 | 4258.20 ms | 31.7% bf16 MFU | 123041 tok/s step 9077/19560 | loss 3.450122 (+0.20z)| norm 0.2823 (+0.39z)| lr 4.81e-04 | 4247.43 ms | 31.8% bf16 MFU | 123061 tok/s step 9078/19560 | loss 3.436069 (-0.12z)| norm 0.2788 (+0.31z)| lr 4.81e-04 | 4265.02 ms | 31.7% bf16 MFU | 123054 tok/s step 9079/19560 | loss 3.455582 (+0.32z)| norm 0.2496 (-0.33z)| lr 4.81e-04 | 4179.45 ms | 32.3% bf16 MFU | 123174 tok/s step 9080/19560 | loss 3.538084 (+2.13z)| norm 0.2933 (+0.62z)| lr 4.81e-04 | 4351.49 ms | 31.0% bf16 MFU | 123040 tok/s step 9081/19560 | loss 3.433305 (-0.19z)| norm 0.2520 (-0.28z)| lr 4.81e-04 | 4223.28 ms | 32.0% bf16 MFU | 123095 tok/s step 9082/19560 | loss 3.456534 (+0.32z)| norm 0.2585 (-0.14z)| lr 4.81e-04 | 4186.62 ms | 32.2% bf16 MFU | 123201 tok/s step 9083/19560 | loss 3.463425 (+0.47z)| norm 0.2855 (+0.44z)| lr 4.81e-04 | 4248.79 ms | 31.8% bf16 MFU | 123211 tok/s step 9084/19560 | loss 3.438950 (-0.09z)| norm 0.2288 (-0.80z)| lr 4.80e-04 | 4301.47 ms | 31.4% bf16 MFU | 123145 tok/s step 9085/19560 | loss 3.470151 (+0.60z)| norm 0.2795 (+0.30z)| lr 4.80e-04 | 4220.69 ms | 32.0% bf16 MFU | 123199 tok/s step 9086/19560 | loss 3.451411 (+0.19z)| norm 0.2302 (-0.77z)| lr 4.80e-04 | 4207.85 ms | 32.1% bf16 MFU | 123269 tok/s step 9087/19560 | loss 3.445898 (+0.06z)| norm 0.4399 (+3.59z)| lr 4.80e-04 | 4258.54 ms | 31.7% bf16 MFU | 123261 tok/s step 9088/19560 | loss 3.452919 (+0.22z)| norm 0.2683 (+0.02z)| lr 4.80e-04 | 4255.56 ms | 
31.7% bf16 MFU | 123258 tok/s step 9089/19560 | loss 3.460440 (+0.43z)| norm 0.2715 (+0.08z)| lr 4.80e-04 | 4196.73 ms | 32.2% bf16 MFU | 123341 tok/s step 9090/19560 | loss 3.449614 (+0.17z)| norm 0.2725 (+0.10z)| lr 4.80e-04 | 4162.78 ms | 32.4% bf16 MFU | 123472 tok/s step 9091/19560 | loss 3.499435 (+1.32z)| norm 0.2673 (-0.01z)| lr 4.80e-04 | 4253.74 ms | 31.7% bf16 MFU | 123461 tok/s step 9092/19560 | loss 3.429786 (-0.32z)| norm 0.2712 (+0.07z)| lr 4.80e-04 | 4180.28 ms | 32.3% bf16 MFU | 123559 tok/s step 9093/19560 | loss 3.370614 (-1.68z)| norm 0.2588 (-0.19z)| lr 4.80e-04 | 4204.30 ms | 32.1% bf16 MFU | 123616 tok/s step 9094/19560 | loss 3.490136 (+1.12z)| norm 0.2650 (-0.06z)| lr 4.80e-04 | 4184.16 ms | 32.3% bf16 MFU | 123700 tok/s step 9095/19560 | loss 3.385095 (-1.34z)| norm 0.2482 (-0.40z)| lr 4.80e-04 | 4171.48 ms | 32.4% bf16 MFU | 123799 tok/s step 9096/19560 | loss 3.450095 (+0.19z)| norm 0.2573 (-0.21z)| lr 4.80e-04 | 4180.61 ms | 32.3% bf16 MFU | 123880 tok/s step 9097/19560 | loss 3.437204 (-0.11z)| norm 0.2526 (-0.30z)| lr 4.80e-04 | 4204.53 ms | 32.1% bf16 MFU | 123921 tok/s step 9098/19560 | loss 3.433576 (-0.20z)| norm 0.2605 (-0.13z)| lr 4.80e-04 | 4166.51 ms | 32.4% bf16 MFU | 124016 tok/s step 9099/19560 | loss 3.480479 (+0.89z)| norm 0.2498 (-0.36z)| lr 4.79e-04 | 4198.26 ms | 32.2% bf16 MFU | 124060 tok/s step 9100/19560 | loss 3.517803 (+1.74z)| norm 0.2746 (+0.17z)| lr 4.79e-04 | 4181.88 ms | 32.3% bf16 MFU | 124125 tok/s step 9101/19560 | loss 3.480677 (+0.87z)| norm 0.2604 (-0.13z)| lr 4.79e-04 | 4178.33 ms | 32.3% bf16 MFU | 124193 tok/s step 9102/19560 | loss 3.444487 (+0.04z)| norm 0.2660 (-0.01z)| lr 4.79e-04 | 4174.30 ms | 32.3% bf16 MFU | 124263 tok/s step 9103/19560 | loss 3.465544 (+0.52z)| norm 0.2936 (+0.56z)| lr 4.79e-04 | 4180.12 ms | 32.3% bf16 MFU | 124321 tok/s step 9104/19560 | loss 3.433337 (-0.24z)| norm 0.2750 (+0.16z)| lr 4.79e-04 | 4176.39 ms | 32.3% bf16 MFU | 124382 tok/s step 9105/19560 | loss 3.456653 
(+0.30z)| norm 0.2609 (-0.14z)| lr 4.79e-04 | 4183.17 ms | 32.3% bf16 MFU | 124429 tok/s step 9106/19560 | loss 3.442609 (-0.04z)| norm 0.2833 (+0.33z)| lr 4.79e-04 | 4169.53 ms | 32.4% bf16 MFU | 124495 tok/s step 9107/19560 | loss 3.383636 (-1.41z)| norm 0.2725 (+0.10z)| lr 4.79e-04 | 4228.44 ms | 31.9% bf16 MFU | 124470 tok/s step 9108/19560 | loss 3.471512 (+0.66z)| norm 0.2743 (+0.13z)| lr 4.79e-04 | 4170.84 ms | 32.4% bf16 MFU | 124532 tok/s step 9109/19560 | loss 3.549005 (+2.41z)| norm 0.2731 (+0.10z)| lr 4.79e-04 | 4180.44 ms | 32.3% bf16 MFU | 124576 tok/s step 9110/19560 | loss 3.432739 (-0.27z)| norm 0.3075 (+0.82z)| lr 4.79e-04 | 4188.60 ms | 32.2% bf16 MFU | 124605 tok/s step 9111/19560 | loss 3.418080 (-0.60z)| norm 0.2430 (-0.53z)| lr 4.79e-04 | 4173.19 ms | 32.4% bf16 MFU | 124657 tok/s step 9112/19560 | loss 3.453548 (+0.23z)| norm 0.2464 (-0.45z)| lr 4.79e-04 | 4191.55 ms | 32.2% bf16 MFU | 124678 tok/s step 9113/19560 | loss 3.405110 (-0.90z)| norm 0.2745 (+0.14z)| lr 4.79e-04 | 4189.61 ms | 32.2% bf16 MFU | 124701 tok/s step 9114/19560 | loss 3.439451 (-0.09z)| norm 0.2547 (-0.28z)| lr 4.79e-04 | 4181.12 ms | 32.3% bf16 MFU | 124736 tok/s step 9115/19560 | loss 3.470961 (+0.64z)| norm 0.2714 (+0.07z)| lr 4.78e-04 | 4193.83 ms | 32.2% bf16 MFU | 124750 tok/s step 9116/19560 | loss 3.430570 (-0.30z)| norm 0.2386 (-0.62z)| lr 4.78e-04 | 4173.16 ms | 32.4% bf16 MFU | 124794 tok/s step 9117/19560 | loss 3.416202 (-0.64z)| norm 0.2782 (+0.47z)| lr 4.78e-04 | 4171.95 ms | 32.4% bf16 MFU | 124838 tok/s step 9118/19560 | loss 3.417391 (-0.60z)| norm 0.2501 (-0.50z)| lr 4.78e-04 | 4183.15 ms | 32.3% bf16 MFU | 124862 tok/s step 9119/19560 | loss 3.474801 (+0.77z)| norm 0.2257 (-1.38z)| lr 4.78e-04 | 4169.84 ms | 32.4% bf16 MFU | 124906 tok/s step 9120/19560 | loss 3.383699 (-1.39z)| norm 0.2653 (+0.10z)| lr 4.78e-04 | 4179.70 ms | 32.3% bf16 MFU | 124933 tok/s step 9121/19560 | loss 3.401261 (-0.96z)| norm 0.2508 (-0.44z)| lr 4.78e-04 | 4191.23 ms | 
32.2% bf16 MFU | 124940 tok/s step 9122/19560 | loss 3.484938 (+1.03z)| norm 0.2544 (-0.29z)| lr 4.78e-04 | 4172.23 ms | 32.4% bf16 MFU | 124977 tok/s step 9123/19560 | loss 3.361646 (-1.89z)| norm 0.2396 (-0.84z)| lr 4.78e-04 | 4194.14 ms | 32.2% bf16 MFU | 124978 tok/s step 9124/19560 | loss 3.445966 (+0.10z)| norm 0.2382 (-0.88z)| lr 4.78e-04 | 4176.09 ms | 32.3% bf16 MFU | 125006 tok/s step 9125/19560 | loss 3.448312 (+0.16z)| norm 0.2376 (-0.89z)| lr 4.78e-04 | 4183.92 ms | 32.3% bf16 MFU | 125022 tok/s step 9126/19560 | loss 3.393377 (-1.14z)| norm 0.2544 (-0.26z)| lr 4.78e-04 | 4183.98 ms | 32.3% bf16 MFU | 125036 tok/s step 9127/19560 | loss 3.390286 (-1.19z)| norm 0.2277 (-1.24z)| lr 4.78e-04 | 4182.58 ms | 32.3% bf16 MFU | 125052 tok/s step 9128/19560 | loss 3.469101 (+0.65z)| norm 0.2508 (-0.38z)| lr 4.78e-04 | 4170.56 ms | 32.4% bf16 MFU | 125085 tok/s step 9129/19560 | loss 3.357498 (-1.92z)| norm 0.2426 (-0.68z)| lr 4.78e-04 | 4180.89 ms | 32.3% bf16 MFU | 125100 tok/s step 9130/19560 | loss 3.470091 (+0.68z)| norm 0.2979 (+1.35z)| lr 4.77e-04 | 4172.39 ms | 32.4% bf16 MFU | 125128 tok/s step 9131/19560 | loss 3.429274 (-0.27z)| norm 0.2424 (-0.68z)| lr 4.77e-04 | 4173.44 ms | 32.4% bf16 MFU | 125153 tok/s step 9132/19560 | loss 3.420036 (-0.48z)| norm 0.2522 (-0.32z)| lr 4.77e-04 | 4170.27 ms | 32.4% bf16 MFU | 125181 tok/s step 9133/19560 | loss 3.405461 (-0.82z)| norm 0.2542 (-0.25z)| lr 4.77e-04 | 4200.97 ms | 32.1% bf16 MFU | 125162 tok/s step 9134/19560 | loss 3.387441 (-1.24z)| norm 0.2505 (-0.38z)| lr 4.77e-04 | 4189.80 ms | 32.2% bf16 MFU | 125161 tok/s step 9135/19560 | loss 3.446341 (+0.12z)| norm 0.2612 (+0.01z)| lr 4.77e-04 | 4182.61 ms | 32.3% bf16 MFU | 125170 tok/s step 9136/19560 | loss 3.422729 (-0.42z)| norm 0.2515 (-0.34z)| lr 4.77e-04 | 4182.16 ms | 32.3% bf16 MFU | 125180 tok/s step 9137/19560 | loss 3.411749 (-0.67z)| norm 0.2504 (-0.38z)| lr 4.77e-04 | 4186.94 ms | 32.2% bf16 MFU | 125182 tok/s step 9138/19560 | loss 3.434899 
(-0.14z)| norm 0.2228 (-1.52z)| lr 4.77e-04 | 4164.76 ms | 32.4% bf16 MFU | 125217 tok/s step 9139/19560 | loss 3.501361 (+1.39z)| norm 0.2528 (-0.26z)| lr 4.77e-04 | 4174.76 ms | 32.3% bf16 MFU | 125236 tok/s step 9140/19560 | loss 3.486226 (+1.03z)| norm 0.2317 (-1.12z)| lr 4.77e-04 | 4177.10 ms | 32.3% bf16 MFU | 125250 tok/s step 9141/19560 | loss 3.459216 (+0.45z)| norm 0.2598 (+0.06z)| lr 4.77e-04 | 4165.90 ms | 32.4% bf16 MFU | 125280 tok/s step 9142/19560 | loss 3.421032 (-0.48z)| norm 0.2191 (-1.63z)| lr 4.77e-04 | 4185.48 ms | 32.3% bf16 MFU | 125279 tok/s step 9143/19560 | loss 3.359273 (-1.92z)| norm 0.2649 (+0.30z)| lr 4.77e-04 | 4170.68 ms | 32.4% bf16 MFU | 125300 tok/s step 9144/19560 | loss 3.385561 (-1.27z)| norm 0.2497 (-0.33z)| lr 4.77e-04 | 4175.78 ms | 32.3% bf16 MFU | 125313 tok/s step 9145/19560 | loss 3.488310 (+1.14z)| norm 0.2605 (+0.13z)| lr 4.76e-04 | 4180.45 ms | 32.3% bf16 MFU | 125318 tok/s step 9146/19560 | loss 3.448346 (+0.21z)| norm 0.2605 (+0.13z)| lr 4.76e-04 | 4176.22 ms | 32.3% bf16 MFU | 125329 tok/s step 9147/19560 | loss 3.438801 (-0.01z)| norm 0.2537 (-0.16z)| lr 4.76e-04 | 4170.36 ms | 32.4% bf16 MFU | 125349 tok/s step 9148/19560 | loss 3.407914 (-0.74z)| norm 0.2722 (+0.62z)| lr 4.76e-04 | 4209.04 ms | 32.1% bf16 MFU | 125309 tok/s step 9149/19560 | loss 3.496120 (+1.41z)| norm 0.2731 (+0.66z)| lr 4.76e-04 | 4184.00 ms | 32.3% bf16 MFU | 125309 tok/s step 9150/19560 | loss 3.419531 (-0.46z)| norm 0.2628 (+0.22z)| lr 4.76e-04 | 4176.39 ms | 32.3% bf16 MFU | 125321 tok/s step 9151/19560 | loss 3.514773 (+1.83z)| norm 0.2857 (+1.17z)| lr 4.76e-04 | 4192.15 ms | 32.2% bf16 MFU | 125308 tok/s step 9152/19560 | loss 3.372806 (-1.58z)| norm 0.2469 (-0.46z)| lr 4.76e-04 | 4174.38 ms | 32.3% bf16 MFU | 125322 tok/s step 9153/19560 | loss 3.412423 (-0.62z)| norm 0.2713 (+0.56z)| lr 4.76e-04 | 4166.77 ms | 32.4% bf16 MFU | 125347 tok/s step 9154/19560 | loss 3.439389 (+0.04z)| norm 0.2625 (+0.19z)| lr 4.76e-04 | 4167.80 ms | 
32.4% bf16 MFU | 125370 tok/s step 9155/19560 | loss 3.433434 (-0.11z)| norm 0.2861 (+1.17z)| lr 4.76e-04 | 4176.91 ms | 32.3% bf16 MFU | 125377 tok/s step 9156/19560 | loss 3.353383 (-2.02z)| norm 0.2667 (+0.35z)| lr 4.76e-04 | 4186.58 ms | 32.3% bf16 MFU | 125370 tok/s step 9157/19560 | loss 3.390150 (-1.15z)| norm 0.2568 (-0.07z)| lr 4.76e-04 | 4182.77 ms | 32.3% bf16 MFU | 125369 tok/s step 9158/19560 | loss 3.465832 (+0.66z)| norm 0.2844 (+1.08z)| lr 4.76e-04 | 4179.88 ms | 32.3% bf16 MFU | 125372 tok/s step 9159/19560 | loss 3.408139 (-0.74z)| norm 0.2733 (+0.61z)| lr 4.76e-04 | 4190.79 ms | 32.2% bf16 MFU | 125359 tok/s step 9160/19560 | loss 3.507172 (+1.71z)| norm 0.2286 (-1.24z)| lr 4.75e-04 | 4184.43 ms | 32.3% bf16 MFU | 125355 tok/s step 9161/19560 | loss 3.445278 (+0.18z)| norm 0.2704 (+0.49z)| lr 4.75e-04 | 4179.77 ms | 32.3% bf16 MFU | 125359 tok/s step 9162/19560 | loss 3.401350 (-0.90z)| norm 0.2384 (-0.84z)| lr 4.75e-04 | 4245.94 ms | 31.8% bf16 MFU | 125265 tok/s step 9163/19560 | loss 3.412842 (-0.60z)| norm 0.2457 (-0.54z)| lr 4.75e-04 | 4175.39 ms | 32.3% bf16 MFU | 125280 tok/s step 9164/19560 | loss 3.450998 (+0.34z)| norm 0.2787 (+0.82z)| lr 4.75e-04 | 4179.77 ms | 32.3% bf16 MFU | 125288 tok/s step 9165/19560 | loss 3.489380 (+1.27z)| norm 0.2468 (-0.49z)| lr 4.75e-04 | 4178.06 ms | 32.3% bf16 MFU | 125298 tok/s step 9166/19560 | loss 3.422944 (-0.36z)| norm 0.2647 (+0.25z)| lr 4.75e-04 | 4186.46 ms | 32.3% bf16 MFU | 125295 tok/s step 9167/19560 | loss 3.437891 (+0.01z)| norm 0.2319 (-1.11z)| lr 4.75e-04 | 4181.41 ms | 32.3% bf16 MFU | 125299 tok/s step 9168/19560 | loss 3.453238 (+0.39z)| norm 0.2678 (+0.37z)| lr 4.75e-04 | 4176.45 ms | 32.3% bf16 MFU | 125311 tok/s step 9169/19560 | loss 3.386219 (-1.25z)| norm 0.2790 (+0.82z)| lr 4.75e-04 | 4183.35 ms | 32.3% bf16 MFU | 125312 tok/s step 9170/19560 | loss 3.395885 (-1.00z)| norm 0.2692 (+0.41z)| lr 4.75e-04 | 4172.92 ms | 32.4% bf16 MFU | 125328 tok/s step 9171/19560 | loss 3.413661 
(-0.57z)| norm 0.2749 (+0.63z)| lr 4.75e-04 | 4175.22 ms | 32.3% bf16 MFU | 125341 tok/s step 9172/19560 | loss 3.389713 (-1.16z)| norm 0.2481 (-0.49z)| lr 4.75e-04 | 4232.82 ms | 31.9% bf16 MFU | 125267 tok/s step 9173/19560 | loss 3.426956 (-0.25z)| norm 0.2642 (+0.18z)| lr 4.75e-04 | 4184.48 ms | 32.3% bf16 MFU | 125268 tok/s step 9174/19560 | loss 3.398406 (-0.94z)| norm 0.2488 (-0.47z)| lr 4.75e-04 | 4195.04 ms | 32.2% bf16 MFU | 125253 tok/s step 9175/19560 | loss 3.504792 (+1.62z)| norm 0.2366 (-0.98z)| lr 4.74e-04 | 4176.44 ms | 32.3% bf16 MFU | 125268 tok/s step 9176/19560 | loss 3.422921 (-0.36z)| norm 0.2668 (+0.28z)| lr 4.74e-04 | 4177.45 ms | 32.3% bf16 MFU | 125279 tok/s step 9177/19560 | loss 3.456243 (+0.44z)| norm 0.2251 (-1.45z)| lr 4.74e-04 | 4169.84 ms | 32.4% bf16 MFU | 125302 tok/s step 9178/19560 | loss 3.455559 (+0.42z)| norm 0.2664 (+0.26z)| lr 4.74e-04 | 4175.35 ms | 32.3% bf16 MFU | 125315 tok/s step 9179/19560 | loss 3.393559 (-1.09z)| norm 0.2371 (-0.95z)| lr 4.74e-04 | 4180.98 ms | 32.3% bf16 MFU | 125319 tok/s step 9180/19560 | loss 3.365056 (-1.75z)| norm 0.2433 (-0.70z)| lr 4.74e-04 | 4184.36 ms | 32.3% bf16 MFU | 125318 tok/s step 9181/19560 | loss 3.418010 (-0.47z)| norm 0.2621 (+0.08z)| lr 4.74e-04 | 4177.35 ms | 32.3% bf16 MFU | 125328 tok/s step 9182/19560 | loss 3.399420 (-0.92z)| norm 0.2445 (-0.64z)| lr 4.74e-04 | 4231.87 ms | 31.9% bf16 MFU | 125256 tok/s step 9183/19560 | loss 3.340035 (-2.29z)| norm 0.2637 (+0.15z)| lr 4.74e-04 | 4177.16 ms | 32.3% bf16 MFU | 125269 tok/s step 9184/19560 | loss 3.392406 (-1.04z)| norm 0.2330 (-1.12z)| lr 4.74e-04 | 4177.31 ms | 32.3% bf16 MFU | 125281 tok/s step 9185/19560 | loss 3.373752 (-1.45z)| norm 0.2496 (-0.44z)| lr 4.74e-04 | 4172.59 ms | 32.4% bf16 MFU | 125299 tok/s step 9186/19560 | loss 3.424400 (-0.26z)| norm 0.2617 (+0.06z)| lr 4.74e-04 | 4183.69 ms | 32.3% bf16 MFU | 125300 tok/s step 9187/19560 | loss 3.429541 (-0.14z)| norm 0.2597 (-0.02z)| lr 4.74e-04 | 4202.03 ms | 
32.1% bf16 MFU | 125274 tok/s step 9188/19560 | loss 3.422024 (-0.31z)| norm 0.2814 (+0.87z)| lr 4.74e-04 | 4182.47 ms | 32.3% bf16 MFU | 125278 tok/s step 9189/19560 | loss 3.450454 (+0.37z)| norm 0.2489 (-0.48z)| lr 4.74e-04 | 4182.13 ms | 32.3% bf16 MFU | 125282 tok/s step 9190/19560 | loss 3.400766 (-0.80z)| norm 0.2603 (-0.01z)| lr 4.73e-04 | 4169.82 ms | 32.4% bf16 MFU | 125305 tok/s step 9191/19560 | loss 3.408563 (-0.61z)| norm 0.2412 (-0.81z)| lr 4.73e-04 | 4172.85 ms | 32.4% bf16 MFU | 125321 tok/s step 9192/19560 | loss 3.438958 (+0.10z)| norm 0.2650 (+0.19z)| lr 4.73e-04 | 4190.37 ms | 32.2% bf16 MFU | 125311 tok/s step 9193/19560 | loss 3.376359 (-1.43z)| norm 0.2467 (-0.58z)| lr 4.73e-04 | 4178.02 ms | 32.3% bf16 MFU | 125320 tok/s step 9194/19560 | loss 3.367829 (-1.61z)| norm 0.2610 (+0.04z)| lr 4.73e-04 | 4186.73 ms | 32.2% bf16 MFU | 125315 tok/s step 9195/19560 | loss 3.433712 (-0.02z)| norm 0.2440 (-0.68z)| lr 4.73e-04 | 4218.56 ms | 32.0% bf16 MFU | 125264 tok/s step 9196/19560 | loss 3.513401 (+1.95z)| norm 0.2433 (-0.70z)| lr 4.73e-04 | 4163.34 ms | 32.4% bf16 MFU | 125297 tok/s step 9197/19560 | loss 3.342035 (-2.23z)| norm 0.2733 (+0.58z)| lr 4.73e-04 | 4169.72 ms | 32.4% bf16 MFU | 125319 tok/s step 9198/19560 | loss 3.400486 (-0.80z)| norm 0.2519 (-0.34z)| lr 4.73e-04 | 4182.82 ms | 32.3% bf16 MFU | 125320 tok/s step 9199/19560 | loss 3.482600 (+1.19z)| norm 0.2653 (+0.23z)| lr 4.73e-04 | 4169.85 ms | 32.4% bf16 MFU | 125341 tok/s step 9200/19560 | loss 3.429717 (-0.09z)| norm 0.2773 (+0.74z)| lr 4.73e-04 | 4199.14 ms | 32.2% bf16 MFU | 125317 tok/s step 9201/19560 | loss 3.468309 (+0.85z)| norm 0.2594 (-0.02z)| lr 4.73e-04 | 4185.46 ms | 32.3% bf16 MFU | 125314 tok/s step 9202/19560 | loss 3.441031 (+0.19z)| norm 0.2566 (-0.15z)| lr 4.73e-04 | 4175.29 ms | 32.3% bf16 MFU | 125327 tok/s step 9203/19560 | loss 3.385163 (-1.16z)| norm 0.2545 (-0.24z)| lr 4.73e-04 | 4169.08 ms | 32.4% bf16 MFU | 125348 tok/s step 9204/19560 | loss 3.461205 
(+0.67z)| norm 0.2901 (+1.29z)| lr 4.73e-04 | 4203.38 ms | 32.1% bf16 MFU | 125317 tok/s step 9205/19560 | loss 3.436853 (+0.09z)| norm 0.2584 (-0.07z)| lr 4.72e-04 | 4167.26 ms | 32.4% bf16 MFU | 125342 tok/s step 9206/19560 | loss 3.345203 (-2.09z)| norm 0.2689 (+0.39z)| lr 4.72e-04 | 4178.32 ms | 32.3% bf16 MFU | 125349 tok/s step 9207/19560 | loss 3.441101 (+0.21z)| norm 0.2843 (+1.04z)| lr 4.72e-04 | 4171.15 ms | 32.4% bf16 MFU | 125366 tok/s step 9208/19560 | loss 3.438343 (+0.16z)| norm 0.2653 (+0.23z)| lr 4.72e-04 | 4177.07 ms | 32.3% bf16 MFU | 125374 tok/s step 9209/19560 | loss 3.374877 (-1.37z)| norm 0.2566 (-0.15z)| lr 4.72e-04 | 4177.15 ms | 32.3% bf16 MFU | 125381 tok/s step 9210/19560 | loss 3.432183 (+0.03z)| norm 0.2647 (+0.20z)| lr 4.72e-04 | 4176.75 ms | 32.3% bf16 MFU | 125388 tok/s step 9211/19560 | loss 3.465630 (+0.84z)| norm 0.2750 (+0.65z)| lr 4.72e-04 | 4168.91 ms | 32.4% bf16 MFU | 125406 tok/s step 9212/19560 | loss 3.435921 (+0.12z)| norm 0.2699 (+0.42z)| lr 4.72e-04 | 4161.86 ms | 32.4% bf16 MFU | 125435 tok/s step 9213/19560 | loss 3.506379 (+1.81z)| norm 0.2788 (+0.81z)| lr 4.72e-04 | 4168.19 ms | 32.4% bf16 MFU | 125452 tok/s step 9214/19560 | loss 3.463357 (+0.77z)| norm 0.2683 (+0.34z)| lr 4.72e-04 | 4173.54 ms | 32.4% bf16 MFU | 125461 tok/s step 9215/19560 | loss 3.418800 (-0.30z)| norm 0.2608 (+0.10z)| lr 4.72e-04 | 4205.79 ms | 32.1% bf16 MFU | 125421 tok/s step 9216/19560 | loss 3.409882 (-0.51z)| norm 0.2605 (+0.08z)| lr 4.72e-04 | 4199.55 ms | 32.2% bf16 MFU | 125392 tok/s step 9217/19560 | loss 3.443162 (+0.30z)| norm 0.2511 (-0.50z)| lr 4.72e-04 | 4232.36 ms | 31.9% bf16 MFU | 125316 tok/s step 9218/19560 | loss 3.429581 (-0.02z)| norm 0.2462 (-0.79z)| lr 4.72e-04 | 4183.31 ms | 32.3% bf16 MFU | 125317 tok/s step 9219/19560 | loss 3.343385 (-2.07z)| norm 0.2637 (+0.30z)| lr 4.72e-04 | 4168.20 ms | 32.4% bf16 MFU | 125340 tok/s step 9220/19560 | loss 3.432721 (+0.08z)| norm 0.2408 (-1.10z)| lr 4.71e-04 | 4189.19 ms | 
32.2% bf16 MFU | 125331 tok/s step 9221/19560 | loss 3.479871 (+1.19z)| norm 0.2667 (+0.50z)| lr 4.71e-04 | 4240.04 ms | 31.8% bf16 MFU | 125247 tok/s step 9222/19560 | loss 3.397710 (-0.77z)| norm 0.2448 (-0.85z)| lr 4.71e-04 | 4187.52 ms | 32.2% bf16 MFU | 125244 tok/s step 9223/19560 | loss 3.429574 (-0.01z)| norm 0.2427 (-0.97z)| lr 4.71e-04 | 4209.38 ms | 32.1% bf16 MFU | 125210 tok/s step 9224/19560 | loss 3.342040 (-2.08z)| norm 0.2393 (-1.17z)| lr 4.71e-04 | 4174.86 ms | 32.3% bf16 MFU | 125228 tok/s step 9225/19560 | loss 3.336599 (-2.15z)| norm 0.2797 (+1.30z)| lr 4.71e-04 | 4177.40 ms | 32.3% bf16 MFU | 125242 tok/s step 9226/19560 | loss 3.498339 (+1.62z)| norm 0.2584 (-0.01z)| lr 4.71e-04 | 4174.82 ms | 32.3% bf16 MFU | 125259 tok/s step 9227/19560 | loss 3.414623 (-0.32z)| norm 0.2250 (-2.00z)| lr 4.71e-04 | 4176.71 ms | 32.3% bf16 MFU | 125273 tok/s step 9228/19560 | loss 3.359338 (-1.59z)| norm 0.2626 (+0.27z)| lr 4.71e-04 | 4173.48 ms | 32.4% bf16 MFU | 125290 tok/s step 9229/19560 | loss 3.402179 (-0.57z)| norm 0.3292 (+3.99z)| lr 4.71e-04 | 4225.93 ms | 31.9% bf16 MFU | 125229 tok/s step 9230/19560 | loss 3.375103 (-1.20z)| norm 0.3041 (+2.49z)| lr 4.71e-04 | 4178.91 ms | 32.3% bf16 MFU | 125241 tok/s step 9231/19560 | loss 3.431446 (+0.14z)| norm 0.2646 (+0.33z)| lr 4.71e-04 | 4186.77 ms | 32.2% bf16 MFU | 125240 tok/s step 9232/19560 | loss 3.499648 (+1.72z)| norm 0.2866 (+1.54z)| lr 4.71e-04 | 4173.98 ms | 32.3% bf16 MFU | 125258 tok/s step 9233/19560 | loss 3.438035 (+0.28z)| norm 0.2484 (-0.58z)| lr 4.71e-04 | 4213.46 ms | 32.0% bf16 MFU | 125217 tok/s step 9234/19560 | loss 3.408003 (-0.41z)| norm 0.2648 (+0.34z)| lr 4.71e-04 | 4193.26 ms | 32.2% bf16 MFU | 125208 tok/s step 9235/19560 | loss 3.377010 (-1.14z)| norm 0.2601 (+0.09z)| lr 4.70e-04 | 4191.17 ms | 32.2% bf16 MFU | 125202 tok/s step 9236/19560 | loss 3.382264 (-1.00z)| norm 0.2573 (-0.06z)| lr 4.70e-04 | 4195.64 ms | 32.2% bf16 MFU | 125190 tok/s step 9237/19560 | loss 3.431865 
(+0.19z)| norm 0.2756 (+0.97z)| lr 4.70e-04 | 4176.07 ms | 32.3% bf16 MFU | 125208 tok/s step 9238/19560 | loss 3.415107 (-0.21z)| norm 0.2479 (-0.58z)| lr 4.70e-04 | 4166.34 ms | 32.4% bf16 MFU | 125239 tok/s step 9239/19560 | loss 3.467712 (+1.05z)| norm 0.2659 (+0.45z)| lr 4.70e-04 | 4237.79 ms | 31.9% bf16 MFU | 125163 tok/s step 9240/19560 | loss 3.369173 (-1.31z)| norm 0.2440 (-0.82z)| lr 4.70e-04 | 4210.61 ms | 32.1% bf16 MFU | 125131 tok/s step 9241/19560 | loss 3.424774 (+0.02z)| norm 0.2493 (-0.50z)| lr 4.70e-04 | 4172.44 ms | 32.4% bf16 MFU | 125157 tok/s step 9242/19560 | loss 3.371128 (-1.24z)| norm 0.2610 (+0.17z)| lr 4.70e-04 | 4203.52 ms | 32.1% bf16 MFU | 125135 tok/s step 9243/19560 | loss 3.430614 (+0.18z)| norm 0.2555 (-0.14z)| lr 4.70e-04 | 4194.06 ms | 32.2% bf16 MFU | 125129 tok/s step 9244/19560 | loss 3.412272 (-0.25z)| norm 0.2701 (+0.70z)| lr 4.70e-04 | 4195.24 ms | 32.2% bf16 MFU | 125121 tok/s step 9245/19560 | loss 3.414762 (-0.19z)| norm 0.2553 (-0.15z)| lr 4.70e-04 | 4175.39 ms | 32.3% bf16 MFU | 125143 tok/s step 9246/19560 | loss 3.400895 (-0.52z)| norm 0.2514 (-0.38z)| lr 4.70e-04 | 4193.66 ms | 32.2% bf16 MFU | 125137 tok/s step 9247/19560 | loss 3.440081 (+0.43z)| norm 0.2533 (-0.29z)| lr 4.70e-04 | 4185.31 ms | 32.3% bf16 MFU | 125144 tok/s step 9248/19560 | loss 3.405187 (-0.42z)| norm 0.2296 (-1.67z)| lr 4.70e-04 | 4172.65 ms | 32.4% bf16 MFU | 125169 tok/s step 9249/19560 | loss 3.420773 (-0.05z)| norm 0.2611 (+0.18z)| lr 4.70e-04 | 4177.95 ms | 32.3% bf16 MFU | 125185 tok/s step 9250/19560 | loss 3.540437 (+2.77z)| norm 0.2552 (-0.16z)| lr 4.69e-04 | 4444.96 ms | 30.4% bf16 MFU | 124823 tok/s val loss 3.399588
HellaSwag: 2911/10042 = 0.289882 step 9251/19560 | loss 3.441675 (+0.43z)| norm 0.2675 (+0.55z)| lr 4.69e-04 | 4198.99 ms | 32.2% bf16 MFU | 124825 tok/s step 9252/19560 | loss 3.410602 (-0.31z)| norm 0.2607 (+0.14z)| lr 4.69e-04 | 4174.19 ms | 32.3% bf16 MFU | 124864 tok/s step 9253/19560 | loss 3.466882 (+1.03z)| norm 0.2734 (+0.88z)| lr 4.69e-04 | 4271.16 ms | 31.6% bf16 MFU |
124758 tok/s step 9254/19560 | loss 3.413737 (-0.24z)| norm 0.2375 (-1.25z)| lr 4.69e-04 | 4174.23 ms | 32.3% bf16 MFU | 124800 tok/s step 9255/19560 | loss 3.440927 (+0.40z)| norm 0.2792 (+1.21z)| lr 4.69e-04 | 4221.53 ms | 32.0% bf16 MFU | 124770 tok/s step 9256/19560 | loss 3.432068 (+0.20z)| norm 0.2598 (+0.05z)| lr 4.69e-04 | 4175.62 ms | 32.3% bf16 MFU | 124810 tok/s step 9257/19560 | loss 3.394441 (-0.72z)| norm 0.2662 (+0.42z)| lr 4.69e-04 | 4318.11 ms | 31.3% bf16 MFU | 124640 tok/s step 9258/19560 | loss 3.386836 (-0.89z)| norm 0.2558 (-0.18z)| lr 4.69e-04 | 4234.85 ms | 31.9% bf16 MFU | 124598 tok/s step 9259/19560 | loss 3.526249 (+2.41z)| norm 0.2558 (-0.19z)| lr 4.69e-04 | 4553.30 ms | 29.7% bf16 MFU | 124125 tok/s step 9260/19560 | loss 3.336428 (-2.03z)| norm 0.2567 (-0.14z)| lr 4.69e-04 | 4979.58 ms | 27.1% bf16 MFU | 123184 tok/s step 9261/19560 | loss 3.427171 (+0.08z)| norm 0.2661 (+0.43z)| lr 4.69e-04 | 4326.56 ms | 31.2% bf16 MFU | 123083 tok/s step 9262/19560 | loss 3.416171 (-0.18z)| norm 0.2554 (-0.23z)| lr 4.69e-04 | 4392.53 ms | 30.7% bf16 MFU | 122897 tok/s step 9263/19560 | loss 3.434338 (+0.24z)| norm 0.2525 (-0.40z)| lr 4.69e-04 | 4273.34 ms | 31.6% bf16 MFU | 122887 tok/s step 9264/19560 | loss 3.399096 (-0.58z)| norm 0.2509 (-0.50z)| lr 4.69e-04 | 4470.82 ms | 30.2% bf16 MFU | 122606 tok/s step 9265/19560 | loss 3.391715 (-0.74z)| norm 0.2248 (-2.06z)| lr 4.68e-04 | 4201.92 ms | 32.1% bf16 MFU | 122714 tok/s step 9266/19560 | loss 3.431043 (+0.17z)| norm 0.2466 (-0.76z)| lr 4.68e-04 | 4293.37 ms | 31.4% bf16 MFU | 122684 tok/s step 9267/19560 | loss 3.404932 (-0.42z)| norm 0.2348 (-1.47z)| lr 4.68e-04 | 4166.40 ms | 32.4% bf16 MFU | 122842 tok/s step 9268/19560 | loss 3.486690 (+1.51z)| norm 0.2490 (-0.62z)| lr 4.68e-04 | 4270.76 ms | 31.6% bf16 MFU | 122838 tok/s step 9269/19560 | loss 3.426285 (+0.09z)| norm 0.4052 (+7.01z)| lr 4.68e-04 | 4173.70 ms | 32.3% bf16 MFU | 122977 tok/s step 9270/19560 | loss 3.447623 (+0.59z)| norm 
0.3011 (+1.95z)| lr 4.68e-04 | 4353.61 ms | 31.0% bf16 MFU | 122849 tok/s step 9271/19560 | loss 3.471700 (+1.14z)| norm 0.2731 (+0.59z)| lr 4.68e-04 | 4244.71 ms | 31.8% bf16 MFU | 122883 tok/s step 9272/19560 | loss 3.437714 (+0.32z)| norm 0.2648 (+0.18z)| lr 4.68e-04 | 4230.86 ms | 31.9% bf16 MFU | 122934 tok/s step 9273/19560 | loss 3.423146 (-0.01z)| norm 0.3211 (+2.79z)| lr 4.68e-04 | 4171.58 ms | 32.4% bf16 MFU | 123072 tok/s step 9274/19560 | loss 3.431694 (+0.20z)| norm 0.3137 (+2.37z)| lr 4.68e-04 | 4173.93 ms | 32.3% bf16 MFU | 123199 tok/s step 9275/19560 | loss 3.426300 (+0.07z)| norm 0.2652 (+0.15z)| lr 4.68e-04 | 4233.06 ms | 31.9% bf16 MFU | 123232 tok/s step 9276/19560 | loss 3.370887 (-1.25z)| norm 0.2953 (+1.51z)| lr 4.68e-04 | 4176.85 ms | 32.3% bf16 MFU | 123346 tok/s step 9277/19560 | loss 3.389473 (-0.79z)| norm 0.2367 (-1.14z)| lr 4.68e-04 | 4178.68 ms | 32.3% bf16 MFU | 123452 tok/s step 9278/19560 | loss 3.396955 (-0.60z)| norm 0.2588 (-0.14z)| lr 4.68e-04 | 4176.08 ms | 32.3% bf16 MFU | 123557 tok/s step 9279/19560 | loss 3.400304 (-0.51z)| norm 0.2529 (-0.39z)| lr 4.68e-04 | 4216.65 ms | 32.0% bf16 MFU | 123596 tok/s step 9280/19560 | loss 3.511680 (+2.17z)| norm 0.2699 (+0.37z)| lr 4.67e-04 | 4497.60 ms | 30.0% bf16 MFU | 123245 tok/s step 9281/19560 | loss 3.414257 (-0.19z)| norm 0.2499 (-0.53z)| lr 4.67e-04 | 4171.34 ms | 32.4% bf16 MFU | 123367 tok/s step 9282/19560 | loss 3.426157 (+0.10z)| norm 0.2862 (+1.10z)| lr 4.67e-04 | 4192.05 ms | 32.2% bf16 MFU | 123452 tok/s step 9283/19560 | loss 3.440338 (+0.44z)| norm 0.2538 (-0.35z)| lr 4.67e-04 | 4174.32 ms | 32.3% bf16 MFU | 123559 tok/s step 9284/19560 | loss 3.471352 (+1.18z)| norm 0.2538 (-0.34z)| lr 4.67e-04 | 4305.08 ms | 31.4% bf16 MFU | 123470 tok/s step 9285/19560 | loss 3.401570 (-0.53z)| norm 0.2455 (-0.72z)| lr 4.67e-04 | 4247.64 ms | 31.8% bf16 MFU | 123468 tok/s step 9286/19560 | loss 3.401481 (-0.52z)| norm 0.2392 (-0.99z)| lr 4.67e-04 | 4185.10 ms | 32.3% bf16 MFU | 
123559 tok/s step 9287/19560 | loss 3.413983 (-0.21z)| norm 0.2484 (-0.56z)| lr 4.67e-04 | 4176.08 ms | 32.3% bf16 MFU | 123658 tok/s step 9288/19560 | loss 3.364135 (-1.42z)| norm 0.2525 (-0.39z)| lr 4.67e-04 | 4181.51 ms | 32.3% bf16 MFU | 123744 tok/s step 9289/19560 | loss 3.378325 (-1.05z)| norm 0.2594 (-0.06z)| lr 4.67e-04 | 4378.48 ms | 30.8% bf16 MFU | 123544 tok/s step 9290/19560 | loss 3.402463 (-0.46z)| norm 0.2769 (+0.72z)| lr 4.67e-04 | 4180.02 ms | 32.3% bf16 MFU | 123638 tok/s step 9291/19560 | loss 3.354160 (-1.62z)| norm 0.2658 (+0.21z)| lr 4.67e-04 | 4192.13 ms | 32.2% bf16 MFU | 123710 tok/s step 9292/19560 | loss 3.420463 (+0.00z)| norm 0.2908 (+1.34z)| lr 4.67e-04 | 4181.87 ms | 32.3% bf16 MFU | 123793 tok/s step 9293/19560 | loss 3.409074 (-0.26z)| norm 0.2871 (+1.16z)| lr 4.67e-04 | 4245.69 ms | 31.8% bf16 MFU | 123777 tok/s step 9294/19560 | loss 3.406652 (-0.32z)| norm 0.2693 (+0.34z)| lr 4.67e-04 | 4216.71 ms | 32.0% bf16 MFU | 123805 tok/s step 9295/19560 | loss 3.394474 (-0.61z)| norm 0.3021 (+1.80z)| lr 4.66e-04 | 4186.79 ms | 32.2% bf16 MFU | 123876 tok/s step 9296/19560 | loss 3.416444 (-0.06z)| norm 0.2677 (+0.24z)| lr 4.66e-04 | 4184.24 ms | 32.3% bf16 MFU | 123947 tok/s step 9297/19560 | loss 3.418726 (-0.01z)| norm 0.2956 (+1.49z)| lr 4.66e-04 | 4177.28 ms | 32.3% bf16 MFU | 124026 tok/s step 9298/19560 | loss 3.442222 (+0.56z)| norm 0.2608 (-0.07z)| lr 4.66e-04 | 4181.71 ms | 32.3% bf16 MFU | 124093 tok/s step 9299/19560 | loss 3.456970 (+0.92z)| norm 0.2903 (+1.25z)| lr 4.66e-04 | 4210.88 ms | 32.1% bf16 MFU | 124114 tok/s step 9300/19560 | loss 3.358562 (-1.50z)| norm 0.2488 (-0.61z)| lr 4.66e-04 | 4206.40 ms | 32.1% bf16 MFU | 124140 tok/s step 9301/19560 | loss 3.404768 (-0.36z)| norm 0.2592 (-0.14z)| lr 4.66e-04 | 4301.24 ms | 31.4% bf16 MFU | 124028 tok/s step 9302/19560 | loss 3.385779 (-0.83z)| norm 0.2579 (-0.21z)| lr 4.66e-04 | 4196.90 ms | 32.2% bf16 MFU | 124073 tok/s step 9303/19560 | loss 3.405331 (-0.33z)| norm 
0.2470 (-0.70z)| lr 4.66e-04 | 4207.89 ms | 32.1% bf16 MFU | 124099 tok/s step 9304/19560 | loss 3.475236 (+1.39z)| norm 0.2489 (-0.61z)| lr 4.66e-04 | 4168.39 ms | 32.4% bf16 MFU | 124183 tok/s step 9305/19560 | loss 3.371128 (-1.17z)| norm 0.2403 (-1.01z)| lr 4.66e-04 | 4191.01 ms | 32.2% bf16 MFU | 124228 tok/s step 9306/19560 | loss 3.458821 (+1.00z)| norm 0.2727 (+0.45z)| lr 4.66e-04 | 4179.01 ms | 32.3% bf16 MFU | 124290 tok/s step 9307/19560 | loss 3.362009 (-1.38z)| norm 0.2667 (+0.18z)| lr 4.66e-04 | 4194.35 ms | 32.2% bf16 MFU | 124325 tok/s step 9308/19560 | loss 3.420294 (+0.04z)| norm 0.2681 (+0.23z)| lr 4.66e-04 | 4277.32 ms | 31.6% bf16 MFU | 124238 tok/s step 9309/19560 | loss 3.433847 (+0.37z)| norm 0.2504 (-0.58z)| lr 4.66e-04 | 4184.65 ms | 32.3% bf16 MFU | 124290 tok/s step 9310/19560 | loss 3.378460 (-0.99z)| norm 0.2592 (-0.18z)| lr 4.65e-04 | 4175.59 ms | 32.3% bf16 MFU | 124354 tok/s step 9311/19560 | loss 3.382078 (-0.92z)| norm 0.2611 (-0.09z)| lr 4.65e-04 | 4178.65 ms | 32.3% bf16 MFU | 124410 tok/s step 9312/19560 | loss 3.466236 (+1.16z)| norm 0.2543 (-0.41z)| lr 4.65e-04 | 4194.22 ms | 32.2% bf16 MFU | 124439 tok/s step 9313/19560 | loss 3.523477 (+2.50z)| norm 0.2781 (+0.67z)| lr 4.65e-04 | 4169.60 ms | 32.4% bf16 MFU | 124504 tok/s step 9314/19560 | loss 3.469684 (+1.18z)| norm 0.2546 (-0.41z)| lr 4.65e-04 | 4287.92 ms | 31.5% bf16 MFU | 124393 tok/s step 9315/19560 | loss 3.394965 (-0.62z)| norm 0.2919 (+1.29z)| lr 4.65e-04 | 4173.09 ms | 32.4% bf16 MFU | 124455 tok/s step 9316/19560 | loss 3.468525 (+1.14z)| norm 0.2514 (-0.55z)| lr 4.65e-04 | 4254.51 ms | 31.7% bf16 MFU | 124394 tok/s step 9317/19560 | loss 3.492094 (+1.68z)| norm 0.2755 (+0.54z)| lr 4.65e-04 | 4163.32 ms | 32.4% bf16 MFU | 124470 tok/s step 9318/19560 | loss 3.439808 (+0.43z)| norm 0.2493 (-0.65z)| lr 4.65e-04 | 4182.14 ms | 32.3% bf16 MFU | 124515 tok/s step 9319/19560 | loss 3.415851 (-0.14z)| norm 0.2425 (-0.96z)| lr 4.65e-04 | 4859.74 ms | 27.8% bf16 MFU | 
123683 tok/s step 9320/19560 | loss 3.380229 (-0.98z)| norm 0.2503 (-0.60z)| lr 4.65e-04 | 4170.45 ms | 32.4% bf16 MFU | 123785 tok/s step 9321/19560 | loss 3.402331 (-0.46z)| norm 0.2521 (-0.52z)| lr 4.65e-04 | 4205.41 ms | 32.1% bf16 MFU | 123829 tok/s step 9322/19560 | loss 3.382807 (-0.93z)| norm 0.2542 (-0.42z)| lr 4.65e-04 | 4178.61 ms | 32.3% bf16 MFU | 123911 tok/s step 9323/19560 | loss 3.400219 (-0.51z)| norm 0.2694 (+0.26z)| lr 4.65e-04 | 4394.72 ms | 30.7% bf16 MFU | 123681 tok/s step 9324/19560 | loss 3.450369 (+0.72z)| norm 0.2653 (+0.07z)| lr 4.65e-04 | 4179.91 ms | 32.3% bf16 MFU | 123768 tok/s step 9325/19560 | loss 3.428370 (+0.17z)| norm 0.2437 (-0.91z)| lr 4.64e-04 | 4171.43 ms | 32.4% bf16 MFU | 123864 tok/s step 9326/19560 | loss 3.421589 (-0.00z)| norm 0.2565 (-0.33z)| lr 4.64e-04 | 4230.81 ms | 31.9% bf16 MFU | 123867 tok/s step 9327/19560 | loss 3.454835 (+0.83z)| norm 0.2312 (-1.46z)| lr 4.64e-04 | 4184.23 ms | 32.3% bf16 MFU | 123939 tok/s step 9328/19560 | loss 3.396430 (-0.62z)| norm 0.2596 (-0.17z)| lr 4.64e-04 | 4184.51 ms | 32.3% bf16 MFU | 124006 tok/s step 9329/19560 | loss 3.377283 (-1.08z)| norm 0.2478 (-0.69z)| lr 4.64e-04 | 4187.78 ms | 32.2% bf16 MFU | 124066 tok/s step 9330/19560 | loss 3.409131 (-0.28z)| norm 0.2546 (-0.39z)| lr 4.64e-04 | 4175.08 ms | 32.3% bf16 MFU | 124141 tok/s step 9331/19560 | loss 3.428041 (+0.18z)| norm 0.2349 (-1.27z)| lr 4.64e-04 | 4192.70 ms | 32.2% bf16 MFU | 124187 tok/s step 9332/19560 | loss 3.423638 (+0.08z)| norm 0.2361 (-1.20z)| lr 4.64e-04 | 4172.65 ms | 32.4% bf16 MFU | 124260 tok/s step 9333/19560 | loss 3.426102 (+0.15z)| norm 0.2326 (-1.33z)| lr 4.64e-04 | 4452.49 ms | 30.3% bf16 MFU | 123934 tok/s step 9334/19560 | loss 3.402488 (-0.46z)| norm 0.2416 (-0.92z)| lr 4.64e-04 | 4194.60 ms | 32.2% bf16 MFU | 123987 tok/s step 9335/19560 | loss 3.398974 (-0.54z)| norm 0.2332 (-1.27z)| lr 4.64e-04 | 4210.76 ms | 32.1% bf16 MFU | 124013 tok/s step 9336/19560 | loss 3.354093 (-1.65z)| norm 
0.2358 (-1.14z)| lr 4.64e-04 | 4174.80 ms | 32.3% bf16 MFU | 124092 tok/s step 9337/19560 | loss 3.441990 (+0.55z)| norm 0.2108 (-2.19z)| lr 4.64e-04 | 4176.72 ms | 32.3% bf16 MFU | 124164 tok/s step 9338/19560 | loss 3.439163 (+0.48z)| norm 0.2518 (-0.40z)| lr 4.64e-04 | 4172.88 ms | 32.4% bf16 MFU | 124237 tok/s step 9339/19560 | loss 3.392198 (-0.70z)| norm 0.2530 (-0.34z)| lr 4.64e-04 | 4264.87 ms | 31.7% bf16 MFU | 124172 tok/s step 9340/19560 | loss 3.498250 (+1.95z)| norm 0.2617 (+0.04z)| lr 4.63e-04 | 4179.12 ms | 32.3% bf16 MFU | 124236 tok/s step 9341/19560 | loss 3.484813 (+1.63z)| norm 0.2761 (+0.67z)| lr 4.63e-04 | 4187.85 ms | 32.2% bf16 MFU | 124284 tok/s step 9342/19560 | loss 3.390921 (-0.72z)| norm 0.2282 (-1.40z)| lr 4.63e-04 | 4193.27 ms | 32.2% bf16 MFU | 124321 tok/s step 9343/19560 | loss 3.454055 (+0.86z)| norm 0.3029 (+1.80z)| lr 4.63e-04 | 4193.64 ms | 32.2% bf16 MFU | 124356 tok/s step 9344/19560 | loss 3.453938 (+0.85z)| norm 0.2854 (+1.04z)| lr 4.63e-04 | 4204.80 ms | 32.1% bf16 MFU | 124373 tok/s step 9345/19560 | loss 3.358952 (-1.51z)| norm 0.2573 (-0.16z)| lr 4.63e-04 | 4208.11 ms | 32.1% bf16 MFU | 124384 tok/s step 9346/19560 | loss 3.422225 (+0.07z)| norm 0.2480 (-0.56z)| lr 4.63e-04 | 4694.98 ms | 28.8% bf16 MFU | 123748 tok/s step 9347/19560 | loss 3.516335 (+2.36z)| norm 0.2633 (+0.10z)| lr 4.63e-04 | 4169.68 ms | 32.4% bf16 MFU | 123848 tok/s step 9348/19560 | loss 3.429626 (+0.22z)| norm 0.2715 (+0.44z)| lr 4.63e-04 | 4195.34 ms | 32.2% bf16 MFU | 123904 tok/s step 9349/19560 | loss 3.378557 (-1.03z)| norm 0.2528 (-0.36z)| lr 4.63e-04 | 4191.27 ms | 32.2% bf16 MFU | 123963 tok/s step 9350/19560 | loss 3.439186 (+0.47z)| norm 0.2665 (+0.22z)| lr 4.63e-04 | 4202.19 ms | 32.1% bf16 MFU | 124003 tok/s step 9351/19560 | loss 3.452643 (+0.80z)| norm 0.2630 (+0.07z)| lr 4.63e-04 | 4185.19 ms | 32.3% bf16 MFU | 124067 tok/s step 9352/19560 | loss 3.431135 (+0.25z)| norm 0.2584 (-0.14z)| lr 4.63e-04 | 4211.42 ms | 32.1% bf16 MFU | 
124088 tok/s step 9353/19560 | loss 3.397278 (-0.62z)| norm 0.2716 (+0.43z)| lr 4.63e-04 | 4189.23 ms | 32.2% bf16 MFU | 124141 tok/s step 9354/19560 | loss 3.379669 (-1.06z)| norm 0.2395 (-0.94z)| lr 4.63e-04 | 4171.86 ms | 32.4% bf16 MFU | 124218 tok/s step 9355/19560 | loss 3.431385 (+0.27z)| norm 0.2498 (-0.51z)| lr 4.62e-04 | 4182.92 ms | 32.3% bf16 MFU | 124274 tok/s step 9356/19560 | loss 3.415861 (-0.14z)| norm 0.2431 (-0.80z)| lr 4.62e-04 | 4213.99 ms | 32.0% bf16 MFU | 124281 tok/s step 9357/19560 | loss 3.440065 (+0.48z)| norm 0.2515 (-0.42z)| lr 4.62e-04 | 4347.94 ms | 31.1% bf16 MFU | 124096 tok/s step 9358/19560 | loss 3.477404 (+1.43z)| norm 0.2344 (-1.17z)| lr 4.62e-04 | 4172.70 ms | 32.4% bf16 MFU | 124174 tok/s step 9359/19560 | loss 3.342838 (-2.02z)| norm 0.2507 (-0.43z)| lr 4.62e-04 | 4176.76 ms | 32.3% bf16 MFU | 124241 tok/s step 9360/19560 | loss 3.487661 (+1.70z)| norm 0.2454 (-0.65z)| lr 4.62e-04 | 4182.49 ms | 32.3% bf16 MFU | 124297 tok/s step 9361/19560 | loss 3.452991 (+0.80z)| norm 0.2420 (-0.81z)| lr 4.62e-04 | 4177.64 ms | 32.3% bf16 MFU | 124357 tok/s step 9362/19560 | loss 3.485861 (+1.62z)| norm 0.2434 (-0.73z)| lr 4.62e-04 | 4173.23 ms | 32.4% bf16 MFU | 124420 tok/s step 9363/19560 | loss 3.329084 (-2.32z)| norm 0.2353 (-1.09z)| lr 4.62e-04 | 4183.53 ms | 32.3% bf16 MFU | 124466 tok/s step 9364/19560 | loss 3.394384 (-0.69z)| norm 0.2576 (-0.08z)| lr 4.62e-04 | 4171.99 ms | 32.4% bf16 MFU | 124526 tok/s step 9365/19560 | loss 3.425951 (+0.10z)| norm 0.2227 (-1.62z)| lr 4.62e-04 | 4181.44 ms | 32.3% bf16 MFU | 124569 tok/s step 9366/19560 | loss 3.396822 (-0.62z)| norm 0.2494 (-0.43z)| lr 4.62e-04 | 4207.16 ms | 32.1% bf16 MFU | 124571 tok/s step 9367/19560 | loss 3.418607 (-0.07z)| norm 0.2471 (-0.53z)| lr 4.62e-04 | 4188.62 ms | 32.2% bf16 MFU | 124601 tok/s step 9368/19560 | loss 3.394647 (-0.68z)| norm 0.2417 (-0.77z)| lr 4.62e-04 | 4178.80 ms | 32.3% bf16 MFU | 124644 tok/s step 9369/19560 | loss 3.320538 (-2.47z)| norm 
0.2370 (-0.97z)| lr 4.62e-04 | 4179.81 ms | 32.3% bf16 MFU | 124684 tok/s step 9370/19560 | loss 3.415018 (-0.15z)| norm 0.2362 (-0.99z)| lr 4.61e-04 | 4197.41 ms | 32.2% bf16 MFU | 124695 tok/s step 9371/19560 | loss 3.492104 (+1.73z)| norm 0.2425 (-0.71z)| lr 4.61e-04 | 4276.78 ms | 31.6% bf16 MFU | 124590 tok/s step 9372/19560 | loss 3.505387 (+2.00z)| norm 0.2715 (+0.57z)| lr 4.61e-04 | 4181.40 ms | 32.3% bf16 MFU | 124629 tok/s step 9373/19560 | loss 3.466259 (+1.05z)| norm 0.2623 (+0.16z)| lr 4.61e-04 | 4186.18 ms | 32.3% bf16 MFU | 124660 tok/s step 9374/19560 | loss 3.408882 (-0.34z)| norm 0.2697 (+0.48z)| lr 4.61e-04 | 4171.44 ms | 32.4% bf16 MFU | 124711 tok/s step 9375/19560 | loss 3.446974 (+0.58z)| norm 0.2758 (+0.74z)| lr 4.61e-04 | 4183.60 ms | 32.3% bf16 MFU | 124742 tok/s step 9376/19560 | loss 3.397880 (-0.60z)| norm 0.2704 (+0.49z)| lr 4.61e-04 | 4178.18 ms | 32.3% bf16 MFU | 124779 tok/s step 9377/19560 | loss 3.410568 (-0.29z)| norm 0.2585 (-0.03z)| lr 4.61e-04 | 4330.43 ms | 31.2% bf16 MFU | 124593 tok/s step 9378/19560 | loss 3.405171 (-0.41z)| norm 0.2623 (+0.14z)| lr 4.61e-04 | 4169.87 ms | 32.4% bf16 MFU | 124650 tok/s step 9379/19560 | loss 3.360442 (-1.49z)| norm 0.2526 (-0.29z)| lr 4.61e-04 | 4185.25 ms | 32.3% bf16 MFU | 124681 tok/s step 9380/19560 | loss 3.446456 (+0.62z)| norm 0.2560 (-0.14z)| lr 4.61e-04 | 4174.02 ms | 32.3% bf16 MFU | 124728 tok/s step 9381/19560 | loss 3.398067 (-0.56z)| norm 0.2524 (-0.29z)| lr 4.61e-04 | 4186.47 ms | 32.3% bf16 MFU | 124753 tok/s step 9382/19560 | loss 3.451033 (+0.74z)| norm 0.2242 (-1.53z)| lr 4.61e-04 | 4167.16 ms | 32.4% bf16 MFU | 124806 tok/s step 9383/19560 | loss 3.389522 (-0.77z)| norm 0.2379 (-0.91z)| lr 4.61e-04 | 4176.00 ms | 32.3% bf16 MFU | 124843 tok/s step 9384/19560 | loss 3.431504 (+0.27z)| norm 0.2435 (-0.66z)| lr 4.60e-04 | 4175.76 ms | 32.3% bf16 MFU | 124879 tok/s step 9385/19560 | loss 3.413470 (-0.18z)| norm 0.2368 (-0.94z)| lr 4.60e-04 | 4176.49 ms | 32.3% bf16 MFU | 
124911 tok/s step 9386/19560 | loss 3.469918 (+1.19z)| norm 0.2638 (+0.25z)| lr 4.60e-04 | 4172.68 ms | 32.4% bf16 MFU | 124948 tok/s step 9387/19560 | loss 3.492623 (+1.78z)| norm 0.3304 (+3.03z)| lr 4.60e-04 | 4178.37 ms | 32.3% bf16 MFU | 124975 tok/s step 9388/19560 | loss 3.365954 (-1.39z)| norm 0.3343 (+3.05z)| lr 4.60e-04 | 4346.67 ms | 31.1% bf16 MFU | 124757 tok/s step 9389/19560 | loss 3.360966 (-1.49z)| norm 0.3145 (+2.18z)| lr 4.60e-04 | 4183.80 ms | 32.3% bf16 MFU | 124785 tok/s step 9390/19560 | loss 3.465965 (+1.11z)| norm 0.2756 (+0.62z)| lr 4.60e-04 | 4175.85 ms | 32.3% bf16 MFU | 124823 tok/s step 9391/19560 | loss 3.412282 (-0.22z)| norm 0.2893 (+1.15z)| lr 4.60e-04 | 4183.20 ms | 32.3% bf16 MFU | 124848 tok/s step 9392/19560 | loss 3.436573 (+0.38z)| norm 0.2861 (+1.01z)| lr 4.60e-04 | 4210.28 ms | 32.1% bf16 MFU | 124832 tok/s step 9393/19560 | loss 3.404229 (-0.43z)| norm 0.2802 (+0.77z)| lr 4.60e-04 | 4342.87 ms | 31.1% bf16 MFU | 124627 tok/s step 9394/19560 | loss 3.422620 (+0.03z)| norm 0.2731 (+0.48z)| lr 4.60e-04 | 4175.90 ms | 32.3% bf16 MFU | 124673 tok/s step 9395/19560 | loss 3.366720 (-1.34z)| norm 0.2531 (-0.33z)| lr 4.60e-04 | 4183.52 ms | 32.3% bf16 MFU | 124706 tok/s step 9396/19560 | loss 3.356487 (-1.57z)| norm 0.2602 (-0.05z)| lr 4.60e-04 | 4196.80 ms | 32.2% bf16 MFU | 124717 tok/s step 9397/19560 | loss 3.358481 (-1.50z)| norm 0.2509 (-0.43z)| lr 4.60e-04 | 4224.62 ms | 32.0% bf16 MFU | 124686 tok/s step 9398/19560 | loss 3.307261 (-2.65z)| norm 0.2421 (-0.83z)| lr 4.60e-04 | 4191.55 ms | 32.2% bf16 MFU | 124706 tok/s step 9399/19560 | loss 3.424131 (+0.14z)| norm 0.2417 (-0.83z)| lr 4.59e-04 | 4246.16 ms | 31.8% bf16 MFU | 124644 tok/s step 9400/19560 | loss 3.414005 (-0.10z)| norm 0.2633 (+0.18z)| lr 4.59e-04 | 4173.97 ms | 32.3% bf16 MFU | 124692 tok/s step 9401/19560 | loss 3.392907 (-0.60z)| norm 0.2537 (-0.25z)| lr 4.59e-04 | 4226.99 ms | 31.9% bf16 MFU | 124659 tok/s step 9402/19560 | loss 3.520465 (+2.39z)| norm 
0.2850 (+1.31z)| lr 4.59e-04 | 4266.93 ms | 31.6% bf16 MFU | 124570 tok/s step 9403/19560 | loss 3.410923 (-0.17z)| norm 0.2670 (+0.41z)| lr 4.59e-04 | 4177.71 ms | 32.3% bf16 MFU | 124616 tok/s step 9404/19560 | loss 3.377582 (-0.96z)| norm 0.2459 (-0.62z)| lr 4.59e-04 | 4181.10 ms | 32.3% bf16 MFU | 124655 tok/s step 9405/19560 | loss 3.427469 (+0.21z)| norm 0.2728 (+0.71z)| lr 4.59e-04 | 4171.55 ms | 32.4% bf16 MFU | 124707 tok/s step 9406/19560 | loss 3.433761 (+0.35z)| norm 0.2524 (-0.31z)| lr 4.59e-04 | 4176.31 ms | 32.3% bf16 MFU | 124748 tok/s step 9407/19560 | loss 3.413642 (-0.13z)| norm 0.2491 (-0.47z)| lr 4.59e-04 | 4189.17 ms | 32.2% bf16 MFU | 124768 tok/s step 9408/19560 | loss 3.414933 (-0.08z)| norm 0.2483 (-0.51z)| lr 4.59e-04 | 4177.87 ms | 32.3% bf16 MFU | 124805 tok/s step 9409/19560 | loss 3.387717 (-0.73z)| norm 0.2425 (-0.79z)| lr 4.59e-04 | 4182.40 ms | 32.3% bf16 MFU | 124832 tok/s step 9410/19560 | loss 3.492801 (+1.76z)| norm 0.2664 (+0.42z)| lr 4.59e-04 | 4294.56 ms | 31.4% bf16 MFU | 124695 tok/s step 9411/19560 | loss 3.472363 (+1.26z)| norm 0.2373 (-1.04z)| lr 4.59e-04 | 4179.11 ms | 32.3% bf16 MFU | 124733 tok/s step 9412/19560 | loss 3.405198 (-0.31z)| norm 0.2761 (+0.90z)| lr 4.59e-04 | 4178.16 ms | 32.3% bf16 MFU | 124770 tok/s step 9413/19560 | loss 3.380546 (-0.89z)| norm 0.2585 (+0.01z)| lr 4.59e-04 | 4173.93 ms | 32.3% bf16 MFU | 124812 tok/s step 9414/19560 | loss 3.347047 (-1.65z)| norm 0.2634 (+0.25z)| lr 4.58e-04 | 4180.04 ms | 32.3% bf16 MFU | 124843 tok/s step 9415/19560 | loss 3.497451 (+1.82z)| norm 0.4795 (+7.90z)| lr 4.58e-04 | 4179.72 ms | 32.3% bf16 MFU | 124873 tok/s step 9416/19560 | loss 3.421864 (+0.07z)| norm 0.3878 (+4.23z)| lr 4.58e-04 | 4192.45 ms | 32.2% bf16 MFU | 124882 tok/s step 9417/19560 | loss 3.521617 (+2.32z)| norm 0.2924 (+1.02z)| lr 4.58e-04 | 4175.60 ms | 32.3% bf16 MFU | 124916 tok/s step 9418/19560 | loss 3.334091 (-1.92z)| norm 0.3358 (+2.40z)| lr 4.58e-04 | 4189.82 ms | 32.2% bf16 MFU | 
124927 tok/s step 9419/19560 | loss 3.385936 (-0.76z)| norm 0.2850 (+0.74z)| lr 4.58e-04 | 4228.41 ms | 31.9% bf16 MFU | 124880 tok/s step 9420/19560 | loss 3.394295 (-0.57z)| norm 0.3157 (+1.72z)| lr 4.58e-04 | 4164.95 ms | 32.4% bf16 MFU | 124930 tok/s step 9421/19560 | loss 3.477033 (+1.28z)| norm 0.2616 (-0.02z)| lr 4.58e-04 | 4205.55 ms | 32.1% bf16 MFU | 124917 tok/s step 9422/19560 | loss 3.397422 (-0.50z)| norm 0.2836 (+0.69z)| lr 4.58e-04 | 4240.39 ms | 31.8% bf16 MFU | 124853 tok/s step 9423/19560 | loss 3.438882 (+0.42z)| norm 0.2613 (-0.02z)| lr 4.58e-04 | 4194.51 ms | 32.2% bf16 MFU | 124860 tok/s step 9424/19560 | loss 3.548508 (+2.77z)| norm 0.5368 (+6.96z)| lr 4.58e-04 | 4180.75 ms | 32.3% bf16 MFU | 124887 tok/s step 9425/19560 | loss 3.397584 (-0.51z)| norm 0.3351 (+1.79z)| lr 4.58e-04 | 4173.48 ms | 32.4% bf16 MFU | 124924 tok/s step 9426/19560 | loss 3.428464 (+0.16z)| norm 0.2860 (+0.54z)| lr 4.58e-04 | 4243.60 ms | 31.8% bf16 MFU | 124855 tok/s step 9427/19560 | loss 3.481725 (+1.31z)| norm 0.3071 (+1.07z)| lr 4.58e-04 | 4214.66 ms | 32.0% bf16 MFU | 124832 tok/s step 9428/19560 | loss 3.550246 (+2.70z)| norm 0.2746 (+0.24z)| lr 4.58e-04 | 4215.37 ms | 32.0% bf16 MFU | 124809 tok/s step 9429/19560 | loss 3.456591 (+0.71z)| norm 0.2691 (+0.10z)| lr 4.57e-04 | 4172.11 ms | 32.4% bf16 MFU | 124852 tok/s step 9430/19560 | loss 3.458375 (+0.73z)| norm 0.2633 (-0.04z)| lr 4.57e-04 | 4190.28 ms | 32.2% bf16 MFU | 124866 tok/s step 9431/19560 | loss 3.402438 (-0.45z)| norm 0.2768 (+0.29z)| lr 4.57e-04 | 4169.87 ms | 32.4% bf16 MFU | 124909 tok/s step 9432/19560 | loss 3.430252 (+0.15z)| norm 0.2464 (-0.48z)| lr 4.57e-04 | 4165.37 ms | 32.4% bf16 MFU | 124957 tok/s step 9433/19560 | loss 3.466593 (+0.90z)| norm 0.2649 (-0.01z)| lr 4.57e-04 | 4177.24 ms | 32.3% bf16 MFU | 124985 tok/s step 9434/19560 | loss 3.449238 (+0.54z)| norm 0.2545 (-0.27z)| lr 4.57e-04 | 4181.00 ms | 32.3% bf16 MFU | 125005 tok/s step 9435/19560 | loss 3.484611 (+1.27z)| norm 
0.2460 (-0.48z)| lr 4.57e-04 | 4174.74 ms | 32.3% bf16 MFU | 125034 tok/s step 9436/19560 | loss 3.477065 (+1.10z)| norm 0.2658 (+0.02z)| lr 4.57e-04 | 4193.79 ms | 32.2% bf16 MFU | 125033 tok/s step 9437/19560 | loss 3.441998 (+0.35z)| norm 0.2351 (-0.75z)| lr 4.57e-04 | 4296.11 ms | 31.4% bf16 MFU | 124884 tok/s step 9438/19560 | loss 3.419328 (-0.14z)| norm 0.2656 (+0.01z)| lr 4.57e-04 | 4182.71 ms | 32.3% bf16 MFU | 124907 tok/s step 9439/19560 | loss 3.403875 (-0.47z)| norm 0.2308 (-0.85z)| lr 4.57e-04 | 4177.24 ms | 32.3% bf16 MFU | 124937 tok/s step 9440/19560 | loss 3.449954 (+0.52z)| norm 0.2426 (-0.55z)| lr 4.57e-04 | 4187.04 ms | 32.2% bf16 MFU | 124951 tok/s step 9441/19560 | loss 3.401657 (-0.50z)| norm 0.2505 (-0.35z)| lr 4.57e-04 | 4190.54 ms | 32.2% bf16 MFU | 124959 tok/s step 9442/19560 | loss 3.455692 (+0.67z)| norm 0.2377 (-0.67z)| lr 4.57e-04 | 4183.13 ms | 32.3% bf16 MFU | 124978 tok/s step 9443/19560 | loss 3.352102 (-1.56z)| norm 0.2521 (-0.30z)| lr 4.57e-04 | 4196.52 ms | 32.2% bf16 MFU | 124976 tok/s step 9444/19560 | loss 3.400988 (-0.50z)| norm 0.2553 (-0.22z)| lr 4.56e-04 | 4177.57 ms | 32.3% bf16 MFU | 125002 tok/s step 9445/19560 | loss 3.371847 (-1.11z)| norm 0.2358 (-0.70z)| lr 4.56e-04 | 4189.47 ms | 32.2% bf16 MFU | 125009 tok/s step 9446/19560 | loss 3.441390 (+0.40z)| norm 0.2648 (+0.02z)| lr 4.56e-04 | 4188.34 ms | 32.2% bf16 MFU | 125017 tok/s step 9447/19560 | loss 3.405365 (-0.38z)| norm 0.2408 (-0.58z)| lr 4.56e-04 | 4183.46 ms | 32.3% bf16 MFU | 125033 tok/s step 9448/19560 | loss 3.425659 (+0.05z)| norm 0.2393 (-0.61z)| lr 4.56e-04 | 4183.40 ms | 32.3% bf16 MFU | 125047 tok/s step 9449/19560 | loss 3.380526 (-0.93z)| norm 0.2447 (-0.48z)| lr 4.56e-04 | 4180.67 ms | 32.3% bf16 MFU | 125065 tok/s step 9450/19560 | loss 3.368962 (-1.17z)| norm 0.2623 (-0.04z)| lr 4.56e-04 | 4688.16 ms | 28.8% bf16 MFU | 124404 tok/s step 9451/19560 | loss 3.399338 (-0.51z)| norm 0.2253 (-0.95z)| lr 4.56e-04 | 4471.79 ms | 30.2% bf16 MFU | 
124046 tok/s step 9452/19560 | loss 3.389517 (-0.72z)| norm 0.2626 (-0.02z)| lr 4.56e-04 | 4365.84 ms | 30.9% bf16 MFU | 123848 tok/s step 9453/19560 | loss 3.409015 (-0.29z)| norm 0.2622 (-0.03z)| lr 4.56e-04 | 4298.64 ms | 31.4% bf16 MFU | 123754 tok/s step 9454/19560 | loss 3.436721 (+0.31z)| norm 0.2520 (-0.29z)| lr 4.56e-04 | 4170.93 ms | 32.4% bf16 MFU | 123851 tok/s step 9455/19560 | loss 3.433532 (+0.24z)| norm 0.2676 (+0.09z)| lr 4.56e-04 | 4286.10 ms | 31.5% bf16 MFU | 123775 tok/s step 9456/19560 | loss 3.438702 (+0.35z)| norm 0.2328 (-0.77z)| lr 4.56e-04 | 4218.40 ms | 32.0% bf16 MFU | 123800 tok/s step 9457/19560 | loss 3.417002 (-0.13z)| norm 0.2784 (+0.36z)| lr 4.56e-04 | 4177.27 ms | 32.3% bf16 MFU | 123886 tok/s step 9458/19560 | loss 3.460338 (+0.81z)| norm 0.2724 (+0.21z)| lr 4.56e-04 | 4233.83 ms | 31.9% bf16 MFU | 123883 tok/s step 9459/19560 | loss 3.499650 (+1.63z)| norm 0.2667 (+0.06z)| lr 4.55e-04 | 4217.63 ms | 32.0% bf16 MFU | 123904 tok/s step 9460/19560 | loss 3.410001 (-0.30z)| norm 0.2510 (-0.33z)| lr 4.55e-04 | 4242.88 ms | 31.8% bf16 MFU | 123888 tok/s step 9461/19560 | loss 3.424229 (+0.01z)| norm 0.2951 (+0.76z)| lr 4.55e-04 | 4173.67 ms | 32.3% bf16 MFU | 123974 tok/s step 9462/19560 | loss 3.350473 (-1.56z)| norm 0.2643 (-0.02z)| lr 4.55e-04 | 4170.76 ms | 32.4% bf16 MFU | 124061 tok/s step 9463/19560 | loss 3.422931 (-0.01z)| norm 0.2435 (-0.54z)| lr 4.55e-04 | 4161.49 ms | 32.4% bf16 MFU | 124157 tok/s step 9464/19560 | loss 3.436947 (+0.27z)| norm 0.2795 (+0.35z)| lr 4.55e-04 | 4186.02 ms | 32.3% bf16 MFU | 124211 tok/s step 9465/19560 | loss 3.396673 (-0.58z)| norm 0.2348 (-0.78z)| lr 4.55e-04 | 4205.83 ms | 32.1% bf16 MFU | 124234 tok/s step 9466/19560 | loss 3.401137 (-0.48z)| norm 0.2471 (-0.47z)| lr 4.55e-04 | 4197.71 ms | 32.2% bf16 MFU | 124267 tok/s step 9467/19560 | loss 3.328837 (-2.00z)| norm 0.2468 (-0.47z)| lr 4.55e-04 | 4192.91 ms | 32.2% bf16 MFU | 124306 tok/s step 9468/19560 | loss 3.428181 (+0.12z)| norm 
0.2887 (+0.58z)| lr 4.55e-04 | 4179.77 ms | 32.3% bf16 MFU | 124362 tok/s step 9469/19560 | loss 3.402575 (-0.42z)| norm 0.3036 (+0.94z)| lr 4.55e-04 | 4171.56 ms | 32.4% bf16 MFU | 124428 tok/s step 9470/19560 | loss 3.365026 (-1.22z)| norm 0.2691 (+0.07z)| lr 4.55e-04 | 4176.45 ms | 32.3% bf16 MFU | 124483 tok/s step 9471/19560 | loss 3.386002 (-0.75z)| norm 0.2703 (+0.11z)| lr 4.55e-04 | 4164.80 ms | 32.4% bf16 MFU | 124554 tok/s step 9472/19560 | loss 3.406311 (-0.31z)| norm 0.2895 (+0.59z)| lr 4.55e-04 | 4205.34 ms | 32.1% bf16 MFU | 124559 tok/s step 9473/19560 | loss 3.479804 (+1.25z)| norm 0.2528 (-0.34z)| lr 4.55e-04 | 4170.50 ms | 32.4% bf16 MFU | 124617 tok/s step 9474/19560 | loss 3.527412 (+2.21z)| norm 0.2565 (-0.24z)| lr 4.54e-04 | 4162.42 ms | 32.4% bf16 MFU | 124684 tok/s step 9475/19560 | loss 3.411190 (-0.23z)| norm 0.2489 (-0.43z)| lr 4.54e-04 | 4171.98 ms | 32.4% bf16 MFU | 124733 tok/s step 9476/19560 | loss 3.419940 (-0.04z)| norm 0.2439 (-0.55z)| lr 4.54e-04 | 4196.72 ms | 32.2% bf16 MFU | 124743 tok/s step 9477/19560 | loss 3.466629 (+0.95z)| norm 0.2485 (-0.44z)| lr 4.54e-04 | 4174.54 ms | 32.3% bf16 MFU | 124786 tok/s step 9478/19560 | loss 3.497301 (+1.58z)| norm 0.2516 (-0.35z)| lr 4.54e-04 | 4174.92 ms | 32.3% bf16 MFU | 124825 tok/s step 9479/19560 | loss 3.494973 (+1.51z)| norm 0.2362 (-0.74z)| lr 4.54e-04 | 4163.50 ms | 32.4% bf16 MFU | 124880 tok/s step 9480/19560 | loss 3.412244 (-0.23z)| norm 0.2465 (-0.47z)| lr 4.54e-04 | 4170.98 ms | 32.4% bf16 MFU | 124921 tok/s step 9481/19560 | loss 3.339894 (-1.72z)| norm 0.2526 (-0.32z)| lr 4.54e-04 | 4174.60 ms | 32.3% bf16 MFU | 124955 tok/s step 9482/19560 | loss 3.412896 (-0.21z)| norm 0.2527 (-0.32z)| lr 4.54e-04 | 4170.27 ms | 32.4% bf16 MFU | 124993 tok/s step 9483/19560 | loss 3.375283 (-0.98z)| norm 0.2691 (+0.09z)| lr 4.54e-04 | 4169.41 ms | 32.4% bf16 MFU | 125031 tok/s step 9484/19560 | loss 3.439600 (+0.36z)| norm 0.2559 (-0.24z)| lr 4.54e-04 | 4179.04 ms | 32.3% bf16 MFU | 
125052 tok/s step 9485/19560 | loss 3.391206 (-0.64z)| norm 0.2631 (-0.07z)| lr 4.54e-04 | 4177.29 ms | 32.3% bf16 MFU | 125075 tok/s step 9486/19560 | loss 3.481875 (+1.24z)| norm 0.2647 (-0.03z)| lr 4.54e-04 | 4172.91 ms | 32.4% bf16 MFU | 125103 tok/s step 9487/19560 | loss 3.374842 (-1.00z)| norm 0.2657 (-0.01z)| lr 4.54e-04 | 4169.13 ms | 32.4% bf16 MFU | 125136 tok/s step 9488/19560 | loss 3.473935 (+1.08z)| norm 0.2869 (+0.52z)| lr 4.54e-04 | 4173.15 ms | 32.4% bf16 MFU | 125161 tok/s step 9489/19560 | loss 3.441056 (+0.40z)| norm 0.2481 (-0.46z)| lr 4.53e-04 | 4169.66 ms | 32.4% bf16 MFU | 125189 tok/s step 9490/19560 | loss 3.417600 (-0.09z)| norm 0.2751 (+0.21z)| lr 4.53e-04 | 4188.58 ms | 32.2% bf16 MFU | 125189 tok/s step 9491/19560 | loss 3.445629 (+0.49z)| norm 0.2586 (-0.21z)| lr 4.53e-04 | 4182.46 ms | 32.3% bf16 MFU | 125197 tok/s step 9492/19560 | loss 3.456935 (+0.73z)| norm 0.2394 (-0.69z)| lr 4.53e-04 | 4165.09 ms | 32.4% bf16 MFU | 125231 tok/s step 9493/19560 | loss 3.454454 (+0.67z)| norm 0.2644 (-0.07z)| lr 4.53e-04 | 4174.24 ms | 32.3% bf16 MFU | 125249 tok/s step 9494/19560 | loss 3.331158 (-1.93z)| norm 0.2480 (-0.48z)| lr 4.53e-04 | 4175.12 ms | 32.3% bf16 MFU | 125266 tok/s step 9495/19560 | loss 3.397160 (-0.54z)| norm 0.2284 (-0.98z)| lr 4.53e-04 | 4170.52 ms | 32.4% bf16 MFU | 125288 tok/s step 9496/19560 | loss 3.423502 (+0.01z)| norm 0.2568 (-0.26z)| lr 4.53e-04 | 4165.24 ms | 32.4% bf16 MFU | 125317 tok/s step 9497/19560 | loss 3.423489 (-0.00z)| norm 0.2370 (-0.76z)| lr 4.53e-04 | 4191.18 ms | 32.2% bf16 MFU | 125306 tok/s step 9498/19560 | loss 3.520536 (+2.03z)| norm 0.2611 (-0.16z)| lr 4.53e-04 | 4171.69 ms | 32.4% bf16 MFU | 125324 tok/s step 9499/19560 | loss 3.380831 (-0.91z)| norm 0.2552 (-0.31z)| lr 4.53e-04 | 4177.72 ms | 32.3% bf16 MFU | 125333 tok/s step 9500/19560 | loss 3.407307 (-0.33z)| norm 0.2499 (-0.44z)| lr 4.53e-04 | 4165.56 ms | 32.4% bf16 MFU | 125360 tok/s val loss 3.394617 evaluating HellaSwag: 0/1256 
evaluating HellaSwag: 1250/1256 HellaSwag: 2929/10042 = 0.291675 step 9501/19560 | loss 3.414526 (-0.17z)| norm 0.2486 (-0.47z)| lr 
4.53e-04 | 4168.96 ms | 32.4% bf16 MFU | 125380 tok/s step 9502/19560 | loss 3.363335 (-1.26z)| norm 0.2377 (-0.74z)| lr 4.53e-04 | 4183.94 ms | 32.3% bf16 MFU | 125376 tok/s step 9503/19560 | loss 3.416828 (-0.11z)| norm 0.2506 (-0.41z)| lr 4.53e-04 | 4178.14 ms | 32.3% bf16 MFU | 125381 tok/s step 9504/19560 | loss 3.433271 (+0.24z)| norm 0.2438 (-0.57z)| lr 4.52e-04 | 4175.91 ms | 32.3% bf16 MFU | 125390 tok/s step 9505/19560 | loss 3.559874 (+2.84z)| norm 0.2616 (-0.12z)| lr 4.52e-04 | 4164.41 ms | 32.4% bf16 MFU | 125415 tok/s step 9506/19560 | loss 3.417487 (-0.12z)| norm 0.2668 (+0.01z)| lr 4.52e-04 | 4176.58 ms | 32.3% bf16 MFU | 125421 tok/s step 9507/19560 | loss 3.410854 (-0.27z)| norm 0.2695 (+0.08z)| lr 4.52e-04 | 4182.54 ms | 32.3% bf16 MFU | 125418 tok/s step 9508/19560 | loss 3.427314 (+0.08z)| norm 0.2582 (-0.21z)| lr 4.52e-04 | 4169.62 ms | 32.4% bf16 MFU | 125434 tok/s step 9509/19560 | loss 3.408258 (-0.32z)| norm 0.2730 (+0.16z)| lr 4.52e-04 | 4162.72 ms | 32.4% bf16 MFU | 125459 tok/s step 9510/19560 | loss 3.394292 (-0.61z)| norm 0.2462 (-0.53z)| lr 4.52e-04 | 4184.35 ms | 32.3% bf16 MFU | 125451 tok/s step 9511/19560 | loss 3.343212 (-1.66z)| norm 0.2783 (+0.28z)| lr 4.52e-04 | 4170.70 ms | 32.4% bf16 MFU | 125464 tok/s step 9512/19560 | loss 3.406322 (-0.34z)| norm 0.2346 (-0.83z)| lr 4.52e-04 | 4165.55 ms | 32.4% bf16 MFU | 125484 tok/s step 9513/19560 | loss 3.394655 (-0.58z)| norm 0.2573 (-0.25z)| lr 4.52e-04 | 4168.10 ms | 32.4% bf16 MFU | 125499 tok/s step 9514/19560 | loss 3.464160 (+0.87z)| norm 0.5060 (+5.34z)| lr 4.52e-04 | 4215.60 ms | 32.0% bf16 MFU | 125443 tok/s step 9515/19560 | loss 3.442605 (+0.43z)| norm 0.2905 (+0.49z)| lr 4.52e-04 | 4176.48 ms | 32.3% bf16 MFU | 125447 tok/s step 9516/19560 | loss 3.412954 (-0.20z)| norm 0.2759 (+0.17z)| lr 4.52e-04 | 4178.56 ms | 32.3% bf16 MFU | 125448 tok/s step 9517/19560 | loss 3.385161 (-0.79z)| norm 0.2808 (+0.29z)| lr 4.52e-04 | 4179.97 ms | 32.3% bf16 MFU | 125447 tok/s step 
9518/19560 | loss 3.492052 (+1.46z)| norm 0.2641 (-0.09z)| lr 4.52e-04 | 4174.40 ms | 32.3% bf16 MFU | 125455 tok/s step 9519/19560 | loss 3.481984 (+1.22z)| norm 0.2872 (+0.44z)| lr 4.51e-04 | 4170.33 ms | 32.4% bf16 MFU | 125468 tok/s step 9520/19560 | loss 3.462358 (+0.81z)| norm 0.2672 (-0.02z)| lr 4.51e-04 | 4180.47 ms | 32.3% bf16 MFU | 125465 tok/s step 9521/19560 | loss 3.412825 (-0.23z)| norm 0.2734 (+0.13z)| lr 4.51e-04 | 4170.23 ms | 32.4% bf16 MFU | 125478 tok/s step 9522/19560 | loss 3.359975 (-1.31z)| norm 0.2343 (-0.76z)| lr 4.51e-04 | 4177.53 ms | 32.3% bf16 MFU | 125479 tok/s step 9523/19560 | loss 3.465237 (+0.86z)| norm 0.2456 (-0.50z)| lr 4.51e-04 | 4176.15 ms | 32.3% bf16 MFU | 125482 tok/s step 9524/19560 | loss 3.445138 (+0.43z)| norm 0.2545 (-0.30z)| lr 4.51e-04 | 4165.81 ms | 32.4% bf16 MFU | 125501 tok/s step 9525/19560 | loss 3.405951 (-0.40z)| norm 0.2654 (-0.05z)| lr 4.51e-04 | 4182.60 ms | 32.3% bf16 MFU | 125494 tok/s step 9526/19560 | loss 3.408655 (-0.37z)| norm 0.2332 (-0.78z)| lr 4.51e-04 | 4185.29 ms | 32.3% bf16 MFU | 125482 tok/s step 9527/19560 | loss 3.437289 (+0.25z)| norm 0.2588 (-0.20z)| lr 4.51e-04 | 4170.58 ms | 32.4% bf16 MFU | 125494 tok/s step 9528/19560 | loss 3.430649 (+0.10z)| norm 0.2348 (-0.75z)| lr 4.51e-04 | 4169.51 ms | 32.4% bf16 MFU | 125506 tok/s step 9529/19560 | loss 3.362549 (-1.36z)| norm 0.2351 (-0.74z)| lr 4.51e-04 | 4165.81 ms | 32.4% bf16 MFU | 125524 tok/s step 9530/19560 | loss 3.590659 (+3.42z)| norm 0.2511 (-0.36z)| lr 4.51e-04 | 4167.41 ms | 32.4% bf16 MFU | 125538 tok/s step 9531/19560 | loss 3.438108 (+0.24z)| norm 0.2414 (-0.58z)| lr 4.51e-04 | 4179.26 ms | 32.3% bf16 MFU | 125533 tok/s step 9532/19560 | loss 3.403646 (-0.48z)| norm 0.2771 (+0.23z)| lr 4.51e-04 | 4180.19 ms | 32.3% bf16 MFU | 125528 tok/s step 9533/19560 | loss 3.465691 (+0.80z)| norm 0.2702 (+0.07z)| lr 4.51e-04 | 4160.83 ms | 32.4% bf16 MFU | 125552 tok/s step 9534/19560 | loss 3.417866 (-0.19z)| norm 0.2482 (-0.43z)| lr 
4.50e-04 | 4160.58 ms | 32.5% bf16 MFU | 125575 tok/s step 9535/19560 | loss 3.409126 (-0.37z)| norm 0.2458 (-0.48z)| lr 4.50e-04 | 4179.57 ms | 32.3% bf16 MFU | 125568 tok/s step 9536/19560 | loss 3.391467 (-0.73z)| norm 0.2429 (-0.55z)| lr 4.50e-04 | 4165.54 ms | 32.4% bf16 MFU | 125583 tok/s step 9537/19560 | loss 3.454959 (+0.58z)| norm 0.2417 (-0.57z)| lr 4.50e-04 | 4164.65 ms | 32.4% bf16 MFU | 125598 tok/s step 9538/19560 | loss 3.412446 (-0.30z)| norm 0.2182 (-1.10z)| lr 4.50e-04 | 4173.85 ms | 32.3% bf16 MFU | 125599 tok/s step 9539/19560 | loss 3.407383 (-0.39z)| norm 0.2279 (-0.87z)| lr 4.50e-04 | 4166.93 ms | 32.4% bf16 MFU | 125610 tok/s step 9540/19560 | loss 3.443173 (+0.35z)| norm 0.2395 (-0.60z)| lr 4.50e-04 | 4160.04 ms | 32.5% bf16 MFU | 125631 tok/s step 9541/19560 | loss 3.442810 (+0.34z)| norm 0.2253 (-0.92z)| lr 4.50e-04 | 4167.51 ms | 32.4% bf16 MFU | 125640 tok/s step 9542/19560 | loss 3.360949 (-1.40z)| norm 0.2434 (-0.50z)| lr 4.50e-04 | 4171.02 ms | 32.4% bf16 MFU | 125643 tok/s step 9543/19560 | loss 3.460454 (+0.72z)| norm 0.2466 (-0.43z)| lr 4.50e-04 | 4167.83 ms | 32.4% bf16 MFU | 125650 tok/s step 9544/19560 | loss 3.406136 (-0.44z)| norm 0.2551 (-0.20z)| lr 4.50e-04 | 4177.01 ms | 32.3% bf16 MFU | 125643 tok/s step 9545/19560 | loss 3.377820 (-1.03z)| norm 0.2489 (-0.35z)| lr 4.50e-04 | 4178.18 ms | 32.3% bf16 MFU | 125635 tok/s step 9546/19560 | loss 3.440438 (+0.31z)| norm 0.2605 (-0.04z)| lr 4.50e-04 | 4177.19 ms | 32.3% bf16 MFU | 125629 tok/s step 9547/19560 | loss 3.396206 (-0.66z)| norm 0.2248 (-0.97z)| lr 4.50e-04 | 4167.51 ms | 32.4% bf16 MFU | 125638 tok/s step 9548/19560 | loss 3.423392 (-0.07z)| norm 0.2498 (-0.30z)| lr 4.50e-04 | 4172.96 ms | 32.4% bf16 MFU | 125638 tok/s step 9549/19560 | loss 3.423641 (-0.06z)| norm 0.2479 (-0.34z)| lr 4.49e-04 | 4172.09 ms | 32.4% bf16 MFU | 125639 tok/s step 9550/19560 | loss 3.413306 (-0.29z)| norm 0.2632 (+0.07z)| lr 4.49e-04 | 4166.83 ms | 32.4% bf16 MFU | 125649 tok/s step 
9551/19560 | loss 3.455571 (+0.64z)| norm 0.2784 (+0.47z)| lr 4.49e-04 | 4161.67 ms | 32.4% bf16 MFU | 125665 tok/s step 9552/19560 | loss 3.498607 (+1.63z)| norm 0.2473 (-0.39z)| lr 4.49e-04 | 4167.85 ms | 32.4% bf16 MFU | 125672 tok/s step 9553/19560 | loss 3.390811 (-0.79z)| norm 0.2680 (+0.36z)| lr 4.49e-04 | 4167.26 ms | 32.4% bf16 MFU | 125679 tok/s step 9554/19560 | loss 3.383233 (-0.95z)| norm 0.2565 (-0.05z)| lr 4.49e-04 | 4179.62 ms | 32.3% bf16 MFU | 125667 tok/s step 9555/19560 | loss 3.390116 (-0.78z)| norm 0.2442 (-0.48z)| lr 4.49e-04 | 4166.90 ms | 32.4% bf16 MFU | 125674 tok/s step 9556/19560 | loss 3.341331 (-1.88z)| norm 0.2585 (+0.05z)| lr 4.49e-04 | 4170.20 ms | 32.4% bf16 MFU | 125677 tok/s step 9557/19560 | loss 3.379832 (-0.98z)| norm 0.2597 (+0.09z)| lr 4.49e-04 | 4164.66 ms | 32.4% bf16 MFU | 125687 tok/s step 9558/19560 | loss 3.442749 (+0.46z)| norm 0.2327 (-0.88z)| lr 4.49e-04 | 4165.33 ms | 32.4% bf16 MFU | 125697 tok/s step 9559/19560 | loss 3.438841 (+0.37z)| norm 0.2685 (+0.43z)| lr 4.49e-04 | 4161.40 ms | 32.4% bf16 MFU | 125711 tok/s step 9560/19560 | loss 3.438859 (+0.36z)| norm 0.2664 (+0.34z)| lr 4.49e-04 | 4169.59 ms | 32.4% bf16 MFU | 125713 tok/s step 9561/19560 | loss 3.430861 (+0.19z)| norm 0.2613 (+0.16z)| lr 4.49e-04 | 4170.28 ms | 32.4% bf16 MFU | 125713 tok/s step 9562/19560 | loss 3.381135 (-0.94z)| norm 0.2570 (+0.00z)| lr 4.49e-04 | 4168.15 ms | 32.4% bf16 MFU | 125717 tok/s step 9563/19560 | loss 3.557796 (+3.01z)| norm 0.2742 (+0.62z)| lr 4.48e-04 | 4160.30 ms | 32.5% bf16 MFU | 125732 tok/s step 9564/19560 | loss 3.356607 (-1.45z)| norm 0.2520 (-0.19z)| lr 4.48e-04 | 4171.16 ms | 32.4% bf16 MFU | 125730 tok/s step 9565/19560 | loss 3.481770 (+1.32z)| norm 0.2581 (+0.03z)| lr 4.48e-04 | 4217.80 ms | 32.0% bf16 MFU | 125659 tok/s step 9566/19560 | loss 3.363824 (-1.27z)| norm 0.2613 (+0.15z)| lr 4.48e-04 | 4168.44 ms | 32.4% bf16 MFU | 125664 tok/s step 9567/19560 | loss 3.439988 (+0.40z)| norm 0.2488 (-0.31z)| lr 
4.48e-04 | 4187.60 ms | 32.2% bf16 MFU | 125641 tok/s step 9568/19560 | loss 3.401073 (-0.45z)| norm 0.2609 (+0.13z)| lr 4.48e-04 | 4171.24 ms | 32.4% bf16 MFU | 125644 tok/s step 9569/19560 | loss 3.438767 (+0.37z)| norm 0.2664 (+0.32z)| lr 4.48e-04 | 4167.37 ms | 32.4% bf16 MFU | 125652 tok/s step 9570/19560 | loss 3.383539 (-0.83z)| norm 0.2484 (-0.34z)| lr 4.48e-04 | 4164.28 ms | 32.4% bf16 MFU | 125664 tok/s step 9571/19560 | loss 3.323664 (-2.12z)| norm 0.2535 (-0.16z)| lr 4.48e-04 | 4180.98 ms | 32.3% bf16 MFU | 125651 tok/s step 9572/19560 | loss 3.431123 (+0.21z)| norm 0.2503 (-0.27z)| lr 4.48e-04 | 4167.25 ms | 32.4% bf16 MFU | 125659 tok/s step 9573/19560 | loss 3.413153 (-0.19z)| norm 0.2380 (-0.73z)| lr 4.48e-04 | 4158.93 ms | 32.5% bf16 MFU | 125679 tok/s step 9574/19560 | loss 3.391165 (-0.66z)| norm 0.2387 (-0.69z)| lr 4.48e-04 | 4163.92 ms | 32.4% bf16 MFU | 125691 tok/s step 9575/19560 | loss 3.343358 (-1.67z)| norm 0.2481 (-0.35z)| lr 4.48e-04 | 4174.53 ms | 32.3% bf16 MFU | 125686 tok/s step 9576/19560 | loss 3.496213 (+1.60z)| norm 0.2447 (-0.47z)| lr 4.48e-04 | 4161.66 ms | 32.4% bf16 MFU | 125701 tok/s step 9577/19560 | loss 3.349484 (-1.52z)| norm 0.2355 (-0.81z)| lr 4.48e-04 | 4163.50 ms | 32.4% bf16 MFU | 125712 tok/s step 9578/19560 | loss 3.433132 (+0.25z)| norm 0.2473 (-0.37z)| lr 4.47e-04 | 4158.55 ms | 32.5% bf16 MFU | 125730 tok/s step 9579/19560 | loss 3.461338 (+0.84z)| norm 0.2715 (+0.51z)| lr 4.47e-04 | 4165.87 ms | 32.4% bf16 MFU | 125736 tok/s step 9580/19560 | loss 3.472262 (+1.06z)| norm 0.2405 (-0.63z)| lr 4.47e-04 | 4162.31 ms | 32.4% bf16 MFU | 125747 tok/s step 9581/19560 | loss 3.350102 (-1.52z)| norm 0.2632 (+0.20z)| lr 4.47e-04 | 4163.13 ms | 32.4% bf16 MFU | 125757 tok/s step 9582/19560 | loss 3.418326 (-0.08z)| norm 0.2414 (-0.59z)| lr 4.47e-04 | 4159.59 ms | 32.5% bf16 MFU | 125771 tok/s step 9583/19560 | loss 3.421041 (-0.02z)| norm 0.2575 (+0.00z)| lr 4.47e-04 | 4167.42 ms | 32.4% bf16 MFU | 125773 tok/s step 
9584/19560 | loss 3.394690 (-0.57z)| norm 0.2461 (-0.42z)| lr 4.47e-04 | 4164.99 ms | 32.4% bf16 MFU | 125778 tok/s step 9585/19560 | loss 3.479775 (+1.21z)| norm 0.2675 (+0.37z)| lr 4.47e-04 | 4168.65 ms | 32.4% bf16 MFU | 125778 tok/s step 9586/19560 | loss 3.367761 (-1.12z)| norm 0.2438 (-0.50z)| lr 4.47e-04 | 4166.64 ms | 32.4% bf16 MFU | 125780 tok/s step 9587/19560 | loss 3.399058 (-0.46z)| norm 0.2401 (-0.63z)| lr 4.47e-04 | 4166.63 ms | 32.4% bf16 MFU | 125783 tok/s step 9588/19560 | loss 3.414155 (-0.14z)| norm 0.2625 (+0.20z)| lr 4.47e-04 | 4160.07 ms | 32.5% bf16 MFU | 125795 tok/s step 9589/19560 | loss 3.480607 (+1.25z)| norm 0.2435 (-0.49z)| lr 4.47e-04 | 4172.38 ms | 32.4% bf16 MFU | 125788 tok/s step 9590/19560 | loss 3.431970 (+0.22z)| norm 0.2801 (+0.86z)| lr 4.47e-04 | 4206.73 ms | 32.1% bf16 MFU | 125730 tok/s step 9591/19560 | loss 3.375322 (-0.97z)| norm 0.2602 (+0.12z)| lr 4.47e-04 | 4166.04 ms | 32.4% bf16 MFU | 125736 tok/s step 9592/19560 | loss 3.413526 (-0.16z)| norm 0.2545 (-0.08z)| lr 4.47e-04 | 4170.64 ms | 32.4% bf16 MFU | 125735 tok/s step 9593/19560 | loss 3.388971 (-0.68z)| norm 0.2733 (+0.60z)| lr 4.46e-04 | 4162.99 ms | 32.4% bf16 MFU | 125745 tok/s step 9594/19560 | loss 3.372458 (-1.02z)| norm 0.2545 (-0.10z)| lr 4.46e-04 | 4171.90 ms | 32.4% bf16 MFU | 125742 tok/s step 9595/19560 | loss 3.365316 (-1.18z)| norm 0.2481 (-0.34z)| lr 4.46e-04 | 4156.17 ms | 32.5% bf16 MFU | 125762 tok/s step 9596/19560 | loss 3.467667 (+0.98z)| norm 0.2568 (-0.00z)| lr 4.46e-04 | 4167.67 ms | 32.4% bf16 MFU | 125764 tok/s step 9597/19560 | loss 3.370956 (-1.06z)| norm 0.2231 (-1.25z)| lr 4.46e-04 | 4167.45 ms | 32.4% bf16 MFU | 125766 tok/s step 9598/19560 | loss 3.390495 (-0.65z)| norm 0.2614 (+0.20z)| lr 4.46e-04 | 4165.83 ms | 32.4% bf16 MFU | 125770 tok/s step 9599/19560 | loss 3.425921 (+0.09z)| norm 0.2609 (+0.18z)| lr 4.46e-04 | 4166.68 ms | 32.4% bf16 MFU | 125773 tok/s step 9600/19560 | loss 3.510511 (+1.84z)| norm 0.2470 (-0.33z)| lr 
4.46e-04 | 4165.82 ms | 32.4% bf16 MFU | 125777 tok/s step 9601/19560 | loss 3.410964 (-0.23z)| norm 0.2425 (-0.50z)| lr 4.46e-04 | 4321.81 ms | 31.2% bf16 MFU | 125554 tok/s step 9602/19560 | loss 3.408334 (-0.27z)| norm 0.2413 (-0.54z)| lr 4.46e-04 | 4162.53 ms | 32.4% bf16 MFU | 125574 tok/s step 9603/19560 | loss 3.427177 (+0.13z)| norm 0.2526 (-0.12z)| lr 4.46e-04 | 4176.81 ms | 32.3% bf16 MFU | 125571 tok/s step 9604/19560 | loss 3.405017 (-0.34z)| norm 0.2429 (-0.48z)| lr 4.46e-04 | 4161.85 ms | 32.4% bf16 MFU | 125592 tok/s step 9605/19560 | loss 3.393117 (-0.59z)| norm 0.2630 (+0.28z)| lr 4.46e-04 | 4169.51 ms | 32.4% bf16 MFU | 125599 tok/s step 9606/19560 | loss 3.434495 (+0.31z)| norm 0.2372 (-0.70z)| lr 4.46e-04 | 4170.04 ms | 32.4% bf16 MFU | 125606 tok/s step 9607/19560 | loss 3.417636 (-0.04z)| norm 0.2622 (+0.24z)| lr 4.46e-04 | 4165.39 ms | 32.4% bf16 MFU | 125619 tok/s step 9608/19560 | loss 3.387871 (-0.69z)| norm 0.4662 (+6.48z)| lr 4.45e-04 | 4165.20 ms | 32.4% bf16 MFU | 125631 tok/s step 9609/19560 | loss 3.409426 (-0.23z)| norm 0.2534 (-0.13z)| lr 4.45e-04 | 4168.02 ms | 32.4% bf16 MFU | 125639 tok/s step 9610/19560 | loss 3.427515 (+0.17z)| norm 0.2487 (-0.27z)| lr 4.45e-04 | 4164.22 ms | 32.4% bf16 MFU | 125652 tok/s step 9611/19560 | loss 3.428461 (+0.18z)| norm 0.2411 (-0.50z)| lr 4.45e-04 | 4164.89 ms | 32.4% bf16 MFU | 125664 tok/s step 9612/19560 | loss 3.409436 (-0.24z)| norm 0.2626 (+0.16z)| lr 4.45e-04 | 4158.81 ms | 32.5% bf16 MFU | 125684 tok/s step 9613/19560 | loss 3.391277 (-0.64z)| norm 0.2602 (+0.09z)| lr 4.45e-04 | 4168.39 ms | 32.4% bf16 MFU | 125689 tok/s step 9614/19560 | loss 3.334252 (-1.87z)| norm 0.2354 (-0.67z)| lr 4.45e-04 | 4164.17 ms | 32.4% bf16 MFU | 125700 tok/s step 9615/19560 | loss 3.367507 (-1.13z)| norm 0.2454 (-0.36z)| lr 4.45e-04 | 4166.34 ms | 32.4% bf16 MFU | 125707 tok/s step 9616/19560 | loss 3.389777 (-0.63z)| norm 0.2366 (-0.62z)| lr 4.45e-04 | 4160.71 ms | 32.5% bf16 MFU | 125722 tok/s step 
9617/19560 | loss 3.446665 (+0.63z)| norm 0.2335 (-0.71z)| lr 4.45e-04 | 4164.49 ms | 32.4% bf16 MFU | 125730 tok/s step 9618/19560 | loss 3.425831 (+0.17z)| norm 0.2476 (-0.27z)| lr 4.45e-04 | 4162.12 ms | 32.4% bf16 MFU | 125742 tok/s step 9619/19560 | loss 3.420742 (+0.06z)| norm 0.2382 (-0.55z)| lr 4.45e-04 | 4158.03 ms | 32.5% bf16 MFU | 125760 tok/s step 9620/19560 | loss 3.407096 (-0.24z)| norm 0.2460 (-0.32z)| lr 4.45e-04 | 4170.35 ms | 32.4% bf16 MFU | 125757 tok/s step 9621/19560 | loss 3.469513 (+1.15z)| norm 0.2168 (-1.20z)| lr 4.45e-04 | 4166.05 ms | 32.4% bf16 MFU | 125762 tok/s step 9622/19560 | loss 3.488317 (+1.54z)| norm 0.2643 (+0.26z)| lr 4.45e-04 | 4162.72 ms | 32.4% bf16 MFU | 125771 tok/s step 9623/19560 | loss 3.450513 (+0.69z)| norm 0.2633 (+0.22z)| lr 4.44e-04 | 4170.69 ms | 32.4% bf16 MFU | 125768 tok/s step 9624/19560 | loss 3.397647 (-0.48z)| norm 0.2449 (-0.34z)| lr 4.44e-04 | 4177.64 ms | 32.3% bf16 MFU | 125755 tok/s step 9625/19560 | loss 3.345801 (-1.61z)| norm 0.2566 (+0.01z)| lr 4.44e-04 | 4171.19 ms | 32.4% bf16 MFU | 125752 tok/s step 9626/19560 | loss 3.458022 (+0.89z)| norm 0.2466 (-0.29z)| lr 4.44e-04 | 4171.70 ms | 32.4% bf16 MFU | 125748 tok/s step 9627/19560 | loss 3.359859 (-1.30z)| norm 0.2729 (+0.51z)| lr 4.44e-04 | 4163.11 ms | 32.4% bf16 MFU | 125757 tok/s step 9628/19560 | loss 3.412340 (-0.13z)| norm 0.2659 (+0.30z)| lr 4.44e-04 | 4161.98 ms | 32.4% bf16 MFU | 125768 tok/s step 9629/19560 | loss 3.443480 (+0.56z)| norm 0.2548 (-0.05z)| lr 4.44e-04 | 4170.47 ms | 32.4% bf16 MFU | 125765 tok/s step 9630/19560 | loss 3.414366 (-0.10z)| norm 0.2714 (+0.45z)| lr 4.44e-04 | 4165.82 ms | 32.4% bf16 MFU | 125770 tok/s step 9631/19560 | loss 3.509143 (+1.98z)| norm 0.2478 (-0.27z)| lr 4.44e-04 | 4160.77 ms | 32.5% bf16 MFU | 125782 tok/s step 9632/19560 | loss 3.424957 (+0.12z)| norm 0.2669 (+0.31z)| lr 4.44e-04 | 4172.88 ms | 32.4% bf16 MFU | 125775 tok/s step 9633/19560 | loss 3.370251 (-1.09z)| norm 0.2505 (-0.19z)| lr 
4.44e-04 | 4172.88 ms | 32.4% bf16 MFU | 125768 tok/s step 9634/19560 | loss 3.304271 (-2.51z)| norm 0.2703 (+0.42z)| lr 4.44e-04 | 4164.98 ms | 32.4% bf16 MFU | 125774 tok/s step 9635/19560 | loss 3.430350 (+0.29z)| norm 0.2723 (+0.48z)| lr 4.44e-04 | 4172.99 ms | 32.4% bf16 MFU | 125767 tok/s step 9636/19560 | loss 3.406554 (-0.23z)| norm 0.2398 (-0.52z)| lr 4.44e-04 | 4164.17 ms | 32.4% bf16 MFU | 125774 tok/s step 9637/19560 | loss 3.428317 (+0.25z)| norm 0.2484 (-0.25z)| lr 4.44e-04 | 4178.54 ms | 32.3% bf16 MFU | 125759 tok/s step 9638/19560 | loss 3.396862 (-0.45z)| norm 0.2571 (+0.02z)| lr 4.43e-04 | 4159.12 ms | 32.5% bf16 MFU | 125774 tok/s step 9639/19560 | loss 3.378305 (-0.88z)| norm 0.2292 (-0.83z)| lr 4.43e-04 | 4168.28 ms | 32.4% bf16 MFU | 125774 tok/s step 9640/19560 | loss 3.479207 (+1.37z)| norm 0.2691 (+0.39z)| lr 4.43e-04 | 4308.50 ms | 31.3% bf16 MFU | 125570 tok/s step 9641/19560 | loss 3.460996 (+0.95z)| norm 0.2608 (+0.13z)| lr 4.43e-04 | 5170.24 ms | 26.1% bf16 MFU | 124361 tok/s step 9642/19560 | loss 3.443357 (+0.56z)| norm 0.2537 (-0.03z)| lr 4.43e-04 | 4555.62 ms | 29.6% bf16 MFU | 123898 tok/s step 9643/19560 | loss 3.383313 (-0.77z)| norm 0.2480 (-0.26z)| lr 4.43e-04 | 4514.62 ms | 29.9% bf16 MFU | 123509 tok/s step 9644/19560 | loss 3.413581 (-0.10z)| norm 0.2546 (+0.03z)| lr 4.43e-04 | 4358.54 ms | 31.0% bf16 MFU | 123348 tok/s step 9645/19560 | loss 3.360445 (-1.27z)| norm 0.2783 (+1.05z)| lr 4.43e-04 | 4241.99 ms | 31.8% bf16 MFU | 123361 tok/s step 9646/19560 | loss 3.434876 (+0.40z)| norm 0.2320 (-0.93z)| lr 4.43e-04 | 4219.91 ms | 32.0% bf16 MFU | 123405 tok/s step 9647/19560 | loss 3.400085 (-0.37z)| norm 0.2594 (+0.25z)| lr 4.43e-04 | 4254.60 ms | 31.7% bf16 MFU | 123396 tok/s step 9648/19560 | loss 3.402957 (-0.30z)| norm 0.2793 (+1.11z)| lr 4.43e-04 | 4245.11 ms | 31.8% bf16 MFU | 123401 tok/s step 9649/19560 | loss 3.437492 (+0.48z)| norm 0.2475 (-0.25z)| lr 4.43e-04 | 4218.44 ms | 32.0% bf16 MFU | 123445 tok/s step 
9650/19560 | loss 3.373335 (-0.98z)| norm 0.2386 (-0.64z)| lr 4.43e-04 | 4311.62 ms | 31.3% bf16 MFU | 123353 tok/s step 9651/19560 | loss 3.386525 (-0.67z)| norm 0.2468 (-0.29z)| lr 4.43e-04 | 4205.29 ms | 32.1% bf16 MFU | 123419 tok/s step 9652/19560 | loss 3.379563 (-0.81z)| norm 0.2358 (-0.75z)| lr 4.43e-04 | 4214.26 ms | 32.0% bf16 MFU | 123469 tok/s step 9653/19560 | loss 3.393534 (-0.49z)| norm 0.2578 (+0.20z)| lr 4.42e-04 | 4179.76 ms | 32.3% bf16 MFU | 123567 tok/s step 9654/19560 | loss 3.407504 (-0.18z)| norm 0.2494 (-0.17z)| lr 4.42e-04 | 4158.97 ms | 32.5% bf16 MFU | 123692 tok/s step 9655/19560 | loss 3.445929 (+0.70z)| norm 0.2487 (-0.20z)| lr 4.42e-04 | 4199.12 ms | 32.2% bf16 MFU | 123750 tok/s step 9656/19560 | loss 3.367714 (-1.07z)| norm 0.2510 (-0.10z)| lr 4.42e-04 | 4181.28 ms | 32.3% bf16 MFU | 123832 tok/s step 9657/19560 | loss 3.335916 (-1.77z)| norm 0.2372 (-0.71z)| lr 4.42e-04 | 4194.84 ms | 32.2% bf16 MFU | 123889 tok/s step 9658/19560 | loss 3.406158 (-0.17z)| norm 0.2453 (-0.35z)| lr 4.42e-04 | 4179.66 ms | 32.3% bf16 MFU | 123967 tok/s step 9659/19560 | loss 3.388738 (-0.58z)| norm 0.2428 (-0.46z)| lr 4.42e-04 | 4183.29 ms | 32.3% bf16 MFU | 124035 tok/s step 9660/19560 | loss 3.425996 (+0.31z)| norm 0.2407 (-0.54z)| lr 4.42e-04 | 4193.27 ms | 32.2% bf16 MFU | 124085 tok/s step 9661/19560 | loss 3.397219 (-0.37z)| norm 0.2513 (-0.07z)| lr 4.42e-04 | 4174.80 ms | 32.3% bf16 MFU | 124160 tok/s step 9662/19560 | loss 3.354617 (-1.37z)| norm 0.2721 (+0.82z)| lr 4.42e-04 | 4187.38 ms | 32.2% bf16 MFU | 124212 tok/s step 9663/19560 | loss 3.385598 (-0.63z)| norm 0.2534 (+0.01z)| lr 4.42e-04 | 4186.06 ms | 32.3% bf16 MFU | 124264 tok/s step 9664/19560 | loss 3.444009 (+0.76z)| norm 0.2567 (+0.15z)| lr 4.42e-04 | 4190.97 ms | 32.2% bf16 MFU | 124306 tok/s step 9665/19560 | loss 3.381301 (-0.73z)| norm 0.2598 (+0.28z)| lr 4.42e-04 | 4218.33 ms | 32.0% bf16 MFU | 124305 tok/s step 9666/19560 | loss 3.376723 (-0.83z)| norm 0.2436 (-0.44z)| lr 
4.42e-04 | 4360.32 ms | 31.0% bf16 MFU | 124101 tok/s step 9667/19560 | loss 3.450029 (+0.92z)| norm 0.2619 (+0.36z)| lr 4.41e-04 | 4167.77 ms | 32.4% bf16 MFU | 124186 tok/s step 9668/19560 | loss 3.440911 (+0.70z)| norm 0.2882 (+1.49z)| lr 4.41e-04 | 4182.43 ms | 32.3% bf16 MFU | 124245 tok/s step 9669/19560 | loss 3.394831 (-0.39z)| norm 0.2725 (+0.79z)| lr 4.41e-04 | 4190.69 ms | 32.2% bf16 MFU | 124288 tok/s step 9670/19560 | loss 3.398179 (-0.32z)| norm 0.2487 (-0.27z)| lr 4.41e-04 | 4180.41 ms | 32.3% bf16 MFU | 124344 tok/s step 9671/19560 | loss 3.408435 (-0.06z)| norm 0.2836 (+1.26z)| lr 4.41e-04 | 4203.79 ms | 32.1% bf16 MFU | 124363 tok/s step 9672/19560 | loss 3.458216 (+1.12z)| norm 0.2383 (-0.73z)| lr 4.41e-04 | 4171.23 ms | 32.4% bf16 MFU | 124429 tok/s step 9673/19560 | loss 3.400540 (-0.27z)| norm 0.2785 (+1.02z)| lr 4.41e-04 | 4216.32 ms | 32.0% bf16 MFU | 124425 tok/s step 9674/19560 | loss 3.373971 (-0.89z)| norm 0.2461 (-0.39z)| lr 4.41e-04 | 4215.72 ms | 32.0% bf16 MFU | 124422 tok/s step 9675/19560 | loss 3.362374 (-1.16z)| norm 0.2480 (-0.32z)| lr 4.41e-04 | 4218.16 ms | 32.0% bf16 MFU | 124416 tok/s step 9676/19560 | loss 3.438194 (+0.65z)| norm 0.2617 (+0.28z)| lr 4.41e-04 | 4308.16 ms | 31.3% bf16 MFU | 124280 tok/s step 9677/19560 | loss 3.511975 (+2.34z)| norm 0.2446 (-0.46z)| lr 4.41e-04 | 4196.25 ms | 32.2% bf16 MFU | 124313 tok/s step 9678/19560 | loss 3.373214 (-0.89z)| norm 0.2496 (-0.24z)| lr 4.41e-04 | 4231.31 ms | 31.9% bf16 MFU | 124293 tok/s step 9679/19560 | loss 3.434468 (+0.54z)| norm 0.2356 (-0.84z)| lr 4.41e-04 | 4184.68 ms | 32.3% bf16 MFU | 124342 tok/s step 9680/19560 | loss 3.406098 (-0.10z)| norm 0.2534 (-0.06z)| lr 4.41e-04 | 4201.93 ms | 32.1% bf16 MFU | 124364 tok/s step 9681/19560 | loss 3.440769 (+0.71z)| norm 0.2331 (-0.94z)| lr 4.41e-04 | 4198.82 ms | 32.2% bf16 MFU | 124389 tok/s step 9682/19560 | loss 3.412427 (+0.03z)| norm 0.2416 (-0.56z)| lr 4.40e-04 | 4187.42 ms | 32.2% bf16 MFU | 124430 tok/s step 
9683/19560 | loss 3.411160 (-0.00z)| norm 0.2505 (-0.17z)| lr 4.40e-04 | 4177.12 ms | 32.3% bf16 MFU | 124484 tok/s step 9684/19560 | loss 3.381562 (-0.72z)| norm 0.2416 (-0.56z)| lr 4.40e-04 | 4178.19 ms | 32.3% bf16 MFU | 124534 tok/s step 9685/19560 | loss 3.507125 (+2.23z)| norm 0.2372 (-0.74z)| lr 4.40e-04 | 4177.75 ms | 32.3% bf16 MFU | 124582 tok/s step 9686/19560 | loss 3.390608 (-0.51z)| norm 0.2521 (-0.10z)| lr 4.40e-04 | 4180.82 ms | 32.3% bf16 MFU | 124623 tok/s step 9687/19560 | loss 3.415985 (+0.09z)| norm 0.2670 (+0.56z)| lr 4.40e-04 | 4186.34 ms | 32.3% bf16 MFU | 124654 tok/s step 9688/19560 | loss 3.369922 (-0.98z)| norm 0.2386 (-0.68z)| lr 4.40e-04 | 4182.69 ms | 32.3% bf16 MFU | 124688 tok/s step 9689/19560 | loss 3.392504 (-0.44z)| norm 0.2492 (-0.21z)| lr 4.40e-04 | 4176.07 ms | 32.3% bf16 MFU | 124731 tok/s step 9690/19560 | loss 3.367349 (-1.03z)| norm 0.2404 (-0.59z)| lr 4.40e-04 | 4170.02 ms | 32.4% bf16 MFU | 124781 tok/s step 9691/19560 | loss 3.383906 (-0.64z)| norm 0.2652 (+0.50z)| lr 4.40e-04 | 4174.95 ms | 32.3% bf16 MFU | 124821 tok/s step 9692/19560 | loss 3.447143 (+0.91z)| norm 0.2365 (-0.75z)| lr 4.40e-04 | 4181.35 ms | 32.3% bf16 MFU | 124849 tok/s step 9693/19560 | loss 3.415014 (+0.13z)| norm 0.2604 (+0.29z)| lr 4.40e-04 | 4184.50 ms | 32.3% bf16 MFU | 124872 tok/s step 9694/19560 | loss 3.475853 (+1.63z)| norm 0.2437 (-0.43z)| lr 4.40e-04 | 4170.96 ms | 32.4% bf16 MFU | 124913 tok/s step 9695/19560 | loss 3.385664 (-0.62z)| norm 0.2558 (+0.09z)| lr 4.40e-04 | 4173.85 ms | 32.3% bf16 MFU | 124948 tok/s step 9696/19560 | loss 3.474814 (+1.58z)| norm 0.2397 (-0.60z)| lr 4.40e-04 | 4187.82 ms | 32.2% bf16 MFU | 124960 tok/s step 9697/19560 | loss 3.416392 (+0.14z)| norm 0.2711 (+0.77z)| lr 4.39e-04 | 4174.58 ms | 32.3% bf16 MFU | 124992 tok/s step 9698/19560 | loss 3.370267 (-1.00z)| norm 0.2280 (-1.10z)| lr 4.39e-04 | 4174.24 ms | 32.3% bf16 MFU | 125022 tok/s step 9699/19560 | loss 3.460221 (+1.22z)| norm 0.2669 (+0.58z)| lr 
4.39e-04 | 4179.40 ms | 32.3% bf16 MFU | 125043 tok/s step 9700/19560 | loss 3.411370 (-0.00z)| norm 0.2427 (-0.46z)| lr 4.39e-04 | 4175.19 ms | 32.3% bf16 MFU | 125070 tok/s step 9701/19560 | loss 3.377727 (-0.84z)| norm 0.2592 (+0.25z)| lr 4.39e-04 | 4179.02 ms | 32.3% bf16 MFU | 125089 tok/s step 9702/19560 | loss 3.412399 (+0.02z)| norm 0.2613 (+0.33z)| lr 4.39e-04 | 4181.91 ms | 32.3% bf16 MFU | 125103 tok/s step 9703/19560 | loss 3.384706 (-0.68z)| norm 0.2697 (+0.69z)| lr 4.39e-04 | 4174.83 ms | 32.3% bf16 MFU | 125127 tok/s step 9704/19560 | loss 3.415483 (+0.11z)| norm 0.2521 (-0.08z)| lr 4.39e-04 | 4179.47 ms | 32.3% bf16 MFU | 125143 tok/s step 9705/19560 | loss 3.451068 (+1.01z)| norm 0.2410 (-0.57z)| lr 4.39e-04 | 4179.72 ms | 32.3% bf16 MFU | 125158 tok/s step 9706/19560 | loss 3.395615 (-0.41z)| norm 0.2626 (+0.37z)| lr 4.39e-04 | 4251.80 ms | 31.8% bf16 MFU | 125065 tok/s step 9707/19560 | loss 3.355658 (-1.43z)| norm 0.2453 (-0.38z)| lr 4.39e-04 | 4166.36 ms | 32.4% bf16 MFU | 125104 tok/s step 9708/19560 | loss 3.354884 (-1.43z)| norm 0.2355 (-0.80z)| lr 4.39e-04 | 4163.22 ms | 32.4% bf16 MFU | 125145 tok/s step 9709/19560 | loss 3.387989 (-0.58z)| norm 0.2437 (-0.44z)| lr 4.39e-04 | 4188.19 ms | 32.2% bf16 MFU | 125147 tok/s step 9710/19560 | loss 3.421021 (+0.28z)| norm 0.2703 (+0.71z)| lr 4.39e-04 | 4170.24 ms | 32.4% bf16 MFU | 125176 tok/s step 9711/19560 | loss 3.398162 (-0.31z)| norm 0.2357 (-0.78z)| lr 4.39e-04 | 4187.71 ms | 32.2% bf16 MFU | 125177 tok/s step 9712/19560 | loss 3.351089 (-1.52z)| norm 0.2556 (+0.07z)| lr 4.38e-04 | 4175.88 ms | 32.3% bf16 MFU | 125196 tok/s step 9713/19560 | loss 3.445005 (+0.93z)| norm 0.2756 (+0.94z)| lr 4.38e-04 | 4177.02 ms | 32.3% bf16 MFU | 125212 tok/s step 9714/19560 | loss 3.495039 (+2.18z)| norm 0.2567 (+0.12z)| lr 4.38e-04 | 4175.74 ms | 32.3% bf16 MFU | 125229 tok/s step 9715/19560 | loss 3.451711 (+1.05z)| norm 0.2601 (+0.26z)| lr 4.38e-04 | 4176.32 ms | 32.3% bf16 MFU | 125244 tok/s step 
9716/19560 | loss 3.338860 (-1.81z)| norm 0.2532 (-0.04z)| lr 4.38e-04 | 4174.75 ms | 32.3% bf16 MFU | 125262 tok/s step 9717/19560 | loss 3.419494 (+0.25z)| norm 0.2732 (+0.82z)| lr 4.38e-04 | 4181.30 ms | 32.3% bf16 MFU | 125268 tok/s step 9718/19560 | loss 3.360598 (-1.24z)| norm 0.2279 (-1.13z)| lr 4.38e-04 | 4177.19 ms | 32.3% bf16 MFU | 125280 tok/s step 9719/19560 | loss 3.381613 (-0.71z)| norm 0.2580 (+0.18z)| lr 4.38e-04 | 4178.65 ms | 32.3% bf16 MFU | 125289 tok/s step 9720/19560 | loss 3.446167 (+0.93z)| norm 0.2443 (-0.41z)| lr 4.38e-04 | 4178.75 ms | 32.3% bf16 MFU | 125298 tok/s step 9721/19560 | loss 3.465256 (+1.40z)| norm 0.2945 (+1.74z)| lr 4.38e-04 | 4169.55 ms | 32.4% bf16 MFU | 125320 tok/s step 9722/19560 | loss 3.467185 (+1.42z)| norm 0.2667 (+0.54z)| lr 4.38e-04 | 4169.15 ms | 32.4% bf16 MFU | 125342 tok/s step 9723/19560 | loss 3.440993 (+0.75z)| norm 0.2531 (-0.04z)| lr 4.38e-04 | 4181.72 ms | 32.3% bf16 MFU | 125344 tok/s step 9724/19560 | loss 3.465439 (+1.37z)| norm 0.3363 (+3.34z)| lr 4.38e-04 | 4207.81 ms | 32.1% bf16 MFU | 125307 tok/s step 9725/19560 | loss 3.392929 (-0.47z)| norm 0.2857 (+1.25z)| lr 4.38e-04 | 4170.04 ms | 32.4% bf16 MFU | 125328 tok/s step 9726/19560 | loss 3.385983 (-0.65z)| norm 0.2767 (+0.87z)| lr 4.38e-04 | 4172.51 ms | 32.4% bf16 MFU | 125344 tok/s step 9727/19560 | loss 3.394989 (-0.41z)| norm 0.2948 (+1.58z)| lr 4.37e-04 | 4178.22 ms | 32.3% bf16 MFU | 125351 tok/s step 9728/19560 | loss 3.361216 (-1.26z)| norm 0.2571 (+0.06z)| lr 4.37e-04 | 4200.31 ms | 32.1% bf16 MFU | 125324 tok/s step 9729/19560 | loss 3.422106 (+0.31z)| norm 0.2901 (+1.37z)| lr 4.37e-04 | 4168.56 ms | 32.4% bf16 MFU | 125347 tok/s step 9730/19560 | loss 3.358306 (-1.32z)| norm 0.2658 (+0.39z)| lr 4.37e-04 | 5149.29 ms | 26.2% bf16 MFU | 124170 tok/s step 9731/19560 | loss 3.417028 (+0.19z)| norm 0.2688 (+0.50z)| lr 4.37e-04 | 4180.07 ms | 32.3% bf16 MFU | 124233 tok/s step 9732/19560 | loss 3.407941 (-0.05z)| norm 0.2681 (+0.46z)| lr 
4.37e-04 | 4161.98 ms | 32.4% bf16 MFU | 124320 tok/s step 9733/19560 | loss 3.385387 (-0.62z)| norm 0.2610 (+0.18z)| lr 4.37e-04 | 4173.02 ms | 32.4% bf16 MFU | 124386 tok/s step 9734/19560 | loss 3.407547 (-0.05z)| norm 0.2663 (+0.38z)| lr 4.37e-04 | 4169.98 ms | 32.4% bf16 MFU | 124453 tok/s step 9735/19560 | loss 3.418459 (+0.23z)| norm 0.2701 (+0.53z)| lr 4.37e-04 | 4176.88 ms | 32.3% bf16 MFU | 124506 tok/s step 9736/19560 | loss 3.426488 (+0.43z)| norm 0.2585 (+0.20z)| lr 4.37e-04 | 4169.32 ms | 32.4% bf16 MFU | 124568 tok/s step 9737/19560 | loss 3.370030 (-1.01z)| norm 0.2632 (+0.48z)| lr 4.37e-04 | 4174.78 ms | 32.3% bf16 MFU | 124619 tok/s step 9738/19560 | loss 3.387036 (-0.57z)| norm 0.2484 (-0.42z)| lr 4.37e-04 | 4197.25 ms | 32.2% bf16 MFU | 124634 tok/s step 9739/19560 | loss 3.399064 (-0.25z)| norm 0.2718 (+0.98z)| lr 4.37e-04 | 4180.80 ms | 32.3% bf16 MFU | 124672 tok/s step 9740/19560 | loss 3.372466 (-0.92z)| norm 0.2817 (+1.56z)| lr 4.37e-04 | 4167.42 ms | 32.4% bf16 MFU | 124729 tok/s step 9741/19560 | loss 3.432225 (+0.59z)| norm 0.2202 (-2.07z)| lr 4.37e-04 | 4173.55 ms | 32.4% bf16 MFU | 124774 tok/s step 9742/19560 | loss 3.408204 (-0.04z)| norm 0.2527 (-0.17z)| lr 4.36e-04 | 4185.96 ms | 32.3% bf16 MFU | 124797 tok/s step 9743/19560 | loss 3.362820 (-1.21z)| norm 0.2478 (-0.46z)| lr 4.36e-04 | 4169.79 ms | 32.4% bf16 MFU | 124844 tok/s step 9744/19560 | loss 3.391589 (-0.46z)| norm 0.2403 (-0.91z)| lr 4.36e-04 | 4178.97 ms | 32.3% bf16 MFU | 124875 tok/s step 9745/19560 | loss 3.392135 (-0.44z)| norm 0.2542 (-0.09z)| lr 4.36e-04 | 4183.50 ms | 32.3% bf16 MFU | 124897 tok/s step 9746/19560 | loss 3.374952 (-0.87z)| norm 0.2355 (-1.20z)| lr 4.36e-04 | 4164.08 ms | 32.4% bf16 MFU | 124948 tok/s step 9747/19560 | loss 3.372703 (-0.92z)| norm 0.2406 (-0.90z)| lr 4.36e-04 | 4183.02 ms | 32.3% bf16 MFU | 124967 tok/s step 9748/19560 | loss 3.384874 (-0.60z)| norm 0.2713 (+0.92z)| lr 4.36e-04 | 4181.06 ms | 32.3% bf16 MFU | 124989 tok/s step 
9749/19560 | loss 3.463948 (+1.44z)| norm 0.2485 (-0.46z)| lr 4.36e-04 | 4160.48 ms | 32.5% bf16 MFU | 125040 tok/s step 9750/19560 | loss 3.335919 (-1.84z)| norm 0.2352 (-1.25z)| lr 4.36e-04 | 4161.52 ms | 32.4% bf16 MFU | 125087 tok/s val loss 3.384924 HellaSwag: 2916/10042 = 0.290380 step 9751/19560 | loss 3.378625 (-0.72z)| norm 0.2752 (+1.16z)| lr 4.36e-04 | 4225.76 ms | 32.0% bf16 MFU | 125037 tok/s step 9752/19560 | loss 3.345011 (-1.57z)| norm 0.2411 (-0.89z)| lr 4.36e-04 | 4168.87 ms | 32.4% bf16 MFU | 125073 tok/s step 9753/19560 | loss 3.403187 (-0.08z)| norm 0.2505 (-0.33z)| lr 4.36e-04 | 4183.69 ms | 32.3% bf16 MFU | 125085 tok/s step 9754/19560 | loss 3.397762 (-0.21z)| norm 0.2564 (+0.03z)| lr 4.36e-04 | 4168.66 ms | 32.4% bf16 MFU | 125119 tok/s step 9755/19560 | loss 3.429389 (+0.60z)| norm 0.2478 (-0.48z)| lr 4.36e-04 | 4236.92 ms | 31.9% bf16 MFU | 125050 tok/s step 9756/19560 | loss 3.386478 (-0.52z)| norm 0.2578 (+0.13z)| lr 4.35e-04 | 4190.60 ms | 32.2% bf16 MFU | 125053 tok/s step 9757/19560 | loss 3.367418 (-1.00z)| norm 0.2488 (-0.42z)| lr 4.35e-04 | 4205.44 ms | 32.1% bf16 MFU | 125034 tok/s step 9758/19560 | loss 3.439511 (+0.88z)| norm 0.2396 (-0.95z)| lr 4.35e-04 | 4212.55 ms | 32.1% bf16 MFU | 125005 tok/s step 9759/19560 | loss 3.421062 (+0.43z)| norm 0.2558 (+0.02z)| lr 4.35e-04 | 4212.34 ms | 32.1% bf16 MFU | 124978 tok/s step 9760/19560 | loss 3.472203 (+1.77z)| norm 0.2230 (-1.91z)| lr 4.35e-04 | 4281.88 ms | 31.5% bf16 MFU | 124852 tok/s step 9761/19560 | loss 3.327826 (-2.03z)| norm 0.2485 (-0.40z)| lr 4.35e-04 | 4265.30 ms | 31.7% bf16 MFU | 124755 tok/s step 9762/19560 | loss 3.373184 (-0.88z)| norm 0.2498 (-0.31z)| lr 4.35e-04 | 4165.37 ms | 32.4% bf16 MFU | 124811 tok/s step 9763/19560 | loss 3.361660 (-1.17z)| norm 0.2457 (-0.54z)| lr 4.35e-04 | 4214.68 ms | 32.0% bf16 MFU | 124790 tok/s step 9764/19560 | loss 3.369885 (-0.94z)| norm 0.2412 (-0.81z)| lr 4.35e-04 | 4168.11 ms | 32.4% bf16 MFU | 124840 tok/s step 9765/19560 | loss 3.373694 (-0.82z)| norm 0.2417 (-0.78z)| lr 4.35e-04 | 
4173.83 ms | 32.3% bf16 MFU | 124878 tok/s step 9766/19560 | loss 3.433747 (+0.77z)| norm 0.2369 (-1.05z)| lr 4.35e-04 | 4183.16 ms | 32.3% bf16 MFU | 124901 tok/s step 9767/19560 | loss 3.437217 (+0.85z)| norm 0.2485 (-0.37z)| lr 4.35e-04 | 4171.79 ms | 32.4% bf16 MFU | 124940 tok/s step 9768/19560 | loss 3.387239 (-0.47z)| norm 0.2484 (-0.37z)| lr 4.35e-04 | 4175.40 ms | 32.3% bf16 MFU | 124971 tok/s step 9769/19560 | loss 3.405795 (+0.05z)| norm 0.2366 (-1.07z)| lr 4.35e-04 | 4174.85 ms | 32.3% bf16 MFU | 125002 tok/s step 9770/19560 | loss 3.397348 (-0.18z)| norm 0.2339 (-1.21z)| lr 4.35e-04 | 4169.84 ms | 32.4% bf16 MFU | 125038 tok/s step 9771/19560 | loss 3.398935 (-0.14z)| norm 0.2572 (+0.18z)| lr 4.34e-04 | 4175.79 ms | 32.3% bf16 MFU | 125064 tok/s step 9772/19560 | loss 3.410437 (+0.18z)| norm 0.2275 (-1.57z)| lr 4.34e-04 | 4189.68 ms | 32.2% bf16 MFU | 125068 tok/s step 9773/19560 | loss 3.384856 (-0.53z)| norm 0.2338 (-1.18z)| lr 4.34e-04 | 4162.47 ms | 32.4% bf16 MFU | 125112 tok/s step 9774/19560 | loss 3.439481 (+0.98z)| norm 0.2356 (-1.08z)| lr 4.34e-04 | 4172.86 ms | 32.4% bf16 MFU | 125139 tok/s step 9775/19560 | loss 3.383699 (-0.56z)| norm 0.2353 (-1.08z)| lr 4.34e-04 | 4171.48 ms | 32.4% bf16 MFU | 125166 tok/s step 9776/19560 | loss 3.407816 (+0.10z)| norm 0.2370 (-0.96z)| lr 4.34e-04 | 4249.41 ms | 31.8% bf16 MFU | 125077 tok/s step 9777/19560 | loss 3.366147 (-1.03z)| norm 0.2566 (+0.20z)| lr 4.34e-04 | 4180.83 ms | 32.3% bf16 MFU | 125093 tok/s step 9778/19560 | loss 3.438866 (+0.96z)| norm 0.2391 (-0.85z)| lr 4.34e-04 | 4166.02 ms | 32.4% bf16 MFU | 125131 tok/s step 9779/19560 | loss 3.391432 (-0.35z)| norm 0.2410 (-0.73z)| lr 4.34e-04 | 4177.17 ms | 32.3% bf16 MFU | 125150 tok/s step 9780/19560 | loss 3.404484 (+0.01z)| norm 0.2432 (-0.60z)| lr 4.34e-04 | 4177.32 ms | 32.3% bf16 MFU | 125168 tok/s step 9781/19560 | loss 3.391963 (-0.34z)| norm 0.2675 (+0.84z)| lr 4.34e-04 | 4237.91 ms | 31.9% bf16 MFU | 125095 tok/s step 9782/19560 | 
loss 3.352225 (-1.41z)| norm 0.2765 (+1.35z)| lr 4.34e-04 | 4253.81 ms | 31.7% bf16 MFU | 125003 tok/s step 9783/19560 | loss 3.426414 (+0.63z)| norm 0.2609 (+0.43z)| lr 4.34e-04 | 4162.43 ms | 32.4% bf16 MFU | 125050 tok/s step 9784/19560 | loss 3.444377 (+1.10z)| norm 0.2805 (+1.55z)| lr 4.34e-04 | 4164.06 ms | 32.4% bf16 MFU | 125093 tok/s step 9785/19560 | loss 3.381978 (-0.63z)| norm 0.2562 (+0.13z)| lr 4.34e-04 | 4169.64 ms | 32.4% bf16 MFU | 125126 tok/s step 9786/19560 | loss 3.434459 (+0.82z)| norm 0.2872 (+1.90z)| lr 4.33e-04 | 4184.94 ms | 32.3% bf16 MFU | 125133 tok/s step 9787/19560 | loss 3.375793 (-0.80z)| norm 0.2640 (+0.55z)| lr 4.33e-04 | 4170.62 ms | 32.4% bf16 MFU | 125162 tok/s step 9788/19560 | loss 3.389519 (-0.41z)| norm 0.3037 (+2.74z)| lr 4.33e-04 | 4170.18 ms | 32.4% bf16 MFU | 125190 tok/s step 9789/19560 | loss 3.349745 (-1.49z)| norm 0.2451 (-0.56z)| lr 4.33e-04 | 4174.02 ms | 32.3% bf16 MFU | 125211 tok/s step 9790/19560 | loss 3.386695 (-0.48z)| norm 0.2549 (+0.01z)| lr 4.33e-04 | 4159.91 ms | 32.5% bf16 MFU | 125252 tok/s step 9791/19560 | loss 3.333848 (-1.90z)| norm 0.2426 (-0.69z)| lr 4.33e-04 | 4164.04 ms | 32.4% bf16 MFU | 125285 tok/s step 9792/19560 | loss 3.425283 (+0.59z)| norm 0.2618 (+0.39z)| lr 4.33e-04 | 4163.52 ms | 32.4% bf16 MFU | 125317 tok/s step 9793/19560 | loss 3.373907 (-0.81z)| norm 0.2616 (+0.38z)| lr 4.33e-04 | 4183.18 ms | 32.3% bf16 MFU | 125318 tok/s step 9794/19560 | loss 3.411384 (+0.20z)| norm 0.2802 (+1.41z)| lr 4.33e-04 | 4173.54 ms | 32.4% bf16 MFU | 125333 tok/s step 9795/19560 | loss 3.398595 (-0.14z)| norm 0.2517 (-0.19z)| lr 4.33e-04 | 4180.11 ms | 32.3% bf16 MFU | 125338 tok/s step 9796/19560 | loss 3.406854 (+0.10z)| norm 0.2375 (-0.97z)| lr 4.33e-04 | 4170.10 ms | 32.4% bf16 MFU | 125357 tok/s step 9797/19560 | loss 3.373497 (-0.81z)| norm 0.2547 (+0.01z)| lr 4.33e-04 | 4179.47 ms | 32.3% bf16 MFU | 125361 tok/s step 9798/19560 | loss 3.381175 (-0.60z)| norm 0.2444 (-0.57z)| lr 4.33e-04 | 
4181.39 ms | 32.3% bf16 MFU | 125363 tok/s step 9799/19560 | loss 3.448108 (+1.23z)| norm 0.2422 (-0.68z)| lr 4.33e-04 | 4174.14 ms | 32.3% bf16 MFU | 125375 tok/s step 9800/19560 | loss 3.367240 (-0.97z)| norm 0.2724 (+1.03z)| lr 4.33e-04 | 4173.56 ms | 32.4% bf16 MFU | 125387 tok/s step 9801/19560 | loss 3.377803 (-0.67z)| norm 0.2439 (-0.59z)| lr 4.32e-04 | 4166.05 ms | 32.4% bf16 MFU | 125410 tok/s step 9802/19560 | loss 3.431191 (+0.78z)| norm 0.2558 (+0.09z)| lr 4.32e-04 | 4175.59 ms | 32.3% bf16 MFU | 125417 tok/s step 9803/19560 | loss 3.440873 (+1.03z)| norm 0.2776 (+1.33z)| lr 4.32e-04 | 4165.98 ms | 32.4% bf16 MFU | 125439 tok/s step 9804/19560 | loss 3.381601 (-0.59z)| norm 0.2418 (-0.72z)| lr 4.32e-04 | 4186.08 ms | 32.3% bf16 MFU | 125429 tok/s step 9805/19560 | loss 3.365891 (-1.02z)| norm 0.2797 (+1.43z)| lr 4.32e-04 | 4173.52 ms | 32.4% bf16 MFU | 125439 tok/s step 9806/19560 | loss 3.431659 (+0.83z)| norm 0.2567 (+0.12z)| lr 4.32e-04 | 4173.01 ms | 32.4% bf16 MFU | 125449 tok/s step 9807/19560 | loss 3.426351 (+0.69z)| norm 0.2636 (+0.50z)| lr 4.32e-04 | 4173.49 ms | 32.4% bf16 MFU | 125458 tok/s step 9808/19560 | loss 3.403162 (+0.03z)| norm 0.2603 (+0.31z)| lr 4.32e-04 | 4188.71 ms | 32.2% bf16 MFU | 125443 tok/s step 9809/19560 | loss 3.365496 (-1.03z)| norm 0.2580 (+0.17z)| lr 4.32e-04 | 4177.30 ms | 32.3% bf16 MFU | 125446 tok/s step 9810/19560 | loss 3.468178 (+1.86z)| norm 0.2668 (+0.66z)| lr 4.32e-04 | 4226.05 ms | 31.9% bf16 MFU | 125377 tok/s step 9811/19560 | loss 3.391906 (-0.28z)| norm 0.2519 (-0.19z)| lr 4.32e-04 | 4175.67 ms | 32.3% bf16 MFU | 125386 tok/s step 9812/19560 | loss 3.396440 (-0.16z)| norm 0.2598 (+0.25z)| lr 4.32e-04 | 4171.61 ms | 32.4% bf16 MFU | 125401 tok/s step 9813/19560 | loss 3.431541 (+0.87z)| norm 0.2738 (+1.04z)| lr 4.32e-04 | 4191.72 ms | 32.2% bf16 MFU | 125385 tok/s step 9814/19560 | loss 3.364928 (-1.05z)| norm 0.2540 (-0.10z)| lr 4.32e-04 | 4176.86 ms | 32.3% bf16 MFU | 125392 tok/s step 9815/19560 | 
loss 3.406824 (+0.16z)| norm 0.2658 (+0.58z)| lr 4.32e-04 | 4173.97 ms | 32.3% bf16 MFU | 125402 tok/s step 9816/19560 | loss 3.418581 (+0.49z)| norm 0.2689 (+0.74z)| lr 4.31e-04 | 4171.60 ms | 32.4% bf16 MFU | 125416 tok/s step 9817/19560 | loss 3.348094 (-1.53z)| norm 0.2544 (-0.09z)| lr 4.31e-04 | 4175.56 ms | 32.3% bf16 MFU | 125424 tok/s step 9818/19560 | loss 3.416651 (+0.43z)| norm 0.2770 (+1.19z)| lr 4.31e-04 | 4160.05 ms | 32.5% bf16 MFU | 125454 tok/s step 9819/19560 | loss 3.385015 (-0.48z)| norm 0.2328 (-1.33z)| lr 4.31e-04 | 4176.82 ms | 32.3% bf16 MFU | 125457 tok/s step 9820/19560 | loss 3.419100 (+0.51z)| norm 0.2772 (+1.19z)| lr 4.31e-04 | 4174.35 ms | 32.3% bf16 MFU | 125464 tok/s step 9821/19560 | loss 3.392916 (-0.24z)| norm 0.2591 (+0.16z)| lr 4.31e-04 | 4159.30 ms | 32.5% bf16 MFU | 125494 tok/s step 9822/19560 | loss 3.343638 (-1.66z)| norm 0.2387 (-1.01z)| lr 4.31e-04 | 4172.23 ms | 32.4% bf16 MFU | 125502 tok/s step 9823/19560 | loss 3.352270 (-1.39z)| norm 0.2616 (+0.30z)| lr 4.31e-04 | 4165.10 ms | 32.4% bf16 MFU | 125521 tok/s step 9824/19560 | loss 3.405143 (+0.17z)| norm 0.2468 (-0.55z)| lr 4.31e-04 | 4178.22 ms | 32.3% bf16 MFU | 125519 tok/s step 9825/19560 | loss 3.530257 (+3.65z)| norm 0.2676 (+0.64z)| lr 4.31e-04 | 4179.24 ms | 32.3% bf16 MFU | 125515 tok/s step 9826/19560 | loss 3.409225 (+0.24z)| norm 0.2814 (+1.41z)| lr 4.31e-04 | 4167.97 ms | 32.4% bf16 MFU | 125529 tok/s step 9827/19560 | loss 3.400815 (+0.02z)| norm 0.2446 (-0.69z)| lr 4.31e-04 | 4172.10 ms | 32.4% bf16 MFU | 125536 tok/s step 9828/19560 | loss 3.397835 (-0.06z)| norm 0.2589 (+0.13z)| lr 4.31e-04 | 4197.50 ms | 32.2% bf16 MFU | 125504 tok/s step 9829/19560 | loss 3.368759 (-0.89z)| norm 0.2388 (-1.02z)| lr 4.31e-04 | 4192.87 ms | 32.2% bf16 MFU | 125481 tok/s step 9830/19560 | loss 3.399475 (-0.01z)| norm 0.2732 (+0.94z)| lr 4.31e-04 | 4171.79 ms | 32.4% bf16 MFU | 125491 tok/s step 9831/19560 | loss 3.392524 (-0.21z)| norm 0.2622 (+0.32z)| lr 4.30e-04 | 
4559.68 ms | 29.6% bf16 MFU | 124966 tok/s step 9832/19560 | loss 3.397085 (-0.08z)| norm 0.2472 (-0.54z)| lr 4.30e-04 | 4599.17 ms | 29.4% bf16 MFU | 124417 tok/s step 9833/19560 | loss 3.383778 (-0.44z)| norm 0.2630 (+0.36z)| lr 4.30e-04 | 4704.38 ms | 28.7% bf16 MFU | 123769 tok/s step 9834/19560 | loss 3.368043 (-0.89z)| norm 0.2219 (-1.95z)| lr 4.30e-04 | 4647.27 ms | 29.1% bf16 MFU | 123221 tok/s step 9835/19560 | loss 3.443845 (+1.26z)| norm 0.2567 (+0.01z)| lr 4.30e-04 | 4241.49 ms | 31.8% bf16 MFU | 123240 tok/s step 9836/19560 | loss 3.382133 (-0.51z)| norm 0.2593 (+0.15z)| lr 4.30e-04 | 4546.88 ms | 29.7% bf16 MFU | 122844 tok/s step 9837/19560 | loss 3.427735 (+0.79z)| norm 0.2924 (+1.98z)| lr 4.30e-04 | 4288.90 ms | 31.5% bf16 MFU | 122814 tok/s step 9838/19560 | loss 3.388201 (-0.34z)| norm 0.2617 (+0.26z)| lr 4.30e-04 | 4395.42 ms | 30.7% bf16 MFU | 122637 tok/s step 9839/19560 | loss 3.386376 (-0.39z)| norm 0.2517 (-0.31z)| lr 4.30e-04 | 4162.43 ms | 32.4% bf16 MFU | 122803 tok/s step 9840/19560 | loss 3.395263 (-0.14z)| norm 0.3239 (+3.55z)| lr 4.30e-04 | 4271.68 ms | 31.6% bf16 MFU | 122800 tok/s step 9841/19560 | loss 3.355498 (-1.27z)| norm 0.3001 (+2.23z)| lr 4.30e-04 | 4351.15 ms | 31.0% bf16 MFU | 122684 tok/s step 9842/19560 | loss 3.389587 (-0.27z)| norm 0.2565 (-0.07z)| lr 4.30e-04 | 4216.28 ms | 32.0% bf16 MFU | 122768 tok/s step 9843/19560 | loss 3.417750 (+0.58z)| norm 0.2789 (+1.10z)| lr 4.30e-04 | 4170.32 ms | 32.4% bf16 MFU | 122915 tok/s step 9844/19560 | loss 3.338387 (-1.81z)| norm 0.2870 (+1.50z)| lr 4.30e-04 | 4185.21 ms | 32.3% bf16 MFU | 123033 tok/s step 9845/19560 | loss 3.403064 (+0.14z)| norm 0.2472 (-0.57z)| lr 4.29e-04 | 4173.72 ms | 32.3% bf16 MFU | 123162 tok/s step 9846/19560 | loss 3.389248 (-0.28z)| norm 0.2743 (+0.83z)| lr 4.29e-04 | 4186.73 ms | 32.2% bf16 MFU | 123265 tok/s step 9847/19560 | loss 3.362659 (-1.08z)| norm 0.2501 (-0.44z)| lr 4.29e-04 | 4185.34 ms | 32.3% bf16 MFU | 123365 tok/s step 9848/19560 | 
loss 3.484339 (+2.54z)| norm 0.2569 (-0.09z)| lr 4.29e-04 | 4274.92 ms | 31.6% bf16 MFU | 123329 tok/s step 9849/19560 | loss 3.315999 (-2.40z)| norm 0.2323 (-1.36z)| lr 4.29e-04 | 4300.08 ms | 31.4% bf16 MFU | 123259 tok/s step 9850/19560 | loss 3.367952 (-0.86z)| norm 0.2364 (-1.12z)| lr 4.29e-04 | 4285.55 ms | 31.5% bf16 MFU | 123213 tok/s step 9851/19560 | loss 3.364441 (-0.95z)| norm 0.2436 (-0.74z)| lr 4.29e-04 | 4287.71 ms | 31.5% bf16 MFU | 123166 tok/s step 9852/19560 | loss 3.428439 (+0.99z)| norm 0.2317 (-1.41z)| lr 4.29e-04 | 4230.13 ms | 31.9% bf16 MFU | 123205 tok/s step 9853/19560 | loss 3.477572 (+2.40z)| norm 0.2371 (-1.09z)| lr 4.29e-04 | 4245.49 ms | 31.8% bf16 MFU | 123219 tok/s step 9854/19560 | loss 3.328247 (-1.98z)| norm 0.2368 (-1.09z)| lr 4.29e-04 | 4148.04 ms | 32.5% bf16 MFU | 123378 tok/s step 9855/19560 | loss 3.368705 (-0.79z)| norm 0.2628 (+0.39z)| lr 4.29e-04 | 4140.55 ms | 32.6% bf16 MFU | 123540 tok/s step 9856/19560 | loss 3.357843 (-1.10z)| norm 0.2369 (-1.08z)| lr 4.29e-04 | 4147.19 ms | 32.6% bf16 MFU | 123684 tok/s step 9857/19560 | loss 3.377540 (-0.52z)| norm 0.2383 (-0.98z)| lr 4.29e-04 | 4143.58 ms | 32.6% bf16 MFU | 123827 tok/s step 9858/19560 | loss 3.386609 (-0.27z)| norm 0.2428 (-0.72z)| lr 4.29e-04 | 4153.30 ms | 32.5% bf16 MFU | 123947 tok/s step 9859/19560 | loss 3.444551 (+1.41z)| norm 0.2421 (-0.75z)| lr 4.29e-04 | 4226.53 ms | 31.9% bf16 MFU | 123952 tok/s step 9860/19560 | loss 3.386653 (-0.26z)| norm 0.2768 (+1.25z)| lr 4.28e-04 | 4216.24 ms | 32.0% bf16 MFU | 123972 tok/s step 9861/19560 | loss 3.390446 (-0.16z)| norm 0.2692 (+0.81z)| lr 4.28e-04 | 4146.54 ms | 32.6% bf16 MFU | 124095 tok/s step 9862/19560 | loss 3.403846 (+0.24z)| norm 0.2502 (-0.27z)| lr 4.28e-04 | 4151.02 ms | 32.5% bf16 MFU | 124206 tok/s step 9863/19560 | loss 3.370665 (-0.72z)| norm 0.2811 (+1.49z)| lr 4.28e-04 | 4145.62 ms | 32.6% bf16 MFU | 124319 tok/s step 9864/19560 | loss 3.385600 (-0.28z)| norm 0.2543 (-0.04z)| lr 4.28e-04 | 
4137.57 ms | 32.6% bf16 MFU | 124439 tok/s step 9865/19560 | loss 3.376233 (-0.55z)| norm 0.2582 (+0.18z)| lr 4.28e-04 | 4147.35 ms | 32.6% bf16 MFU | 124537 tok/s step 9866/19560 | loss 3.367979 (-0.79z)| norm 0.2431 (-0.68z)| lr 4.28e-04 | 4170.66 ms | 32.4% bf16 MFU | 124596 tok/s step 9867/19560 | loss 3.473851 (+2.24z)| norm 0.2451 (-0.55z)| lr 4.28e-04 | 4150.15 ms | 32.5% bf16 MFU | 124683 tok/s step 9868/19560 | loss 3.437052 (+1.17z)| norm 0.2368 (-1.01z)| lr 4.28e-04 | 4150.76 ms | 32.5% bf16 MFU | 124764 tok/s step 9869/19560 | loss 3.377072 (-0.53z)| norm 0.2505 (-0.24z)| lr 4.28e-04 | 4195.39 ms | 32.2% bf16 MFU | 124774 tok/s step 9870/19560 | loss 3.376507 (-0.54z)| norm 0.2288 (-1.49z)| lr 4.28e-04 | 4301.12 ms | 31.4% bf16 MFU | 124630 tok/s step 9871/19560 | loss 3.442758 (+1.33z)| norm 0.2700 (+0.89z)| lr 4.28e-04 | 4167.06 ms | 32.4% bf16 MFU | 124690 tok/s step 9872/19560 | loss 3.411302 (+0.43z)| norm 0.2824 (+1.58z)| lr 4.28e-04 | 4155.55 ms | 32.5% bf16 MFU | 124764 tok/s step 9873/19560 | loss 3.342578 (-1.50z)| norm 0.2866 (+1.78z)| lr 4.28e-04 | 4180.78 ms | 32.3% bf16 MFU | 124796 tok/s step 9874/19560 | loss 3.401373 (+0.15z)| norm 0.2767 (+1.20z)| lr 4.28e-04 | 4151.69 ms | 32.5% bf16 MFU | 124870 tok/s step 9875/19560 | loss 3.411094 (+0.42z)| norm 0.2344 (-1.19z)| lr 4.27e-04 | 4149.30 ms | 32.5% bf16 MFU | 124944 tok/s step 9876/19560 | loss 3.432846 (+1.02z)| norm 0.2618 (+0.36z)| lr 4.27e-04 | 4165.89 ms | 32.4% bf16 MFU | 124990 tok/s step 9877/19560 | loss 3.396908 (+0.02z)| norm 0.2648 (+0.53z)| lr 4.27e-04 | 4157.10 ms | 32.5% bf16 MFU | 125046 tok/s step 9878/19560 | loss 3.343654 (-1.50z)| norm 0.2608 (+0.29z)| lr 4.27e-04 | 4155.46 ms | 32.5% bf16 MFU | 125102 tok/s step 9879/19560 | loss 3.370411 (-0.73z)| norm 0.2371 (-1.04z)| lr 4.27e-04 | 4151.11 ms | 32.5% bf16 MFU | 125162 tok/s step 9880/19560 | loss 3.298122 (-2.73z)| norm 0.2613 (+0.33z)| lr 4.27e-04 | 4180.58 ms | 32.3% bf16 MFU | 125174 tok/s step 9881/19560 | 
loss 3.397674 (+0.06z)| norm 0.2495 (-0.35z)| lr 4.27e-04 | 4145.26 ms | 32.6% bf16 MFU | 125240 tok/s step 9882/19560 | loss 3.400961 (+0.15z)| norm 0.2436 (-0.68z)| lr 4.27e-04 | 4159.55 ms | 32.5% bf16 MFU | 125280 tok/s step 9883/19560 | loss 3.380581 (-0.41z)| norm 0.2444 (-0.63z)| lr 4.27e-04 | 4150.70 ms | 32.5% bf16 MFU | 125332 tok/s step 9884/19560 | loss 3.363319 (-0.89z)| norm 0.2625 (+0.40z)| lr 4.27e-04 | 4155.16 ms | 32.5% bf16 MFU | 125374 tok/s step 9885/19560 | loss 3.437283 (+1.16z)| norm 0.2598 (+0.24z)| lr 4.27e-04 | 4146.78 ms | 32.6% bf16 MFU | 125427 tok/s step 9886/19560 | loss 3.440145 (+1.24z)| norm 0.2772 (+1.21z)| lr 4.27e-04 | 4145.85 ms | 32.6% bf16 MFU | 125479 tok/s step 9887/19560 | loss 3.393090 (-0.07z)| norm 0.2683 (+0.70z)| lr 4.27e-04 | 4224.11 ms | 32.0% bf16 MFU | 125411 tok/s step 9888/19560 | loss 3.480590 (+2.36z)| norm 0.2872 (+1.74z)| lr 4.27e-04 | 4159.44 ms | 32.5% bf16 MFU | 125442 tok/s step 9889/19560 | loss 3.410907 (+0.41z)| norm 0.2915 (+1.94z)| lr 4.27e-04 | 4147.73 ms | 32.6% bf16 MFU | 125490 tok/s step 9890/19560 | loss 3.486655 (+2.47z)| norm 0.2819 (+1.38z)| lr 4.26e-04 | 4154.07 ms | 32.5% bf16 MFU | 125526 tok/s step 9891/19560 | loss 3.402637 (+0.14z)| norm 0.2632 (+0.34z)| lr 4.26e-04 | 4147.68 ms | 32.6% bf16 MFU | 125570 tok/s step 9892/19560 | loss 3.400248 (+0.07z)| norm 0.3084 (+2.74z)| lr 4.26e-04 | 4156.13 ms | 32.5% bf16 MFU | 125599 tok/s step 9893/19560 | loss 3.368032 (-0.82z)| norm 0.2800 (+1.19z)| lr 4.26e-04 | 4153.39 ms | 32.5% bf16 MFU | 125631 tok/s step 9894/19560 | loss 3.428121 (+0.85z)| norm 0.2730 (+0.80z)| lr 4.26e-04 | 4154.02 ms | 32.5% bf16 MFU | 125660 tok/s step 9895/19560 | loss 3.372697 (-0.68z)| norm 0.2497 (-0.47z)| lr 4.26e-04 | 4159.66 ms | 32.5% bf16 MFU | 125679 tok/s step 9896/19560 | loss 3.426853 (+0.82z)| norm 0.2520 (-0.34z)| lr 4.26e-04 | 4155.60 ms | 32.5% bf16 MFU | 125703 tok/s step 9897/19560 | loss 3.399477 (+0.06z)| norm 0.2455 (-0.70z)| lr 4.26e-04 | 
4148.63 ms | 32.5% bf16 MFU | 125737 tok/s step 9898/19560 | loss 3.348473 (-1.33z)| norm 0.2673 (+0.47z)| lr 4.26e-04 | 4153.00 ms | 32.5% bf16 MFU | 125762 tok/s step 9899/19560 | loss 3.479464 (+2.21z)| norm 0.2493 (-0.51z)| lr 4.26e-04 | 4150.21 ms | 32.5% bf16 MFU | 125790 tok/s step 9900/19560 | loss 3.412960 (+0.41z)| norm 0.2624 (+0.19z)| lr 4.26e-04 | 4150.34 ms | 32.5% bf16 MFU | 125817 tok/s step 9901/19560 | loss 3.351478 (-1.23z)| norm 0.2232 (-1.95z)| lr 4.26e-04 | 4156.97 ms | 32.5% bf16 MFU | 125832 tok/s step 9902/19560 | loss 3.438566 (+1.11z)| norm 0.2609 (+0.11z)| lr 4.26e-04 | 4149.24 ms | 32.5% bf16 MFU | 125859 tok/s step 9903/19560 | loss 3.304654 (-2.42z)| norm 0.2470 (-0.67z)| lr 4.26e-04 | 4147.45 ms | 32.6% bf16 MFU | 125886 tok/s step 9904/19560 | loss 3.333102 (-1.64z)| norm 0.2408 (-1.02z)| lr 4.26e-04 | 4160.35 ms | 32.5% bf16 MFU | 125893 tok/s step 9905/19560 | loss 3.363526 (-0.85z)| norm 0.2581 (-0.06z)| lr 4.25e-04 | 4167.12 ms | 32.4% bf16 MFU | 125889 tok/s step 9906/19560 | loss 3.427349 (+0.82z)| norm 0.2632 (+0.21z)| lr 4.25e-04 | 4160.02 ms | 32.5% bf16 MFU | 125896 tok/s step 9907/19560 | loss 3.424189 (+0.72z)| norm 0.3170 (+3.08z)| lr 4.25e-04 | 4163.14 ms | 32.4% bf16 MFU | 125898 tok/s step 9908/19560 | loss 3.392205 (-0.10z)| norm 0.2472 (-0.69z)| lr 4.25e-04 | 4156.96 ms | 32.5% bf16 MFU | 125909 tok/s step 9909/19560 | loss 3.335762 (-1.54z)| norm 0.2863 (+1.40z)| lr 4.25e-04 | 4158.54 ms | 32.5% bf16 MFU | 125918 tok/s step 9910/19560 | loss 3.413588 (+0.45z)| norm 0.2465 (-0.72z)| lr 4.25e-04 | 4163.09 ms | 32.4% bf16 MFU | 125919 tok/s step 9911/19560 | loss 3.376480 (-0.50z)| norm 0.2766 (+0.89z)| lr 4.25e-04 | 4161.56 ms | 32.4% bf16 MFU | 125922 tok/s step 9912/19560 | loss 3.330424 (-1.66z)| norm 0.2428 (-0.91z)| lr 4.25e-04 | 4159.03 ms | 32.5% bf16 MFU | 125929 tok/s step 9913/19560 | loss 3.482543 (+2.19z)| norm 0.2638 (+0.22z)| lr 4.25e-04 | 4162.95 ms | 32.4% bf16 MFU | 125929 tok/s step 9914/19560 | 
loss 3.386038 (-0.24z)| norm 0.2559 (-0.20z)| lr 4.25e-04 | 4167.53 ms | 32.4% bf16 MFU | 125923 tok/s step 9915/19560 | loss 3.390296 (-0.13z)| norm 0.2544 (-0.27z)| lr 4.25e-04 | 4158.61 ms | 32.5% bf16 MFU | 125931 tok/s step 9916/19560 | loss 3.521590 (+3.06z)| norm 0.2591 (+0.00z)| lr 4.25e-04 | 4159.35 ms | 32.5% bf16 MFU | 125937 tok/s step 9917/19560 | loss 3.429374 (+0.79z)| norm 0.2608 (+0.09z)| lr 4.25e-04 | 4170.81 ms | 32.4% bf16 MFU | 125925 tok/s step 9918/19560 | loss 3.369027 (-0.68z)| norm 0.2457 (-0.74z)| lr 4.25e-04 | 4160.01 ms | 32.5% bf16 MFU | 125930 tok/s step 9919/19560 | loss 3.375501 (-0.54z)| norm 0.2309 (-1.55z)| lr 4.24e-04 | 4154.91 ms | 32.5% bf16 MFU | 125943 tok/s step 9920/19560 | loss 3.361663 (-0.87z)| norm 0.2621 (+0.17z)| lr 4.24e-04 | 4156.98 ms | 32.5% bf16 MFU | 125952 tok/s step 9921/19560 | loss 3.360230 (-0.90z)| norm 0.2519 (-0.39z)| lr 4.24e-04 | 4160.54 ms | 32.5% bf16 MFU | 125955 tok/s step 9922/19560 | loss 3.425817 (+0.71z)| norm 0.2592 (+0.02z)| lr 4.24e-04 | 4202.62 ms | 32.1% bf16 MFU | 125895 tok/s step 9923/19560 | loss 3.390062 (-0.16z)| norm 0.2478 (-0.61z)| lr 4.24e-04 | 4162.60 ms | 32.4% bf16 MFU | 125898 tok/s step 9924/19560 | loss 3.518964 (+2.88z)| norm 0.2828 (+1.30z)| lr 4.24e-04 | 4157.25 ms | 32.5% bf16 MFU | 125909 tok/s step 9925/19560 | loss 3.330360 (-1.57z)| norm 0.2593 (+0.01z)| lr 4.24e-04 | 4152.19 ms | 32.5% bf16 MFU | 125927 tok/s step 9926/19560 | loss 3.309432 (-2.02z)| norm 0.2726 (+0.73z)| lr 4.24e-04 | 4150.88 ms | 32.5% bf16 MFU | 125946 tok/s step 9927/19560 | loss 3.342436 (-1.24z)| norm 0.2601 (+0.03z)| lr 4.24e-04 | 4162.80 ms | 32.4% bf16 MFU | 125946 tok/s step 9928/19560 | loss 3.411211 (+0.35z)| norm 0.2486 (-0.59z)| lr 4.24e-04 | 4197.75 ms | 32.2% bf16 MFU | 125893 tok/s step 9929/19560 | loss 3.389692 (-0.15z)| norm 0.2585 (-0.05z)| lr 4.24e-04 | 4165.54 ms | 32.4% bf16 MFU | 125892 tok/s step 9930/19560 | loss 3.439495 (+1.00z)| norm 0.2634 (+0.22z)| lr 4.24e-04 | 
4162.49 ms | 32.4% bf16 MFU | 125895 tok/s step 9931/19560 | loss 3.315916 (-1.83z)| norm 0.2778 (+1.02z)| lr 4.24e-04 | 4151.98 ms | 32.5% bf16 MFU | 125914 tok/s step 9932/19560 | loss 3.352074 (-0.99z)| norm 0.2926 (+1.80z)| lr 4.24e-04 | 4168.45 ms | 32.4% bf16 MFU | 125907 tok/s step 9933/19560 | loss 3.372249 (-0.53z)| norm 0.2436 (-0.88z)| lr 4.24e-04 | 4168.86 ms | 32.4% bf16 MFU | 125900 tok/s step 9934/19560 | loss 3.356608 (-0.87z)| norm 0.2777 (+0.99z)| lr 4.23e-04 | 4183.02 ms | 32.3% bf16 MFU | 125872 tok/s step 9935/19560 | loss 3.395874 (+0.03z)| norm 0.2500 (-0.53z)| lr 4.23e-04 | 4158.89 ms | 32.5% bf16 MFU | 125881 tok/s step 9936/19560 | loss 3.333713 (-1.37z)| norm 0.2744 (+0.80z)| lr 4.23e-04 | 4172.19 ms | 32.4% bf16 MFU | 125870 tok/s step 9937/19560 | loss 3.574011 (+3.83z)| norm 0.2849 (+1.35z)| lr 4.23e-04 | 4149.67 ms | 32.5% bf16 MFU | 125894 tok/s step 9938/19560 | loss 3.337933 (-1.22z)| norm 0.2753 (+0.83z)| lr 4.23e-04 | 4151.73 ms | 32.5% bf16 MFU | 125913 tok/s step 9939/19560 | loss 3.345361 (-1.05z)| norm 0.2833 (+1.24z)| lr 4.23e-04 | 4152.07 ms | 32.5% bf16 MFU | 125931 tok/s step 9940/19560 | loss 3.382085 (-0.26z)| norm 0.2446 (-0.84z)| lr 4.23e-04 | 4152.80 ms | 32.5% bf16 MFU | 125947 tok/s step 9941/19560 | loss 3.382154 (-0.25z)| norm 0.2612 (+0.06z)| lr 4.23e-04 | 4152.19 ms | 32.5% bf16 MFU | 125963 tok/s step 9942/19560 | loss 3.364552 (-0.62z)| norm 0.2630 (+0.15z)| lr 4.23e-04 | 4167.86 ms | 32.4% bf16 MFU | 125955 tok/s step 9943/19560 | loss 3.413623 (+0.43z)| norm 0.2552 (-0.26z)| lr 4.23e-04 | 4160.40 ms | 32.5% bf16 MFU | 125958 tok/s step 9944/19560 | loss 3.433755 (+0.86z)| norm 0.2647 (+0.25z)| lr 4.23e-04 | 4163.49 ms | 32.4% bf16 MFU | 125956 tok/s step 9945/19560 | loss 3.375873 (-0.39z)| norm 0.2499 (-0.55z)| lr 4.23e-04 | 4163.23 ms | 32.4% bf16 MFU | 125955 tok/s step 9946/19560 | loss 3.414082 (+0.43z)| norm 0.2475 (-0.66z)| lr 4.23e-04 | 4163.31 ms | 32.4% bf16 MFU | 125954 tok/s step 9947/19560 | 
loss 3.405582 (+0.25z)| norm 0.2566 (-0.18z)| lr 4.23e-04 | 4191.62 ms | 32.2% bf16 MFU | 125910 tok/s step 9948/19560 | loss 3.361320 (-0.70z)| norm 0.2511 (-0.48z)| lr 4.23e-04 | 4156.00 ms | 32.5% bf16 MFU | 125922 tok/s step 9949/19560 | loss 3.370853 (-0.49z)| norm 0.2576 (-0.11z)| lr 4.22e-04 | 4164.14 ms | 32.4% bf16 MFU | 125921 tok/s step 9950/19560 | loss 3.351905 (-0.90z)| norm 0.2500 (-0.54z)| lr 4.22e-04 | 4151.65 ms | 32.5% bf16 MFU | 125940 tok/s step 9951/19560 | loss 3.443460 (+1.06z)| norm 0.2493 (-0.57z)| lr 4.22e-04 | 4158.79 ms | 32.5% bf16 MFU | 125946 tok/s step 9952/19560 | loss 3.330367 (-1.35z)| norm 0.2413 (-1.00z)| lr 4.22e-04 | 4162.68 ms | 32.4% bf16 MFU | 125946 tok/s step 9953/19560 | loss 3.457215 (+1.40z)| norm 0.2438 (-0.86z)| lr 4.22e-04 | 4154.34 ms | 32.5% bf16 MFU | 125959 tok/s step 9954/19560 | loss 3.405519 (+0.27z)| norm 0.2465 (-0.70z)| lr 4.22e-04 | 4161.43 ms | 32.4% bf16 MFU | 125960 tok/s step 9955/19560 | loss 3.492799 (+2.13z)| norm 0.2758 (+0.89z)| lr 4.22e-04 | 4162.25 ms | 32.4% bf16 MFU | 125961 tok/s step 9956/19560 | loss 3.347967 (-0.97z)| norm 0.2529 (-0.36z)| lr 4.22e-04 | 4162.38 ms | 32.4% bf16 MFU | 125960 tok/s step 9957/19560 | loss 3.350420 (-0.92z)| norm 0.2666 (+0.38z)| lr 4.22e-04 | 4159.30 ms | 32.5% bf16 MFU | 125965 tok/s step 9958/19560 | loss 3.371434 (-0.46z)| norm 0.2532 (-0.35z)| lr 4.22e-04 | 4163.73 ms | 32.4% bf16 MFU | 125963 tok/s step 9959/19560 | loss 3.412395 (+0.41z)| norm 0.2587 (-0.04z)| lr 4.22e-04 | 4168.62 ms | 32.4% bf16 MFU | 125953 tok/s step 9960/19560 | loss 3.305076 (-1.84z)| norm 0.2637 (+0.23z)| lr 4.22e-04 | 4151.94 ms | 32.5% bf16 MFU | 125969 tok/s step 9961/19560 | loss 3.571460 (+3.55z)| norm 0.2630 (+0.19z)| lr 4.22e-04 | 4389.31 ms | 30.8% bf16 MFU | 125643 tok/s step 9962/19560 | loss 3.430520 (+0.72z)| norm 0.2616 (+0.10z)| lr 4.22e-04 | 4161.39 ms | 32.4% bf16 MFU | 125660 tok/s step 9963/19560 | loss 3.355208 (-0.77z)| norm 0.2590 (-0.05z)| lr 4.22e-04 | 
4167.70 ms | 32.4% bf16 MFU | 125667 tok/s step 9964/19560 | loss 3.439263 (+0.90z)| norm 0.2590 (-0.05z)| lr 4.21e-04 | 4166.23 ms | 32.4% bf16 MFU | 125676 tok/s step 9965/19560 | loss 3.365165 (-0.57z)| norm 0.3129 (+2.90z)| lr 4.21e-04 | 4157.30 ms | 32.5% bf16 MFU | 125698 tok/s step 9966/19560 | loss 3.376397 (-0.34z)| norm 0.2426 (-0.95z)| lr 4.21e-04 | 4161.90 ms | 32.4% bf16 MFU | 125712 tok/s step 9967/19560 | loss 3.405640 (+0.24z)| norm 0.2495 (-0.57z)| lr 4.21e-04 | 4213.08 ms | 32.0% bf16 MFU | 125648 tok/s step 9968/19560 | loss 3.461795 (+1.34z)| norm 0.2475 (-0.68z)| lr 4.21e-04 | 4152.20 ms | 32.5% bf16 MFU | 125679 tok/s step 9969/19560 | loss 3.374268 (-0.40z)| norm 0.2388 (-1.17z)| lr 4.21e-04 | 4154.20 ms | 32.5% bf16 MFU | 125705 tok/s step 9970/19560 | loss 3.369187 (-0.50z)| norm 0.2401 (-1.08z)| lr 4.21e-04 | 4155.96 ms | 32.5% bf16 MFU | 125728 tok/s step 9971/19560 | loss 3.392097 (-0.04z)| norm 0.2371 (-1.24z)| lr 4.21e-04 | 4163.25 ms | 32.4% bf16 MFU | 125738 tok/s step 9972/19560 | loss 3.424482 (+0.59z)| norm 0.2397 (-1.07z)| lr 4.21e-04 | 4158.51 ms | 32.5% bf16 MFU | 125755 tok/s step 9973/19560 | loss 3.369071 (-0.51z)| norm 0.2284 (-1.71z)| lr 4.21e-04 | 4157.66 ms | 32.5% bf16 MFU | 125772 tok/s step 9974/19560 | loss 3.445556 (+1.00z)| norm 0.2645 (+0.39z)| lr 4.21e-04 | 4153.36 ms | 32.5% bf16 MFU | 125795 tok/s step 9975/19560 | loss 3.378426 (-0.33z)| norm 0.2651 (+0.42z)| lr 4.21e-04 | 4158.16 ms | 32.5% bf16 MFU | 125810 tok/s step 9976/19560 | loss 3.438607 (+0.88z)| norm 0.2623 (+0.25z)| lr 4.21e-04 | 4768.98 ms | 28.3% bf16 MFU | 125016 tok/s step 9977/19560 | loss 3.411706 (+0.33z)| norm 0.2374 (-1.20z)| lr 4.21e-04 | 4156.06 ms | 32.5% bf16 MFU | 125073 tok/s step 9978/19560 | loss 3.401070 (+0.11z)| norm 0.2774 (+1.11z)| lr 4.21e-04 | 4157.62 ms | 32.5% bf16 MFU | 125124 tok/s step 9979/19560 | loss 3.332913 (-1.26z)| norm 0.2495 (-0.52z)| lr 4.20e-04 | 4168.80 ms | 32.4% bf16 MFU | 125156 tok/s step 9980/19560 | 
loss 3.399962 (+0.10z)| norm 0.2570 (-0.09z)| lr 4.20e-04 | 4154.02 ms | 32.5% bf16 MFU | 125209 tok/s step 9981/19560 | loss 3.394990 (+0.01z)| norm 0.2497 (-0.53z)| lr 4.20e-04 | 4160.25 ms | 32.5% bf16 MFU | 125250 tok/s step 9982/19560 | loss 3.385853 (-0.19z)| norm 0.2675 (+0.51z)| lr 4.20e-04 | 4164.26 ms | 32.4% bf16 MFU | 125283 tok/s step 9983/19560 | loss 3.481682 (+1.74z)| norm 0.2595 (+0.04z)| lr 4.20e-04 | 4155.01 ms | 32.5% bf16 MFU | 125328 tok/s step 9984/19560 | loss 3.354914 (-0.83z)| norm 0.2788 (+1.17z)| lr 4.20e-04 | 4157.92 ms | 32.5% bf16 MFU | 125366 tok/s step 9985/19560 | loss 3.356142 (-0.80z)| norm 0.2567 (-0.16z)| lr 4.20e-04 | 4153.95 ms | 32.5% bf16 MFU | 125408 tok/s step 9986/19560 | loss 3.350175 (-0.92z)| norm 0.2693 (+0.59z)| lr 4.20e-04 | 4163.26 ms | 32.4% bf16 MFU | 125434 tok/s step 9987/19560 | loss 3.354399 (-0.82z)| norm 0.2643 (+0.28z)| lr 4.20e-04 | 4162.58 ms | 32.4% bf16 MFU | 125460 tok/s step 9988/19560 | loss 3.397825 (+0.06z)| norm 0.2998 (+2.37z)| lr 4.20e-04 | 4167.61 ms | 32.4% bf16 MFU | 125477 tok/s step 9989/19560 | loss 3.371913 (-0.46z)| norm 0.2463 (-0.80z)| lr 4.20e-04 | 4161.39 ms | 32.4% bf16 MFU | 125503 tok/s step 9990/19560 | loss 3.434725 (+0.80z)| norm 0.3151 (+3.13z)| lr 4.20e-04 | 4147.62 ms | 32.6% bf16 MFU | 125548 tok/s step 9991/19560 | loss 3.416839 (+0.43z)| norm 0.2545 (-0.32z)| lr 4.20e-04 | 4161.62 ms | 32.4% bf16 MFU | 125570 tok/s step 9992/19560 | loss 3.371831 (-0.47z)| norm 0.2673 (+0.41z)| lr 4.20e-04 | 4157.13 ms | 32.5% bf16 MFU | 125597 tok/s step 9993/19560 | loss 3.388365 (-0.14z)| norm 0.2963 (+2.03z)| lr 4.19e-04 | 4178.03 ms | 32.3% bf16 MFU | 125592 tok/s step 9994/19560 | loss 3.434619 (+0.78z)| norm 0.2717 (+0.63z)| lr 4.19e-04 | 4158.47 ms | 32.5% bf16 MFU | 125616 tok/s step 9995/19560 | loss 3.346181 (-0.99z)| norm 0.2598 (-0.06z)| lr 4.19e-04 | 4168.65 ms | 32.4% bf16 MFU | 125624 tok/s step 9996/19560 | loss 3.454354 (+1.20z)| norm 0.2508 (-0.57z)| lr 4.19e-04 | 
4167.25 ms | 32.4% bf16 MFU | 125633 tok/s step 9997/19560 | loss 3.386311 (-0.18z)| norm 0.2634 (+0.14z)| lr 4.19e-04 | 4160.17 ms | 32.5% bf16 MFU | 125653 tok/s step 9998/19560 | loss 3.297698 (-1.93z)| norm 0.2794 (+1.04z)| lr 4.19e-04 | 4161.80 ms | 32.4% bf16 MFU | 125669 tok/s step 9999/19560 | loss 3.393094 (-0.02z)| norm 0.2474 (-0.79z)| lr 4.19e-04 | 4143.92 ms | 32.6% bf16 MFU | 125711 tok/s step 10000/19560 | loss 3.345940 (-0.95z)| norm 0.2439 (-0.98z)| lr 4.19e-04 | 4155.29 ms | 32.5% bf16 MFU | 125734 tok/s val loss 3.382316 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 
evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 
evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2937/10042 = 0.292472 Writing checkpoint at step 10000 Writing model to log124M/model_00010000.bin Writing state to log124M/state_00010000_00000.bin step 10001/19560 | loss 3.381694 (-0.24z)| norm 0.2594 (-0.07z)| lr 4.19e-04 | 4241.42 ms | 31.8% bf16 MFU | 125628 tok/s step 10002/19560 | loss 3.363692 (-0.60z)| norm 0.2440 (-0.96z)| lr 4.19e-04 | 4233.08 ms | 31.9% bf16 MFU | 125540 tok/s step 10003/19560 | loss 3.352721 (-0.80z)| norm 0.2496 (-0.64z)| lr 4.19e-04 | 4143.78 ms | 32.6% bf16 MFU | 125589 tok/s step 10004/19560 | loss 3.368354 (-0.48z)| norm 0.2365 (-1.39z)| lr 4.19e-04 | 4180.73 ms | 32.3% bf16 MFU | 125580 tok/s step 10005/19560 | loss 3.372315 (-0.40z)| norm 0.2345 (-1.48z)| lr 4.19e-04 | 4227.44 ms | 31.9% bf16 MFU | 125502 tok/s step 10006/19560 | loss 3.401695 (+0.18z)| norm 0.2471 (-0.74z)| lr 4.19e-04 | 4140.94 ms | 32.6% bf16 MFU | 125557 tok/s step 10007/19560 | loss 3.455747 (+1.24z)| norm 0.2526 (-0.44z)| lr 4.19e-04 | 4215.67 ms | 32.0% bf16 MFU | 125498 tok/s step 10008/19560 | loss 3.504452 (+2.17z)| norm 0.2444 (-0.90z)| lr 4.18e-04 | 4155.36 ms | 32.5% bf16 MFU | 125531 tok/s step 10009/19560 | loss 3.423082 (+0.55z)| norm 0.2536 (-0.37z)| lr 4.18e-04 | 4201.20 ms | 32.1% bf16 MFU | 125495 tok/s step 10010/19560 | loss 3.302822 (-1.79z)| norm 0.2626 (+0.14z)| lr 4.18e-04 | 4249.94 ms | 31.8% bf16 MFU | 125388 tok/s step 10011/19560 | loss 3.351852 (-0.83z)| norm 0.2971 (+2.09z)| lr 4.18e-04 | 4147.82 ms | 32.6% bf16 MFU | 125439 tok/s step 10012/19560 
| loss 3.331565 (-1.21z)| norm 0.2601 (-0.03z)| lr 4.18e-04 | 4389.19 ms | 30.8% bf16 MFU | 125139 tok/s step 10013/19560 | loss 3.421262 (+0.53z)| norm 0.2528 (-0.44z)| lr 4.18e-04 | 4150.36 ms | 32.5% bf16 MFU | 125198 tok/s step 10014/19560 | loss 3.429288 (+0.69z)| norm 0.2892 (+1.62z)| lr 4.18e-04 | 4143.93 ms | 32.6% bf16 MFU | 125264 tok/s step 10015/19560 | loss 3.383892 (-0.19z)| norm 0.2435 (-0.96z)| lr 4.18e-04 | 4152.94 ms | 32.5% bf16 MFU | 125314 tok/s step 10016/19560 | loss 3.361306 (-0.62z)| norm 0.2792 (+1.07z)| lr 4.18e-04 | 4154.07 ms | 32.5% bf16 MFU | 125358 tok/s step 10017/19560 | loss 3.436392 (+0.85z)| norm 0.2432 (-0.96z)| lr 4.18e-04 | 4165.90 ms | 32.4% bf16 MFU | 125383 tok/s step 10018/19560 | loss 3.329433 (-1.23z)| norm 0.2675 (+0.44z)| lr 4.18e-04 | 4153.61 ms | 32.5% bf16 MFU | 125425 tok/s step 10019/19560 | loss 3.382020 (-0.19z)| norm 0.2570 (-0.16z)| lr 4.18e-04 | 4193.49 ms | 32.2% bf16 MFU | 125405 tok/s step 10020/19560 | loss 3.377083 (-0.28z)| norm 0.2517 (-0.46z)| lr 4.18e-04 | 4153.48 ms | 32.5% bf16 MFU | 125446 tok/s step 10021/19560 | loss 3.466866 (+1.47z)| norm 0.2409 (-1.08z)| lr 4.18e-04 | 4160.22 ms | 32.5% bf16 MFU | 125475 tok/s step 10022/19560 | loss 3.408392 (+0.32z)| norm 0.2695 (+0.62z)| lr 4.18e-04 | 4731.65 ms | 28.5% bf16 MFU | 124742 tok/s step 10023/19560 | loss 3.472829 (+1.56z)| norm 0.2362 (-1.34z)| lr 4.17e-04 | 4788.47 ms | 28.2% bf16 MFU | 123979 tok/s step 10024/19560 | loss 3.374011 (-0.36z)| norm 0.2613 (+0.14z)| lr 4.17e-04 | 5170.64 ms | 26.1% bf16 MFU | 122850 tok/s step 10025/19560 | loss 3.445479 (+1.02z)| norm 0.2550 (-0.24z)| lr 4.17e-04 | 4338.31 ms | 31.1% bf16 MFU | 122750 tok/s step 10026/19560 | loss 3.419379 (+0.51z)| norm 0.2992 (+2.32z)| lr 4.17e-04 | 4447.32 ms | 30.4% bf16 MFU | 122507 tok/s step 10027/19560 | loss 3.413765 (+0.41z)| norm 0.3452 (+4.54z)| lr 4.17e-04 | 4173.96 ms | 32.3% bf16 MFU | 122662 tok/s step 10028/19560 | loss 3.465438 (+1.41z)| norm 0.3422 (+4.06z)| 
lr 4.17e-04 | 4358.70 ms | 31.0% bf16 MFU | 122543 tok/s step 10029/19560 | loss 3.385908 (-0.15z)| norm 0.2750 (+0.70z)| lr 4.17e-04 | 4179.71 ms | 32.3% bf16 MFU | 122688 tok/s step 10030/19560 | loss 3.498178 (+2.01z)| norm 0.3166 (+2.70z)| lr 4.17e-04 | 4234.70 ms | 31.9% bf16 MFU | 122744 tok/s step 10031/19560 | loss 3.365222 (-0.57z)| norm 0.2717 (+0.49z)| lr 4.17e-04 | 4205.11 ms | 32.1% bf16 MFU | 122841 tok/s step 10032/19560 | loss 3.484526 (+1.72z)| norm 0.2900 (+1.36z)| lr 4.17e-04 | 4232.02 ms | 31.9% bf16 MFU | 122893 tok/s step 10033/19560 | loss 3.475190 (+1.52z)| norm 0.2721 (+0.48z)| lr 4.17e-04 | 4248.25 ms | 31.8% bf16 MFU | 122919 tok/s step 10034/19560 | loss 3.416733 (+0.39z)| norm 0.2874 (+1.21z)| lr 4.17e-04 | 4297.73 ms | 31.4% bf16 MFU | 122872 tok/s step 10035/19560 | loss 3.373259 (-0.44z)| norm 0.2543 (-0.38z)| lr 4.17e-04 | 4225.85 ms | 32.0% bf16 MFU | 122932 tok/s step 10036/19560 | loss 3.469289 (+1.39z)| norm 0.2771 (+0.75z)| lr 4.17e-04 | 4178.51 ms | 32.3% bf16 MFU | 123059 tok/s step 10037/19560 | loss 3.325925 (-1.35z)| norm 0.2380 (-1.18z)| lr 4.17e-04 | 4204.03 ms | 32.1% bf16 MFU | 123142 tok/s step 10038/19560 | loss 3.361051 (-0.67z)| norm 0.2605 (-0.07z)| lr 4.16e-04 | 4192.06 ms | 32.2% bf16 MFU | 123238 tok/s step 10039/19560 | loss 3.422323 (+0.49z)| norm 0.2403 (-1.06z)| lr 4.16e-04 | 4167.88 ms | 32.4% bf16 MFU | 123366 tok/s step 10040/19560 | loss 3.400492 (+0.07z)| norm 0.2541 (-0.38z)| lr 4.16e-04 | 4187.24 ms | 32.2% bf16 MFU | 123458 tok/s step 10041/19560 | loss 3.407978 (+0.22z)| norm 0.2556 (-0.30z)| lr 4.16e-04 | 4217.06 ms | 32.0% bf16 MFU | 123501 tok/s step 10042/19560 | loss 3.420909 (+0.47z)| norm 0.2482 (-0.67z)| lr 4.16e-04 | 4154.20 ms | 32.5% bf16 MFU | 123637 tok/s step 10043/19560 | loss 3.326014 (-1.35z)| norm 0.2676 (+0.29z)| lr 4.16e-04 | 4172.72 ms | 32.4% bf16 MFU | 123737 tok/s step 10044/19560 | loss 3.424079 (+0.56z)| norm 0.2445 (-0.85z)| lr 4.16e-04 | 4261.67 ms | 31.7% bf16 MFU | 
123702 tok/s step 10045/19560 | loss 3.364240 (-0.61z)| norm 0.2506 (-0.54z)| lr 4.16e-04 | 4170.08 ms | 32.4% bf16 MFU | 123803 tok/s step 10046/19560 | loss 3.412805 (+0.34z)| norm 0.2605 (-0.05z)| lr 4.16e-04 | 4199.08 ms | 32.2% bf16 MFU | 123855 tok/s step 10047/19560 | loss 3.418990 (+0.46z)| norm 0.2556 (-0.31z)| lr 4.16e-04 | 4162.84 ms | 32.4% bf16 MFU | 123960 tok/s step 10048/19560 | loss 3.389961 (-0.12z)| norm 0.2537 (-0.41z)| lr 4.16e-04 | 4166.67 ms | 32.4% bf16 MFU | 124053 tok/s step 10049/19560 | loss 3.421920 (+0.50z)| norm 0.2517 (-0.50z)| lr 4.16e-04 | 4247.08 ms | 31.8% bf16 MFU | 124023 tok/s step 10050/19560 | loss 3.409521 (+0.26z)| norm 0.2289 (-1.62z)| lr 4.16e-04 | 4162.24 ms | 32.4% bf16 MFU | 124120 tok/s step 10051/19560 | loss 3.369785 (-0.52z)| norm 0.2500 (-0.57z)| lr 4.16e-04 | 4155.19 ms | 32.5% bf16 MFU | 124223 tok/s step 10052/19560 | loss 3.372191 (-0.46z)| norm 0.2359 (-1.25z)| lr 4.16e-04 | 4171.95 ms | 32.4% bf16 MFU | 124295 tok/s step 10053/19560 | loss 3.359190 (-0.73z)| norm 0.2374 (-1.17z)| lr 4.15e-04 | 4183.12 ms | 32.3% bf16 MFU | 124347 tok/s step 10054/19560 | loss 3.394279 (-0.03z)| norm 0.2526 (-0.40z)| lr 4.15e-04 | 4178.50 ms | 32.3% bf16 MFU | 124403 tok/s step 10055/19560 | loss 3.377565 (-0.38z)| norm 0.2396 (-1.04z)| lr 4.15e-04 | 4162.35 ms | 32.4% bf16 MFU | 124481 tok/s step 10056/19560 | loss 3.493696 (+1.97z)| norm 0.2512 (-0.46z)| lr 4.15e-04 | 4209.86 ms | 32.1% bf16 MFU | 124484 tok/s step 10057/19560 | loss 3.447501 (+1.02z)| norm 0.2417 (-0.93z)| lr 4.15e-04 | 4159.09 ms | 32.5% bf16 MFU | 124563 tok/s step 10058/19560 | loss 3.435783 (+0.78z)| norm 0.2510 (-0.46z)| lr 4.15e-04 | 4161.26 ms | 32.4% bf16 MFU | 124634 tok/s step 10059/19560 | loss 3.368628 (-0.59z)| norm 0.2519 (-0.41z)| lr 4.15e-04 | 4165.43 ms | 32.4% bf16 MFU | 124696 tok/s step 10060/19560 | loss 3.446419 (+0.98z)| norm 0.2392 (-1.02z)| lr 4.15e-04 | 4159.27 ms | 32.5% bf16 MFU | 124764 tok/s step 10061/19560 | loss 3.342487 
(-1.13z)| norm 0.2609 (+0.05z)| lr 4.15e-04 | 4162.16 ms | 32.4% bf16 MFU | 124824 tok/s step 10062/19560 | loss 3.396618 (-0.04z)| norm 0.2473 (-0.61z)| lr 4.15e-04 | 4233.29 ms | 31.9% bf16 MFU | 124775 tok/s step 10063/19560 | loss 3.411873 (+0.27z)| norm 0.2541 (-0.28z)| lr 4.15e-04 | 4151.39 ms | 32.5% bf16 MFU | 124851 tok/s step 10064/19560 | loss 3.370054 (-0.59z)| norm 0.2547 (-0.24z)| lr 4.15e-04 | 4167.47 ms | 32.4% bf16 MFU | 124899 tok/s step 10065/19560 | loss 3.428229 (+0.66z)| norm 0.2491 (-0.51z)| lr 4.15e-04 | 4190.84 ms | 32.2% bf16 MFU | 124909 tok/s step 10066/19560 | loss 3.372862 (-0.54z)| norm 0.2330 (-1.30z)| lr 4.15e-04 | 4173.98 ms | 32.3% bf16 MFU | 124944 tok/s step 10067/19560 | loss 3.515813 (+2.48z)| norm 0.2482 (-0.52z)| lr 4.14e-04 | 4168.42 ms | 32.4% bf16 MFU | 124985 tok/s step 10068/19560 | loss 3.401431 (+0.04z)| norm 0.2417 (-0.85z)| lr 4.14e-04 | 4157.40 ms | 32.5% bf16 MFU | 125042 tok/s step 10069/19560 | loss 3.435732 (+0.76z)| norm 0.2553 (-0.17z)| lr 4.14e-04 | 4158.25 ms | 32.5% bf16 MFU | 125094 tok/s step 10070/19560 | loss 3.416402 (+0.34z)| norm 0.2444 (-0.71z)| lr 4.14e-04 | 4160.52 ms | 32.5% bf16 MFU | 125140 tok/s step 10071/19560 | loss 3.420798 (+0.44z)| norm 0.2396 (-0.93z)| lr 4.14e-04 | 4230.94 ms | 31.9% bf16 MFU | 125079 tok/s step 10072/19560 | loss 3.411293 (+0.24z)| norm 0.2551 (-0.16z)| lr 4.14e-04 | 4229.11 ms | 31.9% bf16 MFU | 125023 tok/s step 10073/19560 | loss 3.377067 (-0.49z)| norm 0.2598 (+0.07z)| lr 4.14e-04 | 4175.51 ms | 32.3% bf16 MFU | 125050 tok/s step 10074/19560 | loss 3.344219 (-1.18z)| norm 0.2397 (-0.92z)| lr 4.14e-04 | 4383.62 ms | 30.8% bf16 MFU | 124778 tok/s step 10075/19560 | loss 3.377961 (-0.45z)| norm 0.2381 (-0.99z)| lr 4.14e-04 | 4160.08 ms | 32.5% bf16 MFU | 124840 tok/s step 10076/19560 | loss 3.364193 (-0.75z)| norm 0.2287 (-1.44z)| lr 4.14e-04 | 4159.40 ms | 32.5% bf16 MFU | 124901 tok/s step 10077/19560 | loss 3.419448 (+0.42z)| norm 0.2601 (+0.10z)| lr 4.14e-04 | 
4210.14 ms | 32.1% bf16 MFU | 124882 tok/s step 10078/19560 | loss 3.430171 (+0.63z)| norm 0.2662 (+0.40z)| lr 4.14e-04 | 4161.02 ms | 32.4% bf16 MFU | 124938 tok/s step 10079/19560 | loss 3.401068 (+0.02z)| norm 0.2604 (+0.11z)| lr 4.14e-04 | 4153.34 ms | 32.5% bf16 MFU | 125003 tok/s step 10080/19560 | loss 3.329692 (-1.50z)| norm 0.3028 (+2.14z)| lr 4.14e-04 | 4177.08 ms | 32.3% bf16 MFU | 125028 tok/s step 10081/19560 | loss 3.389343 (-0.22z)| norm 0.2760 (+0.83z)| lr 4.14e-04 | 4160.79 ms | 32.4% bf16 MFU | 125077 tok/s step 10082/19560 | loss 3.412528 (+0.28z)| norm 0.2423 (-0.80z)| lr 4.13e-04 | 4160.65 ms | 32.5% bf16 MFU | 125124 tok/s step 10083/19560 | loss 3.426036 (+0.59z)| norm 0.2658 (+0.34z)| lr 4.13e-04 | 4162.42 ms | 32.4% bf16 MFU | 125166 tok/s step 10084/19560 | loss 3.366351 (-0.72z)| norm 0.2680 (+0.44z)| lr 4.13e-04 | 4158.49 ms | 32.5% bf16 MFU | 125211 tok/s step 10085/19560 | loss 3.499573 (+2.14z)| norm 0.2774 (+0.89z)| lr 4.13e-04 | 4164.79 ms | 32.4% bf16 MFU | 125245 tok/s step 10086/19560 | loss 3.387433 (-0.28z)| norm 0.2556 (-0.17z)| lr 4.13e-04 | 4159.78 ms | 32.5% bf16 MFU | 125285 tok/s step 10087/19560 | loss 3.412734 (+0.26z)| norm 0.2784 (+0.92z)| lr 4.13e-04 | 4168.80 ms | 32.4% bf16 MFU | 125309 tok/s step 10088/19560 | loss 3.400729 (-0.01z)| norm 0.2418 (-0.83z)| lr 4.13e-04 | 4164.53 ms | 32.4% bf16 MFU | 125338 tok/s step 10089/19560 | loss 3.398135 (-0.04z)| norm 0.2821 (+1.09z)| lr 4.13e-04 | 4202.65 ms | 32.1% bf16 MFU | 125309 tok/s step 10090/19560 | loss 3.414285 (+0.34z)| norm 0.2286 (-1.44z)| lr 4.13e-04 | 4156.25 ms | 32.5% bf16 MFU | 125350 tok/s step 10091/19560 | loss 3.418299 (+0.42z)| norm 0.2880 (+1.36z)| lr 4.13e-04 | 4154.19 ms | 32.5% bf16 MFU | 125393 tok/s step 10092/19560 | loss 3.396757 (-0.07z)| norm 0.2622 (+0.14z)| lr 4.13e-04 | 4160.11 ms | 32.5% bf16 MFU | 125425 tok/s step 10093/19560 | loss 3.425316 (+0.59z)| norm 0.2664 (+0.37z)| lr 4.13e-04 | 4158.82 ms | 32.5% bf16 MFU | 125457 tok/s step 
10094/19560 | loss 3.422878 (+0.52z)| norm 0.2868 (+1.33z)| lr 4.13e-04 | 4163.95 ms | 32.4% bf16 MFU | 125480 tok/s step 10095/19560 | loss 3.407608 (+0.16z)| norm 0.3119 (+2.45z)| lr 4.13e-04 | 4165.69 ms | 32.4% bf16 MFU | 125499 tok/s step 10096/19560 | loss 3.393061 (-0.17z)| norm 0.2742 (+0.67z)| lr 4.13e-04 | 4163.78 ms | 32.4% bf16 MFU | 125520 tok/s step 10097/19560 | loss 3.459677 (+1.38z)| norm 0.2846 (+1.14z)| lr 4.12e-04 | 4159.89 ms | 32.5% bf16 MFU | 125545 tok/s step 10098/19560 | loss 3.446021 (+1.05z)| norm 0.2583 (-0.10z)| lr 4.12e-04 | 4155.31 ms | 32.5% bf16 MFU | 125577 tok/s step 10099/19560 | loss 3.464205 (+1.45z)| norm 0.2661 (+0.26z)| lr 4.12e-04 | 4173.12 ms | 32.4% bf16 MFU | 125580 tok/s step 10100/19560 | loss 3.378063 (-0.55z)| norm 0.2602 (-0.03z)| lr 4.12e-04 | 4158.44 ms | 32.5% bf16 MFU | 125604 tok/s step 10101/19560 | loss 3.381891 (-0.46z)| norm 0.2554 (-0.27z)| lr 4.12e-04 | 4160.50 ms | 32.5% bf16 MFU | 125625 tok/s step 10102/19560 | loss 3.438759 (+0.86z)| norm 0.2513 (-0.46z)| lr 4.12e-04 | 4159.29 ms | 32.5% bf16 MFU | 125646 tok/s step 10103/19560 | loss 3.389377 (-0.29z)| norm 0.2585 (-0.11z)| lr 4.12e-04 | 4163.75 ms | 32.4% bf16 MFU | 125660 tok/s step 10104/19560 | loss 3.393791 (-0.18z)| norm 0.2424 (-0.87z)| lr 4.12e-04 | 4164.86 ms | 32.4% bf16 MFU | 125671 tok/s step 10105/19560 | loss 3.350063 (-1.19z)| norm 0.2809 (+0.95z)| lr 4.12e-04 | 4256.49 ms | 31.7% bf16 MFU | 125546 tok/s step 10106/19560 | loss 3.387797 (-0.31z)| norm 0.2358 (-1.18z)| lr 4.12e-04 | 4165.66 ms | 32.4% bf16 MFU | 125562 tok/s step 10107/19560 | loss 3.408391 (+0.16z)| norm 0.2567 (-0.19z)| lr 4.12e-04 | 4149.66 ms | 32.5% bf16 MFU | 125601 tok/s step 10108/19560 | loss 3.435572 (+0.79z)| norm 0.2425 (-0.86z)| lr 4.12e-04 | 4164.90 ms | 32.4% bf16 MFU | 125615 tok/s step 10109/19560 | loss 3.438162 (+0.84z)| norm 0.2491 (-0.55z)| lr 4.12e-04 | 4158.54 ms | 32.5% bf16 MFU | 125638 tok/s step 10110/19560 | loss 3.381927 (-0.47z)| norm 
0.2634 (+0.13z)| lr 4.12e-04 | 4152.59 ms | 32.5% bf16 MFU | 125669 tok/s step 10111/19560 | loss 3.456131 (+1.28z)| norm 0.2731 (+0.59z)| lr 4.12e-04 | 4164.66 ms | 32.4% bf16 MFU | 125680 tok/s step 10112/19560 | loss 3.356411 (-1.07z)| norm 0.2627 (+0.10z)| lr 4.11e-04 | 4177.39 ms | 32.3% bf16 MFU | 125671 tok/s step 10113/19560 | loss 3.430871 (+0.67z)| norm 0.2531 (-0.35z)| lr 4.11e-04 | 4165.25 ms | 32.4% bf16 MFU | 125681 tok/s step 10114/19560 | loss 3.389236 (-0.32z)| norm 0.2854 (+1.17z)| lr 4.11e-04 | 4164.32 ms | 32.4% bf16 MFU | 125692 tok/s step 10115/19560 | loss 3.544868 (+3.22z)| norm 0.2717 (+0.52z)| lr 4.11e-04 | 4157.76 ms | 32.5% bf16 MFU | 125713 tok/s step 10116/19560 | loss 3.412533 (+0.19z)| norm 0.2456 (-0.70z)| lr 4.11e-04 | 4167.33 ms | 32.4% bf16 MFU | 125717 tok/s step 10117/19560 | loss 3.371871 (-0.74z)| norm 0.2590 (-0.07z)| lr 4.11e-04 | 4164.53 ms | 32.4% bf16 MFU | 125726 tok/s step 10118/19560 | loss 3.395941 (-0.19z)| norm 0.2510 (-0.44z)| lr 4.11e-04 | 4164.41 ms | 32.4% bf16 MFU | 125735 tok/s step 10119/19560 | loss 3.384067 (-0.45z)| norm 0.2530 (-0.34z)| lr 4.11e-04 | 4164.18 ms | 32.4% bf16 MFU | 125743 tok/s step 10120/19560 | loss 3.439159 (+0.80z)| norm 0.2393 (-1.00z)| lr 4.11e-04 | 4156.22 ms | 32.5% bf16 MFU | 125763 tok/s step 10121/19560 | loss 3.392404 (-0.27z)| norm 0.2413 (-0.89z)| lr 4.11e-04 | 4157.45 ms | 32.5% bf16 MFU | 125781 tok/s step 10122/19560 | loss 3.411258 (+0.16z)| norm 0.2369 (-1.09z)| lr 4.11e-04 | 4156.60 ms | 32.5% bf16 MFU | 125798 tok/s step 10123/19560 | loss 3.380286 (-0.56z)| norm 0.2469 (-0.59z)| lr 4.11e-04 | 4167.75 ms | 32.4% bf16 MFU | 125798 tok/s step 10124/19560 | loss 3.487412 (+1.89z)| norm 0.2427 (-0.79z)| lr 4.11e-04 | 4162.34 ms | 32.4% bf16 MFU | 125806 tok/s step 10125/19560 | loss 3.386724 (-0.41z)| norm 0.2435 (-0.75z)| lr 4.11e-04 | 4160.01 ms | 32.5% bf16 MFU | 125817 tok/s step 10126/19560 | loss 3.347374 (-1.35z)| norm 0.2354 (-1.12z)| lr 4.10e-04 | 4161.39 ms | 
32.4% bf16 MFU | 125826 tok/s step 10127/19560 | loss 3.411822 (+0.15z)| norm 0.2515 (-0.34z)| lr 4.10e-04 | 4161.14 ms | 32.4% bf16 MFU | 125835 tok/s step 10128/19560 | loss 3.429973 (+0.56z)| norm 0.2686 (+0.49z)| lr 4.10e-04 | 4161.91 ms | 32.4% bf16 MFU | 125841 tok/s step 10129/19560 | loss 3.416800 (+0.25z)| norm 0.2319 (-1.29z)| lr 4.10e-04 | 4166.55 ms | 32.4% bf16 MFU | 125841 tok/s step 10130/19560 | loss 3.403462 (-0.07z)| norm 0.2585 (+0.01z)| lr 4.10e-04 | 4167.89 ms | 32.4% bf16 MFU | 125839 tok/s step 10131/19560 | loss 3.392060 (-0.35z)| norm 0.2628 (+0.21z)| lr 4.10e-04 | 4166.89 ms | 32.4% bf16 MFU | 125838 tok/s step 10132/19560 | loss 3.414049 (+0.16z)| norm 0.2574 (-0.06z)| lr 4.10e-04 | 4158.67 ms | 32.5% bf16 MFU | 125849 tok/s step 10133/19560 | loss 3.428574 (+0.50z)| norm 0.2433 (-0.76z)| lr 4.10e-04 | 4167.34 ms | 32.4% bf16 MFU | 125847 tok/s step 10134/19560 | loss 3.423271 (+0.37z)| norm 0.2584 (-0.02z)| lr 4.10e-04 | 4164.84 ms | 32.4% bf16 MFU | 125849 tok/s step 10135/19560 | loss 3.429215 (+0.52z)| norm 0.2394 (-0.95z)| lr 4.10e-04 | 4163.53 ms | 32.4% bf16 MFU | 125853 tok/s step 10136/19560 | loss 3.374274 (-0.79z)| norm 0.2826 (+1.16z)| lr 4.10e-04 | 4166.62 ms | 32.4% bf16 MFU | 125852 tok/s step 10137/19560 | loss 3.366859 (-0.95z)| norm 0.2482 (-0.53z)| lr 4.10e-04 | 4160.52 ms | 32.5% bf16 MFU | 125860 tok/s step 10138/19560 | loss 3.439411 (+0.80z)| norm 0.2791 (+0.98z)| lr 4.10e-04 | 4169.49 ms | 32.4% bf16 MFU | 125854 tok/s step 10139/19560 | loss 3.443211 (+0.88z)| norm 0.2380 (-1.02z)| lr 4.10e-04 | 4165.27 ms | 32.4% bf16 MFU | 125855 tok/s step 10140/19560 | loss 3.416078 (+0.19z)| norm 0.2505 (-0.40z)| lr 4.10e-04 | 4171.27 ms | 32.4% bf16 MFU | 125847 tok/s step 10141/19560 | loss 3.361177 (-1.18z)| norm 0.2479 (-0.53z)| lr 4.09e-04 | 4162.21 ms | 32.4% bf16 MFU | 125853 tok/s step 10142/19560 | loss 3.394125 (-0.35z)| norm 0.2390 (-0.95z)| lr 4.09e-04 | 4165.49 ms | 32.4% bf16 MFU | 125853 tok/s step 10143/19560 
| loss 3.402496 (-0.14z)| norm 0.2412 (-0.84z)| lr 4.09e-04 | 4165.34 ms | 32.4% bf16 MFU | 125854 tok/s step 10144/19560 | loss 3.412150 (+0.10z)| norm 0.2628 (+0.24z)| lr 4.09e-04 | 4175.67 ms | 32.3% bf16 MFU | 125839 tok/s step 10145/19560 | loss 3.435289 (+0.68z)| norm 0.2600 (+0.09z)| lr 4.09e-04 | 4164.31 ms | 32.4% bf16 MFU | 125842 tok/s step 10146/19560 | loss 3.370160 (-0.99z)| norm 0.2367 (-1.05z)| lr 4.09e-04 | 4165.19 ms | 32.4% bf16 MFU | 125844 tok/s step 10147/19560 | loss 3.349883 (-1.49z)| norm 0.2648 (+0.34z)| lr 4.09e-04 | 4166.33 ms | 32.4% bf16 MFU | 125844 tok/s step 10148/19560 | loss 3.388128 (-0.52z)| norm 0.2445 (-0.66z)| lr 4.09e-04 | 4181.56 ms | 32.3% bf16 MFU | 125821 tok/s step 10149/19560 | loss 3.405938 (-0.05z)| norm 0.2432 (-0.73z)| lr 4.09e-04 | 4157.76 ms | 32.5% bf16 MFU | 125834 tok/s step 10150/19560 | loss 3.385417 (-0.58z)| norm 0.2393 (-0.91z)| lr 4.09e-04 | 4166.96 ms | 32.4% bf16 MFU | 125834 tok/s step 10151/19560 | loss 3.442926 (+0.92z)| norm 0.2546 (-0.16z)| lr 4.09e-04 | 4163.51 ms | 32.4% bf16 MFU | 125838 tok/s step 10152/19560 | loss 3.403124 (-0.12z)| norm 0.2585 (+0.04z)| lr 4.09e-04 | 4158.29 ms | 32.5% bf16 MFU | 125850 tok/s step 10153/19560 | loss 3.402130 (-0.14z)| norm 0.2522 (-0.28z)| lr 4.09e-04 | 4158.13 ms | 32.5% bf16 MFU | 125862 tok/s step 10154/19560 | loss 3.439099 (+0.82z)| norm 0.2644 (+0.35z)| lr 4.09e-04 | 4160.97 ms | 32.4% bf16 MFU | 125869 tok/s step 10155/19560 | loss 3.461042 (+1.37z)| norm 0.2511 (-0.31z)| lr 4.09e-04 | 4165.61 ms | 32.4% bf16 MFU | 125869 tok/s step 10156/19560 | loss 3.389452 (-0.47z)| norm 0.2400 (-0.97z)| lr 4.08e-04 | 4166.55 ms | 32.4% bf16 MFU | 125867 tok/s step 10157/19560 | loss 3.366096 (-1.07z)| norm 0.2302 (-1.52z)| lr 4.08e-04 | 5689.23 ms | 23.7% bf16 MFU | 124181 tok/s step 10158/19560 | loss 3.423739 (+0.45z)| norm 0.2367 (-1.16z)| lr 4.08e-04 | 4150.68 ms | 32.5% bf16 MFU | 124288 tok/s step 10159/19560 | loss 3.512025 (+2.70z)| norm 0.5548 (+9.68z)| 
lr 4.08e-04 | 4158.91 ms | 32.5% bf16 MFU | 124377 tok/s step 10160/19560 | loss 3.403718 (-0.09z)| norm 0.3218 (+2.06z)| lr 4.08e-04 | 4165.91 ms | 32.4% bf16 MFU | 124451 tok/s step 10161/19560 | loss 3.370938 (-0.94z)| norm 0.2754 (+0.57z)| lr 4.08e-04 | 4161.17 ms | 32.4% bf16 MFU | 124528 tok/s step 10162/19560 | loss 3.479658 (+1.91z)| norm 0.3006 (+1.38z)| lr 4.08e-04 | 4207.56 ms | 32.1% bf16 MFU | 124532 tok/s step 10163/19560 | loss 3.326808 (-2.06z)| norm 0.2645 (+0.22z)| lr 4.08e-04 | 4153.93 ms | 32.5% bf16 MFU | 124616 tok/s step 10164/19560 | loss 3.416833 (+0.28z)| norm 0.2880 (+0.97z)| lr 4.08e-04 | 4162.03 ms | 32.4% bf16 MFU | 124684 tok/s step 10165/19560 | loss 3.489347 (+2.14z)| norm 0.2521 (-0.19z)| lr 4.08e-04 | 4165.83 ms | 32.4% bf16 MFU | 124742 tok/s step 10166/19560 | loss 3.424405 (+0.43z)| norm 0.3101 (+1.64z)| lr 4.08e-04 | 4160.35 ms | 32.5% bf16 MFU | 124806 tok/s step 10167/19560 | loss 3.402220 (-0.14z)| norm 0.2476 (-0.34z)| lr 4.08e-04 | 4160.62 ms | 32.5% bf16 MFU | 124866 tok/s step 10168/19560 | loss 3.457661 (+1.29z)| norm 0.3071 (+1.52z)| lr 4.08e-04 | 4163.59 ms | 32.4% bf16 MFU | 124919 tok/s step 10169/19560 | loss 3.404209 (-0.10z)| norm 0.2641 (+0.17z)| lr 4.08e-04 | 4166.33 ms | 32.4% bf16 MFU | 124965 tok/s step 10170/19560 | loss 3.463003 (+1.41z)| norm 0.2694 (+0.33z)| lr 4.08e-04 | 4152.62 ms | 32.5% bf16 MFU | 125030 tok/s step 10171/19560 | loss 3.418236 (+0.24z)| norm 0.2721 (+0.41z)| lr 4.07e-04 | 4157.06 ms | 32.5% bf16 MFU | 125084 tok/s step 10172/19560 | loss 3.356987 (-1.35z)| norm 0.2565 (-0.08z)| lr 4.07e-04 | 4159.56 ms | 32.5% bf16 MFU | 125132 tok/s step 10173/19560 | loss 3.386194 (-0.59z)| norm 0.2683 (+0.28z)| lr 4.07e-04 | 4168.21 ms | 32.4% bf16 MFU | 125165 tok/s step 10174/19560 | loss 3.433606 (+0.65z)| norm 0.2717 (+0.39z)| lr 4.07e-04 | 4163.03 ms | 32.4% bf16 MFU | 125203 tok/s step 10175/19560 | loss 3.366073 (-1.11z)| norm 0.2624 (+0.09z)| lr 4.07e-04 | 4159.39 ms | 32.5% bf16 MFU | 
125246 tok/s step 10176/19560 | loss 3.419381 (+0.28z)| norm 0.2576 (-0.06z)| lr 4.07e-04 | 4161.59 ms | 32.4% bf16 MFU | 125282 tok/s step 10177/19560 | loss 3.457564 (+1.26z)| norm 0.2603 (+0.03z)| lr 4.07e-04 | 4161.26 ms | 32.4% bf16 MFU | 125318 tok/s step 10178/19560 | loss 3.455370 (+1.19z)| norm 0.2539 (-0.18z)| lr 4.07e-04 | 4157.53 ms | 32.5% bf16 MFU | 125357 tok/s step 10179/19560 | loss 3.377636 (-0.82z)| norm 0.2456 (-0.44z)| lr 4.07e-04 | 4159.08 ms | 32.5% bf16 MFU | 125392 tok/s step 10180/19560 | loss 3.399291 (-0.27z)| norm 0.2630 (+0.10z)| lr 4.07e-04 | 4161.56 ms | 32.4% bf16 MFU | 125422 tok/s step 10181/19560 | loss 3.408325 (-0.05z)| norm 0.2453 (-0.46z)| lr 4.07e-04 | 4161.82 ms | 32.4% bf16 MFU | 125450 tok/s step 10182/19560 | loss 3.442608 (+0.84z)| norm 0.2608 (+0.03z)| lr 4.07e-04 | 4162.93 ms | 32.4% bf16 MFU | 125474 tok/s step 10183/19560 | loss 3.392392 (-0.48z)| norm 0.2627 (+0.08z)| lr 4.07e-04 | 4168.40 ms | 32.4% bf16 MFU | 125489 tok/s step 10184/19560 | loss 3.404751 (-0.14z)| norm 0.2433 (-0.53z)| lr 4.07e-04 | 4164.84 ms | 32.4% bf16 MFU | 125509 tok/s step 10185/19560 | loss 3.396486 (-0.35z)| norm 0.2426 (-0.55z)| lr 4.06e-04 | 4167.44 ms | 32.4% bf16 MFU | 125524 tok/s step 10186/19560 | loss 3.421271 (+0.32z)| norm 0.2703 (+0.32z)| lr 4.06e-04 | 4165.16 ms | 32.4% bf16 MFU | 125542 tok/s step 10187/19560 | loss 3.381546 (-0.75z)| norm 0.2332 (-0.85z)| lr 4.06e-04 | 4163.73 ms | 32.4% bf16 MFU | 125560 tok/s step 10188/19560 | loss 3.449514 (+1.07z)| norm 0.2500 (-0.32z)| lr 4.06e-04 | 4164.16 ms | 32.4% bf16 MFU | 125578 tok/s step 10189/19560 | loss 3.432184 (+0.60z)| norm 0.2605 (+0.01z)| lr 4.06e-04 | 4163.35 ms | 32.4% bf16 MFU | 125595 tok/s step 10190/19560 | loss 3.417613 (+0.20z)| norm 0.2543 (-0.19z)| lr 4.06e-04 | 4171.70 ms | 32.4% bf16 MFU | 125599 tok/s step 10191/19560 | loss 3.449255 (+1.04z)| norm 0.2520 (-0.26z)| lr 4.06e-04 | 4159.18 ms | 32.5% bf16 MFU | 125622 tok/s step 10192/19560 | loss 3.335834 
(-1.99z)| norm 0.2609 (+0.02z)| lr 4.06e-04 | 4159.87 ms | 32.5% bf16 MFU | 125643 tok/s step 10193/19560 | loss 3.388234 (-0.58z)| norm 0.2242 (-1.13z)| lr 4.06e-04 | 4171.09 ms | 32.4% bf16 MFU | 125645 tok/s step 10194/19560 | loss 3.433527 (+0.61z)| norm 0.2668 (+0.20z)| lr 4.06e-04 | 4157.01 ms | 32.5% bf16 MFU | 125669 tok/s step 10195/19560 | loss 3.438015 (+0.77z)| norm 0.2361 (-0.76z)| lr 4.06e-04 | 4166.74 ms | 32.4% bf16 MFU | 125677 tok/s step 10196/19560 | loss 3.428409 (+0.50z)| norm 0.2385 (-0.68z)| lr 4.06e-04 | 4218.82 ms | 32.0% bf16 MFU | 125607 tok/s step 10197/19560 | loss 3.405688 (-0.12z)| norm 0.2461 (-0.44z)| lr 4.06e-04 | 4162.00 ms | 32.4% bf16 MFU | 125625 tok/s step 10198/19560 | loss 3.433073 (+0.63z)| norm 0.3050 (+1.38z)| lr 4.06e-04 | 4165.20 ms | 32.4% bf16 MFU | 125637 tok/s step 10199/19560 | loss 3.358315 (-1.40z)| norm 0.2669 (+0.19z)| lr 4.06e-04 | 4163.53 ms | 32.4% bf16 MFU | 125652 tok/s step 10200/19560 | loss 3.407105 (-0.07z)| norm 0.2384 (-0.70z)| lr 4.05e-04 | 4180.43 ms | 32.3% bf16 MFU | 125640 tok/s step 10201/19560 | loss 3.497593 (+2.34z)| norm 0.2655 (+0.15z)| lr 4.05e-04 | 4159.28 ms | 32.5% bf16 MFU | 125661 tok/s step 10202/19560 | loss 3.457525 (+1.25z)| norm 0.2403 (-0.64z)| lr 4.05e-04 | 4170.14 ms | 32.4% bf16 MFU | 125664 tok/s step 10203/19560 | loss 3.435243 (+0.63z)| norm 0.2728 (+0.37z)| lr 4.05e-04 | 4169.06 ms | 32.4% bf16 MFU | 125668 tok/s step 10204/19560 | loss 3.422898 (+0.29z)| norm 0.2499 (-0.35z)| lr 4.05e-04 | 4173.48 ms | 32.4% bf16 MFU | 125666 tok/s step 10205/19560 | loss 3.397494 (-0.40z)| norm 0.2655 (+0.13z)| lr 4.05e-04 | 4173.33 ms | 32.4% bf16 MFU | 125664 tok/s step 10206/19560 | loss 3.510473 (+2.59z)| norm 0.2611 (-0.00z)| lr 4.05e-04 | 4159.08 ms | 32.5% bf16 MFU | 125684 tok/s step 10207/19560 | loss 3.388050 (-0.65z)| norm 0.2493 (-0.37z)| lr 4.05e-04 | 4167.13 ms | 32.4% bf16 MFU | 125691 tok/s step 10208/19560 | loss 3.439005 (+0.69z)| norm 0.2731 (+0.39z)| lr 4.05e-04 | 
4171.63 ms | 32.4% bf16 MFU | 125690 tok/s step 10209/19560 | loss 3.382305 (-0.84z)| norm 0.2468 (-0.44z)| lr 4.05e-04 | 4167.63 ms | 32.4% bf16 MFU | 125695 tok/s step 10210/19560 | loss 3.363454 (-1.33z)| norm 0.2948 (+1.06z)| lr 4.05e-04 | 5348.75 ms | 25.2% bf16 MFU | 124312 tok/s step 10211/19560 | loss 3.410266 (-0.07z)| norm 0.2845 (+0.73z)| lr 4.05e-04 | 4159.60 ms | 32.5% bf16 MFU | 124398 tok/s step 10212/19560 | loss 3.429111 (+0.42z)| norm 0.2495 (-0.36z)| lr 4.05e-04 | 4301.91 ms | 31.4% bf16 MFU | 124272 tok/s step 10213/19560 | loss 3.345263 (-1.82z)| norm 0.2656 (+0.15z)| lr 4.05e-04 | 4737.73 ms | 28.5% bf16 MFU | 123592 tok/s step 10214/19560 | loss 3.394015 (-0.50z)| norm 0.2504 (-0.33z)| lr 4.05e-04 | 4671.02 ms | 28.9% bf16 MFU | 123024 tok/s step 10215/19560 | loss 3.405364 (-0.19z)| norm 0.2687 (+0.25z)| lr 4.04e-04 | 4555.71 ms | 29.6% bf16 MFU | 122627 tok/s step 10216/19560 | loss 3.443397 (+0.84z)| norm 0.2680 (+0.22z)| lr 4.04e-04 | 4241.90 ms | 31.8% bf16 MFU | 122676 tok/s step 10217/19560 | loss 3.426714 (+0.38z)| norm 0.2661 (+0.16z)| lr 4.04e-04 | 4521.83 ms | 29.9% bf16 MFU | 122339 tok/s step 10218/19560 | loss 3.429770 (+0.46z)| norm 0.2660 (+0.15z)| lr 4.04e-04 | 4321.72 ms | 31.2% bf16 MFU | 122288 tok/s step 10219/19560 | loss 3.355507 (-1.53z)| norm 0.2427 (-0.57z)| lr 4.04e-04 | 4235.41 ms | 31.9% bf16 MFU | 122363 tok/s step 10220/19560 | loss 3.379972 (-0.86z)| norm 0.2920 (+0.97z)| lr 4.04e-04 | 4220.41 ms | 32.0% bf16 MFU | 122456 tok/s step 10221/19560 | loss 3.408744 (-0.09z)| norm 0.2490 (-0.38z)| lr 4.04e-04 | 4171.10 ms | 32.4% bf16 MFU | 122618 tok/s step 10222/19560 | loss 3.474102 (+1.63z)| norm 0.3072 (+1.44z)| lr 4.04e-04 | 4180.62 ms | 32.3% bf16 MFU | 122758 tok/s step 10223/19560 | loss 3.389632 (-0.60z)| norm 0.2584 (-0.07z)| lr 4.04e-04 | 4347.67 ms | 31.1% bf16 MFU | 122649 tok/s step 10224/19560 | loss 3.465948 (+1.39z)| norm 0.2766 (+0.50z)| lr 4.04e-04 | 4264.09 ms | 31.7% bf16 MFU | 122664 tok/s step 
10225/19560 | loss 3.367692 (-1.17z)| norm 0.2454 (-0.48z)| lr 4.04e-04 | 4195.81 ms | 32.2% bf16 MFU | 122779 tok/s step 10226/19560 | loss 3.358676 (-1.38z)| norm 0.2463 (-0.45z)| lr 4.04e-04 | 4296.45 ms | 31.4% bf16 MFU | 122741 tok/s step 10227/19560 | loss 3.422134 (+0.29z)| norm 0.2712 (+0.34z)| lr 4.04e-04 | 4248.91 ms | 31.8% bf16 MFU | 122774 tok/s step 10228/19560 | loss 3.358300 (-1.39z)| norm 0.2504 (-0.31z)| lr 4.04e-04 | 4168.37 ms | 32.4% bf16 MFU | 122924 tok/s step 10229/19560 | loss 3.438252 (+0.70z)| norm 0.2728 (+0.39z)| lr 4.04e-04 | 4178.33 ms | 32.3% bf16 MFU | 123052 tok/s step 10230/19560 | loss 3.430724 (+0.51z)| norm 0.2555 (-0.16z)| lr 4.03e-04 | 4187.11 ms | 32.2% bf16 MFU | 123160 tok/s step 10231/19560 | loss 3.353785 (-1.50z)| norm 0.2737 (+0.42z)| lr 4.03e-04 | 4226.10 ms | 31.9% bf16 MFU | 123205 tok/s step 10232/19560 | loss 3.397122 (-0.37z)| norm 0.2591 (-0.05z)| lr 4.03e-04 | 4165.53 ms | 32.4% bf16 MFU | 123338 tok/s step 10233/19560 | loss 3.381355 (-0.79z)| norm 0.3002 (+1.24z)| lr 4.03e-04 | 4165.21 ms | 32.4% bf16 MFU | 123465 tok/s step 10234/19560 | loss 3.496120 (+2.17z)| norm 0.2657 (+0.14z)| lr 4.03e-04 | 4175.80 ms | 32.3% bf16 MFU | 123569 tok/s step 10235/19560 | loss 3.381344 (-0.79z)| norm 0.2614 (+0.01z)| lr 4.03e-04 | 4175.58 ms | 32.3% bf16 MFU | 123669 tok/s step 10236/19560 | loss 3.399469 (-0.32z)| norm 0.2744 (+0.41z)| lr 4.03e-04 | 4231.73 ms | 31.9% bf16 MFU | 123680 tok/s step 10237/19560 | loss 3.330786 (-2.04z)| norm 0.2374 (-0.75z)| lr 4.03e-04 | 4168.40 ms | 32.4% bf16 MFU | 123785 tok/s step 10238/19560 | loss 3.425152 (+0.35z)| norm 0.2608 (-0.01z)| lr 4.03e-04 | 4210.12 ms | 32.1% bf16 MFU | 123822 tok/s step 10239/19560 | loss 3.414367 (+0.09z)| norm 0.2567 (-0.14z)| lr 4.03e-04 | 4185.72 ms | 32.3% bf16 MFU | 123894 tok/s step 10240/19560 | loss 3.434814 (+0.60z)| norm 0.2731 (+0.38z)| lr 4.03e-04 | 4175.58 ms | 32.3% bf16 MFU | 123977 tok/s step 10241/19560 | loss 3.408619 (-0.07z)| norm 
0.2431 (-0.57z)| lr 4.03e-04 | 4164.88 ms | 32.4% bf16 MFU | 124073 tok/s step 10242/19560 | loss 3.383029 (-0.73z)| norm 0.2553 (-0.18z)| lr 4.03e-04 | 4184.63 ms | 32.3% bf16 MFU | 124133 tok/s step 10243/19560 | loss 3.347958 (-1.66z)| norm 0.2434 (-0.55z)| lr 4.03e-04 | 4216.07 ms | 32.0% bf16 MFU | 124144 tok/s step 10244/19560 | loss 3.377185 (-0.86z)| norm 0.2430 (-0.56z)| lr 4.03e-04 | 4177.91 ms | 32.3% bf16 MFU | 124212 tok/s step 10245/19560 | loss 3.383801 (-0.69z)| norm 0.2474 (-0.42z)| lr 4.02e-04 | 4195.54 ms | 32.2% bf16 MFU | 124249 tok/s step 10246/19560 | loss 3.386260 (-0.62z)| norm 0.2461 (-0.45z)| lr 4.02e-04 | 4165.96 ms | 32.4% bf16 MFU | 124329 tok/s step 10247/19560 | loss 3.307022 (-2.65z)| norm 0.2375 (-0.72z)| lr 4.02e-04 | 4173.35 ms | 32.4% bf16 MFU | 124394 tok/s step 10248/19560 | loss 3.338509 (-1.79z)| norm 0.2538 (-0.21z)| lr 4.02e-04 | 4174.30 ms | 32.3% bf16 MFU | 124455 tok/s step 10249/19560 | loss 3.304475 (-2.58z)| norm 0.2498 (-0.34z)| lr 4.02e-04 | 4173.27 ms | 32.4% bf16 MFU | 124513 tok/s step 10250/19560 | loss 3.418101 (+0.26z)| norm 0.2273 (-1.05z)| lr 4.02e-04 | 4181.83 ms | 32.3% bf16 MFU | 124556 tok/s val loss 3.373581 HellaSwag: 2945/10042 = 0.293268 step 10251/19560 | loss 3.438694 (+0.77z)| norm 0.2675 (+0.21z)| lr 4.02e-04 | 4173.26 ms | 32.4% bf16 MFU | 124610 tok/s step 10252/19560 | loss 3.339856 (-1.69z)| norm 0.2313 (-0.92z)| lr 4.02e-04 | 4240.49 ms | 31.8% bf16 MFU | 124561 tok/s step 10253/19560 | loss 3.333850 (-1.81z)| norm 0.2537 (-0.22z)| lr 4.02e-04 | 4174.07 ms | 32.3% bf16 MFU | 124614 tok/s step 10254/19560 | loss 3.421298 (+0.36z)| norm 0.2534 (-0.23z)| lr 4.02e-04 | 4186.50 ms | 32.3% bf16 MFU | 124645 tok/s step 10255/19560 | loss 3.417558 (+0.26z)| norm 0.2489 (-0.37z)| lr 4.02e-04 | 4190.60 ms | 32.2% bf16 MFU | 124668 tok/s step 10256/19560 | loss 3.399670 (-0.18z)| norm 0.2274 (-1.04z)| lr 4.02e-04 | 4174.56 ms | 32.3% bf16 MFU | 124714 tok/s step 10257/19560 | loss 3.404394 (-0.06z)| norm 0.2517 (-0.28z)| lr 4.02e-04 | 4177.74 ms | 32.3% bf16 MFU | 
124753 tok/s step 10258/19560 | loss 3.363167 (-1.08z)| norm 0.2428 (-0.56z)| lr 4.02e-04 | 4289.84 ms | 31.5% bf16 MFU | 124626 tok/s step 10259/19560 | loss 3.408079 (+0.04z)| norm 0.2621 (+0.05z)| lr 4.01e-04 | 4204.23 ms | 32.1% bf16 MFU | 124630 tok/s step 10260/19560 | loss 3.370871 (-0.88z)| norm 0.2471 (-0.42z)| lr 4.01e-04 | 4166.59 ms | 32.4% bf16 MFU | 124690 tok/s step 10261/19560 | loss 3.412889 (+0.17z)| norm 0.2450 (-0.48z)| lr 4.01e-04 | 4223.90 ms | 32.0% bf16 MFU | 124662 tok/s step 10262/19560 | loss 3.437394 (+0.78z)| norm 0.2594 (-0.03z)| lr 4.01e-04 | 4167.78 ms | 32.4% bf16 MFU | 124719 tok/s step 10263/19560 | loss 3.410158 (+0.10z)| norm 0.2725 (+0.37z)| lr 4.01e-04 | 4176.09 ms | 32.3% bf16 MFU | 124760 tok/s step 10264/19560 | loss 3.507272 (+2.45z)| norm 0.2458 (-0.46z)| lr 4.01e-04 | 4212.59 ms | 32.1% bf16 MFU | 124745 tok/s step 10265/19560 | loss 3.382158 (-0.61z)| norm 0.2727 (+0.38z)| lr 4.01e-04 | 4186.66 ms | 32.2% bf16 MFU | 124769 tok/s step 10266/19560 | loss 3.376742 (-0.73z)| norm 0.2306 (-0.93z)| lr 4.01e-04 | 4195.44 ms | 32.2% bf16 MFU | 124779 tok/s step 10267/19560 | loss 3.367761 (-0.94z)| norm 0.2312 (-0.91z)| lr 4.01e-04 | 4213.80 ms | 32.0% bf16 MFU | 124761 tok/s step 10268/19560 | loss 3.374291 (-0.77z)| norm 0.2311 (-0.91z)| lr 4.01e-04 | 4174.81 ms | 32.3% bf16 MFU | 124802 tok/s step 10269/19560 | loss 3.450494 (+1.07z)| norm 0.2418 (-0.57z)| lr 4.01e-04 | 4175.11 ms | 32.3% bf16 MFU | 124841 tok/s step 10270/19560 | loss 3.453601 (+1.13z)| norm 0.2385 (-0.67z)| lr 4.01e-04 | 4174.55 ms | 32.3% bf16 MFU | 124878 tok/s step 10271/19560 | loss 3.459553 (+1.26z)| norm 0.2421 (-0.56z)| lr 4.01e-04 | 4174.97 ms | 32.3% bf16 MFU | 124913 tok/s step 10272/19560 | loss 3.427998 (+0.49z)| norm 0.2425 (-0.54z)| lr 4.01e-04 | 4176.15 ms | 32.3% bf16 MFU | 124945 tok/s step 10273/19560 | loss 3.425762 (+0.44z)| norm 0.2383 (-0.67z)| lr 4.01e-04 | 4199.71 ms | 32.1% bf16 MFU | 124940 tok/s step 10274/19560 | loss 3.428457 
(+0.50z)| norm 0.2414 (-0.57z)| lr 4.00e-04 | 4178.98 ms | 32.3% bf16 MFU | 124965 tok/s step 10275/19560 | loss 3.383804 (-0.59z)| norm 0.2507 (-0.28z)| lr 4.00e-04 | 4221.37 ms | 32.0% bf16 MFU | 124927 tok/s step 10276/19560 | loss 3.363188 (-1.09z)| norm 0.2447 (-0.46z)| lr 4.00e-04 | 4195.27 ms | 32.2% bf16 MFU | 124929 tok/s step 10277/19560 | loss 3.405633 (-0.06z)| norm 0.2505 (-0.29z)| lr 4.00e-04 | 4165.16 ms | 32.4% bf16 MFU | 124977 tok/s step 10278/19560 | loss 3.424451 (+0.39z)| norm 0.2171 (-1.32z)| lr 4.00e-04 | 4236.24 ms | 31.9% bf16 MFU | 124916 tok/s step 10279/19560 | loss 3.380261 (-0.67z)| norm 0.2284 (-0.96z)| lr 4.00e-04 | 4174.62 ms | 32.3% bf16 MFU | 124950 tok/s step 10280/19560 | loss 3.405525 (-0.06z)| norm 0.2524 (-0.21z)| lr 4.00e-04 | 4258.03 ms | 31.7% bf16 MFU | 124859 tok/s step 10281/19560 | loss 3.374177 (-0.81z)| norm 0.2329 (-0.81z)| lr 4.00e-04 | 4178.46 ms | 32.3% bf16 MFU | 124889 tok/s step 10282/19560 | loss 3.358983 (-1.16z)| norm 0.2716 (+0.39z)| lr 4.00e-04 | 4173.81 ms | 32.3% bf16 MFU | 124926 tok/s step 10283/19560 | loss 3.377002 (-0.71z)| norm 0.2504 (-0.27z)| lr 4.00e-04 | 4993.08 ms | 27.0% bf16 MFU | 123929 tok/s step 10284/19560 | loss 3.395809 (-0.26z)| norm 0.2483 (-0.34z)| lr 4.00e-04 | 4229.85 ms | 31.9% bf16 MFU | 123930 tok/s step 10285/19560 | loss 3.406959 (+0.01z)| norm 0.2563 (-0.10z)| lr 4.00e-04 | 4164.97 ms | 32.4% bf16 MFU | 124028 tok/s step 10286/19560 | loss 3.343057 (-1.52z)| norm 0.2416 (-0.55z)| lr 4.00e-04 | 4178.56 ms | 32.3% bf16 MFU | 124100 tok/s step 10287/19560 | loss 3.405361 (+0.00z)| norm 0.2695 (+0.66z)| lr 4.00e-04 | 4172.27 ms | 32.4% bf16 MFU | 124178 tok/s step 10288/19560 | loss 3.378255 (-0.66z)| norm 0.2743 (+0.98z)| lr 4.00e-04 | 4181.24 ms | 32.3% bf16 MFU | 124239 tok/s step 10289/19560 | loss 3.387334 (-0.44z)| norm 0.2532 (-0.19z)| lr 3.99e-04 | 4322.84 ms | 31.2% bf16 MFU | 124091 tok/s step 10290/19560 | loss 3.407595 (+0.08z)| norm 0.2571 (+0.04z)| lr 3.99e-04 | 
4170.74 ms | 32.4% bf16 MFU | 124172 tok/s step 10291/19560 | loss 3.358195 (-1.19z)| norm 0.2404 (-0.90z)| lr 3.99e-04 | 4215.15 ms | 32.0% bf16 MFU | 124182 tok/s step 10292/19560 | loss 3.395914 (-0.22z)| norm 0.2712 (+0.88z)| lr 3.99e-04 | 4180.20 ms | 32.3% bf16 MFU | 124244 tok/s step 10293/19560 | loss 3.367642 (-0.93z)| norm 0.2465 (-0.55z)| lr 3.99e-04 | 4327.43 ms | 31.2% bf16 MFU | 124090 tok/s step 10294/19560 | loss 3.364447 (-1.00z)| norm 0.2290 (-1.57z)| lr 3.99e-04 | 4188.50 ms | 32.2% bf16 MFU | 124144 tok/s step 10295/19560 | loss 3.439289 (+0.92z)| norm 0.2368 (-1.09z)| lr 3.99e-04 | 4167.25 ms | 32.4% bf16 MFU | 124227 tok/s step 10296/19560 | loss 3.367875 (-0.90z)| norm 0.2544 (-0.03z)| lr 3.99e-04 | 4223.96 ms | 32.0% bf16 MFU | 124222 tok/s step 10297/19560 | loss 3.324876 (-1.96z)| norm 0.2510 (-0.23z)| lr 3.99e-04 | 4176.50 ms | 32.3% bf16 MFU | 124287 tok/s step 10298/19560 | loss 3.428467 (+0.68z)| norm 0.2362 (-1.13z)| lr 3.99e-04 | 4170.11 ms | 32.4% bf16 MFU | 124359 tok/s step 10299/19560 | loss 3.430337 (+0.72z)| norm 0.2721 (+1.09z)| lr 3.99e-04 | 4179.87 ms | 32.3% bf16 MFU | 124413 tok/s step 10300/19560 | loss 3.386640 (-0.40z)| norm 0.2454 (-0.55z)| lr 3.99e-04 | 4177.61 ms | 32.3% bf16 MFU | 124467 tok/s step 10301/19560 | loss 3.339324 (-1.59z)| norm 0.2629 (+0.53z)| lr 3.99e-04 | 4183.13 ms | 32.3% bf16 MFU | 124511 tok/s step 10302/19560 | loss 3.359453 (-1.06z)| norm 0.2750 (+1.27z)| lr 3.99e-04 | 4166.83 ms | 32.4% bf16 MFU | 124576 tok/s step 10303/19560 | loss 3.414416 (+0.32z)| norm 0.2504 (-0.23z)| lr 3.99e-04 | 4177.99 ms | 32.3% bf16 MFU | 124622 tok/s step 10304/19560 | loss 3.410495 (+0.23z)| norm 0.2425 (-0.72z)| lr 3.98e-04 | 4172.35 ms | 32.4% bf16 MFU | 124674 tok/s step 10305/19560 | loss 3.383908 (-0.44z)| norm 0.2614 (+0.45z)| lr 3.98e-04 | 4168.35 ms | 32.4% bf16 MFU | 124729 tok/s step 10306/19560 | loss 3.369832 (-0.79z)| norm 0.2384 (-0.96z)| lr 3.98e-04 | 4192.08 ms | 32.2% bf16 MFU | 124746 tok/s step 
10307/19560 | loss 3.410770 (+0.26z)| norm 0.2463 (-0.47z)| lr 3.98e-04 | 4220.24 ms | 32.0% bf16 MFU | 124720 tok/s step 10308/19560 | loss 3.356295 (-1.13z)| norm 0.2463 (-0.47z)| lr 3.98e-04 | 4168.78 ms | 32.4% bf16 MFU | 124772 tok/s step 10309/19560 | loss 3.349729 (-1.28z)| norm 0.2382 (-0.96z)| lr 3.98e-04 | 4167.08 ms | 32.4% bf16 MFU | 124825 tok/s step 10310/19560 | loss 3.376333 (-0.59z)| norm 0.2365 (-1.05z)| lr 3.98e-04 | 4195.69 ms | 32.2% bf16 MFU | 124831 tok/s step 10311/19560 | loss 3.443649 (+1.12z)| norm 0.2267 (-1.61z)| lr 3.98e-04 | 4183.14 ms | 32.3% bf16 MFU | 124856 tok/s step 10312/19560 | loss 3.399040 (-0.02z)| norm 0.2797 (+1.56z)| lr 3.98e-04 | 4177.21 ms | 32.3% bf16 MFU | 124889 tok/s step 10313/19560 | loss 3.346467 (-1.34z)| norm 0.2389 (-0.88z)| lr 3.98e-04 | 4196.89 ms | 32.2% bf16 MFU | 124891 tok/s step 10314/19560 | loss 3.422548 (+0.59z)| norm 0.2621 (+0.51z)| lr 3.98e-04 | 4171.46 ms | 32.4% bf16 MFU | 124931 tok/s step 10315/19560 | loss 3.336332 (-1.57z)| norm 0.2490 (-0.28z)| lr 3.98e-04 | 4194.66 ms | 32.2% bf16 MFU | 124933 tok/s step 10316/19560 | loss 3.403658 (+0.13z)| norm 0.2385 (-0.91z)| lr 3.98e-04 | 4282.25 ms | 31.5% bf16 MFU | 124808 tok/s step 10317/19560 | loss 3.391506 (-0.17z)| norm 0.4816 (+8.68z)| lr 3.98e-04 | 4183.13 ms | 32.3% bf16 MFU | 124835 tok/s step 10318/19560 | loss 3.419042 (+0.53z)| norm 0.2398 (-0.59z)| lr 3.97e-04 | 4203.60 ms | 32.1% bf16 MFU | 124829 tok/s step 10319/19560 | loss 3.412676 (+0.37z)| norm 0.2375 (-0.67z)| lr 3.97e-04 | 4175.12 ms | 32.3% bf16 MFU | 124866 tok/s step 10320/19560 | loss 3.467926 (+1.75z)| norm 0.2381 (-0.64z)| lr 3.97e-04 | 4180.53 ms | 32.3% bf16 MFU | 124894 tok/s step 10321/19560 | loss 3.452490 (+1.34z)| norm 0.2723 (+0.65z)| lr 3.97e-04 | 4180.72 ms | 32.3% bf16 MFU | 124919 tok/s step 10322/19560 | loss 3.325489 (-1.83z)| norm 0.2334 (-0.83z)| lr 3.97e-04 | 4166.29 ms | 32.4% bf16 MFU | 124965 tok/s step 10323/19560 | loss 3.319877 (-1.92z)| norm 
0.2369 (-0.69z)| lr 3.97e-04 | 4175.37 ms | 32.3% bf16 MFU | 124995 tok/s step 10324/19560 | loss 3.406427 (+0.22z)| norm 0.2638 (+0.33z)| lr 3.97e-04 | 4167.45 ms | 32.4% bf16 MFU | 125036 tok/s step 10325/19560 | loss 3.414652 (+0.42z)| norm 0.2354 (-0.75z)| lr 3.97e-04 | 4171.98 ms | 32.4% bf16 MFU | 125068 tok/s step 10326/19560 | loss 3.421372 (+0.59z)| norm 0.2654 (+0.41z)| lr 3.97e-04 | 4179.83 ms | 32.3% bf16 MFU | 125086 tok/s step 10327/19560 | loss 3.410189 (+0.30z)| norm 0.2466 (-0.31z)| lr 3.97e-04 | 4262.79 ms | 31.7% bf16 MFU | 124981 tok/s step 10328/19560 | loss 3.370759 (-0.67z)| norm 0.2580 (+0.12z)| lr 3.97e-04 | 4171.34 ms | 32.4% bf16 MFU | 125016 tok/s step 10329/19560 | loss 3.533551 (+3.30z)| norm 0.2608 (+0.23z)| lr 3.97e-04 | 4195.32 ms | 32.2% bf16 MFU | 125014 tok/s step 10330/19560 | loss 3.358539 (-0.95z)| norm 0.2428 (-0.47z)| lr 3.97e-04 | 4169.02 ms | 32.4% bf16 MFU | 125051 tok/s step 10331/19560 | loss 3.388521 (-0.20z)| norm 0.2397 (-0.58z)| lr 3.97e-04 | 4173.06 ms | 32.4% bf16 MFU | 125081 tok/s step 10332/19560 | loss 3.420649 (+0.58z)| norm 0.2751 (+0.79z)| lr 3.97e-04 | 4175.50 ms | 32.3% bf16 MFU | 125105 tok/s step 10333/19560 | loss 3.439791 (+1.04z)| norm 0.2694 (+0.57z)| lr 3.96e-04 | 4164.35 ms | 32.4% bf16 MFU | 125144 tok/s step 10334/19560 | loss 3.385379 (-0.27z)| norm 0.2816 (+1.03z)| lr 3.96e-04 | 4175.13 ms | 32.3% bf16 MFU | 125166 tok/s step 10335/19560 | loss 3.407696 (+0.29z)| norm 0.2806 (+0.98z)| lr 3.96e-04 | 4178.71 ms | 32.3% bf16 MFU | 125181 tok/s step 10336/19560 | loss 3.358344 (-0.94z)| norm 0.2767 (+0.83z)| lr 3.96e-04 | 4183.94 ms | 32.3% bf16 MFU | 125187 tok/s step 10337/19560 | loss 3.416811 (+0.53z)| norm 0.2638 (+0.33z)| lr 3.96e-04 | 4216.60 ms | 32.0% bf16 MFU | 125145 tok/s step 10338/19560 | loss 3.358130 (-0.95z)| norm 0.2474 (-0.29z)| lr 3.96e-04 | 4166.10 ms | 32.4% bf16 MFU | 125180 tok/s step 10339/19560 | loss 3.383382 (-0.31z)| norm 0.2905 (+1.38z)| lr 3.96e-04 | 4185.34 ms | 
32.3% bf16 MFU | 125184 tok/s step 10340/19560 | loss 3.431200 (+0.89z)| norm 0.2621 (+0.27z)| lr 3.96e-04 | 4184.45 ms | 32.3% bf16 MFU | 125190 tok/s step 10341/19560 | loss 3.380002 (-0.41z)| norm 0.2725 (+0.67z)| lr 3.96e-04 | 4183.12 ms | 32.3% bf16 MFU | 125197 tok/s step 10342/19560 | loss 3.407681 (+0.29z)| norm 0.2481 (-0.27z)| lr 3.96e-04 | 4185.97 ms | 32.3% bf16 MFU | 125200 tok/s step 10343/19560 | loss 3.386252 (-0.25z)| norm 0.2771 (+0.84z)| lr 3.96e-04 | 4168.02 ms | 32.4% bf16 MFU | 125229 tok/s step 10344/19560 | loss 3.379414 (-0.41z)| norm 0.2713 (+0.62z)| lr 3.96e-04 | 4182.99 ms | 32.3% bf16 MFU | 125235 tok/s step 10345/19560 | loss 3.469331 (+1.85z)| norm 0.2802 (+0.96z)| lr 3.96e-04 | 4179.32 ms | 32.3% bf16 MFU | 125245 tok/s step 10346/19560 | loss 3.450471 (+1.37z)| norm 0.2803 (+0.96z)| lr 3.96e-04 | 4187.51 ms | 32.2% bf16 MFU | 125243 tok/s step 10347/19560 | loss 3.393447 (-0.07z)| norm 0.2531 (-0.09z)| lr 3.96e-04 | 4191.69 ms | 32.2% bf16 MFU | 125235 tok/s step 10348/19560 | loss 3.392655 (-0.09z)| norm 0.2798 (+0.94z)| lr 3.95e-04 | 4176.09 ms | 32.3% bf16 MFU | 125250 tok/s step 10349/19560 | loss 3.372531 (-0.59z)| norm 0.2695 (+0.54z)| lr 3.95e-04 | 4178.42 ms | 32.3% bf16 MFU | 125262 tok/s step 10350/19560 | loss 3.391170 (-0.11z)| norm 0.2849 (+1.15z)| lr 3.95e-04 | 4176.73 ms | 32.3% bf16 MFU | 125275 tok/s step 10351/19560 | loss 3.390846 (-0.12z)| norm 0.2629 (+0.29z)| lr 3.95e-04 | 4177.25 ms | 32.3% bf16 MFU | 125287 tok/s step 10352/19560 | loss 3.367917 (-0.69z)| norm 0.2630 (+0.30z)| lr 3.95e-04 | 4179.33 ms | 32.3% bf16 MFU | 125295 tok/s step 10353/19560 | loss 3.415717 (+0.53z)| norm 0.2493 (-0.24z)| lr 3.95e-04 | 4170.52 ms | 32.4% bf16 MFU | 125316 tok/s step 10354/19560 | loss 3.333502 (-1.57z)| norm 0.2653 (+0.38z)| lr 3.95e-04 | 4177.84 ms | 32.3% bf16 MFU | 125324 tok/s step 10355/19560 | loss 3.396818 (+0.06z)| norm 0.2478 (-0.30z)| lr 3.95e-04 | 4167.16 ms | 32.4% bf16 MFU | 125349 tok/s step 10356/19560 
| loss 3.408113 (+0.34z)| norm 0.2591 (+0.14z)| lr 3.95e-04 | 4179.27 ms | 32.3% bf16 MFU | 125354 tok/s step 10357/19560 | loss 3.429743 (+0.90z)| norm 0.2320 (-0.90z)| lr 3.95e-04 | 4177.92 ms | 32.3% bf16 MFU | 125361 tok/s step 10358/19560 | loss 3.346685 (-1.23z)| norm 0.2748 (+0.76z)| lr 3.95e-04 | 4166.67 ms | 32.4% bf16 MFU | 125384 tok/s step 10359/19560 | loss 3.369563 (-0.64z)| norm 0.2371 (-0.70z)| lr 3.95e-04 | 4171.54 ms | 32.4% bf16 MFU | 125399 tok/s step 10360/19560 | loss 3.347970 (-1.18z)| norm 0.2741 (+0.74z)| lr 3.95e-04 | 4183.99 ms | 32.3% bf16 MFU | 125395 tok/s step 10361/19560 | loss 3.390951 (-0.08z)| norm 0.2615 (+0.26z)| lr 3.95e-04 | 4168.73 ms | 32.4% bf16 MFU | 125413 tok/s step 10362/19560 | loss 3.381908 (-0.30z)| norm 0.2371 (-0.68z)| lr 3.95e-04 | 4170.87 ms | 32.4% bf16 MFU | 125428 tok/s step 10363/19560 | loss 3.381299 (-0.32z)| norm 0.2505 (-0.16z)| lr 3.94e-04 | 4192.81 ms | 32.2% bf16 MFU | 125408 tok/s step 10364/19560 | loss 3.364692 (-0.75z)| norm 0.2412 (-0.51z)| lr 3.94e-04 | 4176.88 ms | 32.3% bf16 MFU | 125414 tok/s step 10365/19560 | loss 3.466870 (+1.91z)| norm 0.2556 (+0.05z)| lr 3.94e-04 | 4173.12 ms | 32.4% bf16 MFU | 125425 tok/s step 10366/19560 | loss 3.356804 (-0.96z)| norm 0.2830 (+1.12z)| lr 3.94e-04 | 4220.51 ms | 32.0% bf16 MFU | 125365 tok/s step 10367/19560 | loss 3.412336 (+0.50z)| norm 0.2437 (-0.42z)| lr 3.94e-04 | 4176.48 ms | 32.3% bf16 MFU | 125373 tok/s step 10368/19560 | loss 3.467675 (+1.92z)| norm 0.2733 (+0.74z)| lr 3.94e-04 | 4172.52 ms | 32.4% bf16 MFU | 125387 tok/s step 10369/19560 | loss 3.446611 (+1.36z)| norm 0.2417 (-0.50z)| lr 3.94e-04 | 4176.18 ms | 32.3% bf16 MFU | 125395 tok/s step 10370/19560 | loss 3.382565 (-0.30z)| norm 0.2613 (+0.27z)| lr 3.94e-04 | 4192.46 ms | 32.2% bf16 MFU | 125378 tok/s step 10371/19560 | loss 3.405386 (+0.28z)| norm 0.2399 (-0.57z)| lr 3.94e-04 | 4365.22 ms | 30.9% bf16 MFU | 125115 tok/s step 10372/19560 | loss 3.439854 (+1.16z)| norm 0.2733 (+0.73z)| 
lr 3.94e-04 | 4167.73 ms | 32.4% bf16 MFU | 125149 tok/s step 10373/19560 | loss 3.379633 (-0.39z)| norm 0.2460 (-0.34z)| lr 3.94e-04 | 4170.74 ms | 32.4% bf16 MFU | 125177 tok/s step 10374/19560 | loss 3.393402 (-0.04z)| norm 0.2773 (+0.87z)| lr 3.94e-04 | 4173.90 ms | 32.3% bf16 MFU | 125198 tok/s step 10375/19560 | loss 3.431845 (+0.94z)| norm 0.2542 (-0.03z)| lr 3.94e-04 | 4169.71 ms | 32.4% bf16 MFU | 125225 tok/s step 10376/19560 | loss 3.323506 (-1.89z)| norm 0.2774 (+0.86z)| lr 3.94e-04 | 4190.20 ms | 32.2% bf16 MFU | 125220 tok/s step 10377/19560 | loss 3.369659 (-0.71z)| norm 0.2821 (+1.03z)| lr 3.94e-04 | 4178.95 ms | 32.3% bf16 MFU | 125232 tok/s step 10378/19560 | loss 3.414891 (+0.50z)| norm 0.2768 (+0.82z)| lr 3.93e-04 | 4165.56 ms | 32.4% bf16 MFU | 125264 tok/s step 10379/19560 | loss 3.443618 (+1.26z)| norm 0.3086 (+2.01z)| lr 3.93e-04 | 4172.89 ms | 32.4% bf16 MFU | 125282 tok/s step 10380/19560 | loss 3.365843 (-0.82z)| norm 0.2767 (+0.77z)| lr 3.93e-04 | 4192.46 ms | 32.2% bf16 MFU | 125271 tok/s step 10381/19560 | loss 3.409326 (+0.33z)| norm 0.2946 (+1.44z)| lr 3.93e-04 | 4171.02 ms | 32.4% bf16 MFU | 125292 tok/s step 10382/19560 | loss 3.393487 (-0.09z)| norm 0.2553 (-0.06z)| lr 3.93e-04 | 4184.39 ms | 32.3% bf16 MFU | 125293 tok/s step 10383/19560 | loss 3.345450 (-1.37z)| norm 0.2732 (+0.61z)| lr 3.93e-04 | 4164.11 ms | 32.4% bf16 MFU | 125323 tok/s step 10384/19560 | loss 3.370750 (-0.68z)| norm 0.2571 (-0.01z)| lr 3.93e-04 | 4201.93 ms | 32.1% bf16 MFU | 125296 tok/s step 10385/19560 | loss 3.445664 (+1.32z)| norm 0.2537 (-0.14z)| lr 3.93e-04 | 4164.05 ms | 32.4% bf16 MFU | 125326 tok/s step 10386/19560 | loss 3.423295 (+0.71z)| norm 0.2568 (-0.02z)| lr 3.93e-04 | 4173.78 ms | 32.3% bf16 MFU | 125341 tok/s step 10387/19560 | loss 3.409955 (+0.35z)| norm 0.2450 (-0.47z)| lr 3.93e-04 | 4179.64 ms | 32.3% bf16 MFU | 125346 tok/s step 10388/19560 | loss 3.400832 (+0.10z)| norm 0.2589 (+0.06z)| lr 3.93e-04 | 4178.88 ms | 32.3% bf16 MFU | 
125351 tok/s step 10389/19560 | loss 3.357443 (-1.05z)| norm 0.2620 (+0.17z)| lr 3.93e-04 | 4166.32 ms | 32.4% bf16 MFU | 125376 tok/s step 10390/19560 | loss 3.371573 (-0.66z)| norm 0.2889 (+1.19z)| lr 3.93e-04 | 4181.55 ms | 32.3% bf16 MFU | 125376 tok/s step 10391/19560 | loss 3.443754 (+1.26z)| norm 0.2675 (+0.37z)| lr 3.93e-04 | 4167.48 ms | 32.4% bf16 MFU | 125398 tok/s step 10392/19560 | loss 3.382640 (-0.36z)| norm 0.2909 (+1.24z)| lr 3.92e-04 | 4179.71 ms | 32.3% bf16 MFU | 125400 tok/s step 10393/19560 | loss 3.380968 (-0.40z)| norm 0.2652 (+0.27z)| lr 3.92e-04 | 4182.97 ms | 32.3% bf16 MFU | 125396 tok/s step 10394/19560 | loss 3.453899 (+1.58z)| norm 0.2628 (+0.17z)| lr 3.92e-04 | 4182.64 ms | 32.3% bf16 MFU | 125394 tok/s step 10395/19560 | loss 3.380522 (-0.43z)| norm 0.2555 (-0.11z)| lr 3.92e-04 | 4212.25 ms | 32.1% bf16 MFU | 125348 tok/s step 10396/19560 | loss 3.388715 (-0.21z)| norm 0.2561 (-0.10z)| lr 3.92e-04 | 4172.28 ms | 32.4% bf16 MFU | 125363 tok/s step 10397/19560 | loss 3.386841 (-0.25z)| norm 0.2468 (-0.46z)| lr 3.92e-04 | 4191.34 ms | 32.2% bf16 MFU | 125350 tok/s step 10398/19560 | loss 3.383039 (-0.34z)| norm 0.2528 (-0.23z)| lr 3.92e-04 | 4175.61 ms | 32.3% bf16 MFU | 125360 tok/s step 10399/19560 | loss 3.348601 (-1.29z)| norm 0.2525 (-0.24z)| lr 3.92e-04 | 4174.93 ms | 32.3% bf16 MFU | 125371 tok/s step 10400/19560 | loss 3.368591 (-0.71z)| norm 0.2624 (+0.13z)| lr 3.92e-04 | 4197.07 ms | 32.2% bf16 MFU | 125348 tok/s step 10401/19560 | loss 3.402211 (+0.24z)| norm 0.2542 (-0.19z)| lr 3.92e-04 | 4178.22 ms | 32.3% bf16 MFU | 125355 tok/s step 10402/19560 | loss 3.400032 (+0.18z)| norm 0.2544 (-0.19z)| lr 3.92e-04 | 4174.85 ms | 32.3% bf16 MFU | 125366 tok/s step 10403/19560 | loss 3.349062 (-1.25z)| norm 0.2413 (-0.69z)| lr 3.92e-04 | 4392.42 ms | 30.7% bf16 MFU | 125066 tok/s step 10404/19560 | loss 3.402195 (+0.24z)| norm 0.2746 (+0.59z)| lr 3.92e-04 | 4649.16 ms | 29.0% bf16 MFU | 124451 tok/s step 10405/19560 | loss 3.452385 
(+1.64z)| norm 0.2487 (-0.41z)| lr 3.92e-04 | 4518.46 ms | 29.9% bf16 MFU | 124030 tok/s step 10406/19560 | loss 3.441594 (+1.32z)| norm 0.2864 (+1.03z)| lr 3.92e-04 | 4437.68 ms | 30.4% bf16 MFU | 123736 tok/s step 10407/19560 | loss 3.424655 (+0.84z)| norm 0.2616 (+0.06z)| lr 3.91e-04 | 4330.69 ms | 31.2% bf16 MFU | 123603 tok/s step 10408/19560 | loss 3.353675 (-1.12z)| norm 0.2545 (-0.22z)| lr 3.91e-04 | 4326.20 ms | 31.2% bf16 MFU | 123482 tok/s step 10409/19560 | loss 3.516151 (+3.22z)| norm 0.2747 (+0.56z)| lr 3.91e-04 | 4190.00 ms | 32.2% bf16 MFU | 123564 tok/s step 10410/19560 | loss 3.335016 (-1.58z)| norm 0.2476 (-0.50z)| lr 3.91e-04 | 4169.80 ms | 32.4% bf16 MFU | 123673 tok/s step 10411/19560 | loss 3.425015 (+0.78z)| norm 0.2649 (+0.17z)| lr 3.91e-04 | 4214.18 ms | 32.0% bf16 MFU | 123710 tok/s step 10412/19560 | loss 3.413607 (+0.48z)| norm 0.2668 (+0.24z)| lr 3.91e-04 | 4270.69 ms | 31.6% bf16 MFU | 123662 tok/s step 10413/19560 | loss 3.385116 (-0.27z)| norm 0.2354 (-0.98z)| lr 3.91e-04 | 4162.06 ms | 32.4% bf16 MFU | 123778 tok/s step 10414/19560 | loss 3.400129 (+0.12z)| norm 0.2698 (+0.36z)| lr 3.91e-04 | 4180.23 ms | 32.3% bf16 MFU | 123860 tok/s step 10415/19560 | loss 3.414068 (+0.48z)| norm 0.2597 (-0.03z)| lr 3.91e-04 | 4164.89 ms | 32.4% bf16 MFU | 123961 tok/s step 10416/19560 | loss 3.481774 (+2.22z)| norm 0.2468 (-0.53z)| lr 3.91e-04 | 4207.04 ms | 32.1% bf16 MFU | 123994 tok/s step 10417/19560 | loss 3.410538 (+0.36z)| norm 0.2807 (+0.79z)| lr 3.91e-04 | 4236.78 ms | 31.9% bf16 MFU | 123982 tok/s step 10418/19560 | loss 3.357721 (-1.00z)| norm 0.2750 (+0.56z)| lr 3.91e-04 | 4161.46 ms | 32.4% bf16 MFU | 124082 tok/s step 10419/19560 | loss 3.421194 (+0.63z)| norm 0.2548 (-0.24z)| lr 3.91e-04 | 4172.51 ms | 32.4% bf16 MFU | 124160 tok/s step 10420/19560 | loss 3.388567 (-0.21z)| norm 0.2937 (+1.28z)| lr 3.91e-04 | 4176.15 ms | 32.3% bf16 MFU | 124230 tok/s step 10421/19560 | loss 3.446233 (+1.26z)| norm 0.2555 (-0.22z)| lr 3.91e-04 | 
4164.42 ms | 32.4% bf16 MFU | 124313 tok/s step 10422/19560 | loss 3.447088 (+1.26z)| norm 0.2753 (+0.55z)| lr 3.90e-04 | 4171.93 ms | 32.4% bf16 MFU | 124381 tok/s step 10423/19560 | loss 3.422427 (+0.63z)| norm 0.2273 (-1.33z)| lr 3.90e-04 | 4170.85 ms | 32.4% bf16 MFU | 124447 tok/s step 10424/19560 | loss 3.414379 (+0.42z)| norm 0.2534 (-0.31z)| lr 3.90e-04 | 4179.28 ms | 32.3% bf16 MFU | 124497 tok/s step 10425/19560 | loss 3.481718 (+2.12z)| norm 0.2634 (+0.08z)| lr 3.90e-04 | 4168.91 ms | 32.4% bf16 MFU | 124560 tok/s step 10426/19560 | loss 3.429965 (+0.78z)| norm 0.2451 (-0.64z)| lr 3.90e-04 | 4178.84 ms | 32.3% bf16 MFU | 124605 tok/s step 10427/19560 | loss 3.454174 (+1.40z)| norm 0.2577 (-0.15z)| lr 3.90e-04 | 4184.08 ms | 32.3% bf16 MFU | 124640 tok/s step 10428/19560 | loss 3.437647 (+0.96z)| norm 0.2517 (-0.38z)| lr 3.90e-04 | 4172.28 ms | 32.4% bf16 MFU | 124691 tok/s step 10429/19560 | loss 3.359334 (-1.05z)| norm 0.2644 (+0.11z)| lr 3.90e-04 | 4170.84 ms | 32.4% bf16 MFU | 124742 tok/s step 10430/19560 | loss 3.331490 (-1.75z)| norm 0.2322 (-1.14z)| lr 3.90e-04 | 4172.07 ms | 32.4% bf16 MFU | 124788 tok/s step 10431/19560 | loss 3.529598 (+3.15z)| norm 0.2957 (+1.33z)| lr 3.90e-04 | 4169.59 ms | 32.4% bf16 MFU | 124836 tok/s step 10432/19560 | loss 3.420169 (+0.47z)| norm 0.2618 (+0.01z)| lr 3.90e-04 | 4163.96 ms | 32.4% bf16 MFU | 124890 tok/s step 10433/19560 | loss 3.445363 (+1.07z)| norm 0.2697 (+0.31z)| lr 3.90e-04 | 4180.13 ms | 32.3% bf16 MFU | 124916 tok/s step 10434/19560 | loss 3.356244 (-1.10z)| norm 0.2402 (-0.84z)| lr 3.90e-04 | 4183.54 ms | 32.3% bf16 MFU | 124937 tok/s step 10435/19560 | loss 3.427138 (+0.62z)| norm 0.2911 (+1.13z)| lr 3.90e-04 | 4165.53 ms | 32.4% bf16 MFU | 124983 tok/s step 10436/19560 | loss 3.466662 (+1.55z)| norm 0.2432 (-0.73z)| lr 3.90e-04 | 4174.84 ms | 32.3% bf16 MFU | 125013 tok/s step 10437/19560 | loss 3.383193 (-0.47z)| norm 0.2525 (-0.38z)| lr 3.89e-04 | 4184.99 ms | 32.3% bf16 MFU | 125026 tok/s step 
10438/19560 | loss 3.352107 (-1.22z)| norm 0.2547 (-0.30z)| lr 3.89e-04 | 4179.71 ms | 32.3% bf16 MFU | 125047 tok/s step 10439/19560 | loss 3.373950 (-0.68z)| norm 0.2503 (-0.48z)| lr 3.89e-04 | 4171.14 ms | 32.4% bf16 MFU | 125079 tok/s step 10440/19560 | loss 3.437803 (+0.86z)| norm 0.2515 (-0.42z)| lr 3.89e-04 | 4177.68 ms | 32.3% bf16 MFU | 125100 tok/s step 10441/19560 | loss 3.504898 (+2.42z)| norm 0.2484 (-0.55z)| lr 3.89e-04 | 4174.53 ms | 32.3% bf16 MFU | 125125 tok/s step 10442/19560 | loss 3.433023 (+0.70z)| norm 0.2543 (-0.32z)| lr 3.89e-04 | 4161.97 ms | 32.4% bf16 MFU | 125167 tok/s step 10443/19560 | loss 3.421637 (+0.42z)| norm 0.2677 (+0.21z)| lr 3.89e-04 | 4173.90 ms | 32.3% bf16 MFU | 125189 tok/s step 10444/19560 | loss 3.485043 (+1.90z)| norm 0.2795 (+0.66z)| lr 3.89e-04 | 4178.09 ms | 32.3% bf16 MFU | 125204 tok/s step 10445/19560 | loss 3.479158 (+1.72z)| norm 0.2603 (-0.04z)| lr 3.89e-04 | 4175.89 ms | 32.3% bf16 MFU | 125221 tok/s step 10446/19560 | loss 3.538060 (+2.97z)| norm 0.2911 (+1.82z)| lr 3.89e-04 | 4165.87 ms | 32.4% bf16 MFU | 125253 tok/s step 10447/19560 | loss 3.409155 (+0.06z)| norm 0.2485 (-0.80z)| lr 3.89e-04 | 4168.69 ms | 32.4% bf16 MFU | 125279 tok/s step 10448/19560 | loss 3.410948 (+0.11z)| norm 0.2694 (+0.48z)| lr 3.89e-04 | 4185.07 ms | 32.3% bf16 MFU | 125278 tok/s step 10449/19560 | loss 3.391505 (-0.32z)| norm 0.2388 (-1.41z)| lr 3.89e-04 | 4173.80 ms | 32.3% bf16 MFU | 125295 tok/s step 10450/19560 | loss 3.398230 (-0.18z)| norm 0.2444 (-1.07z)| lr 3.89e-04 | 4163.85 ms | 32.4% bf16 MFU | 125326 tok/s step 10451/19560 | loss 3.463151 (+1.31z)| norm 0.2481 (-0.85z)| lr 3.88e-04 | 4167.72 ms | 32.4% bf16 MFU | 125350 tok/s step 10452/19560 | loss 3.382957 (-0.56z)| norm 0.2502 (-0.71z)| lr 3.88e-04 | 4169.27 ms | 32.4% bf16 MFU | 125370 tok/s step 10453/19560 | loss 3.403886 (-0.07z)| norm 0.2524 (-0.59z)| lr 3.88e-04 | 4171.57 ms | 32.4% bf16 MFU | 125385 tok/s step 10454/19560 | loss 3.452476 (+1.05z)| norm 
0.2843 (+1.41z)| lr 3.88e-04 | 4162.45 ms | 32.4% bf16 MFU | 125414 tok/s step 10455/19560 | loss 3.508585 (+2.29z)| norm 0.2699 (+0.50z)| lr 3.88e-04 | 4166.08 ms | 32.4% bf16 MFU | 125436 tok/s step 10456/19560 | loss 3.441684 (+0.76z)| norm 0.2453 (-1.04z)| lr 3.88e-04 | 4188.92 ms | 32.2% bf16 MFU | 125422 tok/s step 10457/19560 | loss 3.513403 (+2.42z)| norm 0.2849 (+1.42z)| lr 3.88e-04 | 4169.89 ms | 32.4% bf16 MFU | 125437 tok/s step 10458/19560 | loss 3.376180 (-0.75z)| norm 0.2521 (-0.63z)| lr 3.88e-04 | 4170.10 ms | 32.4% bf16 MFU | 125452 tok/s step 10459/19560 | loss 3.423423 (+0.34z)| norm 0.2990 (+2.24z)| lr 3.88e-04 | 4168.88 ms | 32.4% bf16 MFU | 125467 tok/s step 10460/19560 | loss 3.429404 (+0.47z)| norm 0.2635 (+0.06z)| lr 3.88e-04 | 4174.14 ms | 32.3% bf16 MFU | 125474 tok/s step 10461/19560 | loss 3.449062 (+0.92z)| norm 0.2970 (+2.08z)| lr 3.88e-04 | 4160.73 ms | 32.5% bf16 MFU | 125501 tok/s step 10462/19560 | loss 3.476413 (+1.53z)| norm 0.2656 (+0.18z)| lr 3.88e-04 | 4170.97 ms | 32.4% bf16 MFU | 125511 tok/s step 10463/19560 | loss 3.474191 (+1.45z)| norm 0.2619 (-0.04z)| lr 3.88e-04 | 4167.52 ms | 32.4% bf16 MFU | 125525 tok/s step 10464/19560 | loss 3.416261 (+0.13z)| norm 0.2589 (-0.22z)| lr 3.88e-04 | 4167.04 ms | 32.4% bf16 MFU | 125540 tok/s step 10465/19560 | loss 3.397000 (-0.31z)| norm 0.2565 (-0.35z)| lr 3.88e-04 | 4165.10 ms | 32.4% bf16 MFU | 125557 tok/s step 10466/19560 | loss 3.364158 (-1.06z)| norm 0.2425 (-1.21z)| lr 3.87e-04 | 4176.34 ms | 32.3% bf16 MFU | 125556 tok/s step 10467/19560 | loss 3.427037 (+0.37z)| norm 0.2645 (+0.15z)| lr 3.87e-04 | 4168.35 ms | 32.4% bf16 MFU | 125567 tok/s step 10468/19560 | loss 3.392878 (-0.40z)| norm 0.2283 (-2.05z)| lr 3.87e-04 | 4172.63 ms | 32.4% bf16 MFU | 125571 tok/s step 10469/19560 | loss 3.380034 (-0.70z)| norm 0.2697 (+0.49z)| lr 3.87e-04 | 4169.12 ms | 32.4% bf16 MFU | 125580 tok/s step 10470/19560 | loss 3.399766 (-0.24z)| norm 0.2473 (-0.88z)| lr 3.87e-04 | 4162.53 ms | 
32.4% bf16 MFU | 125599 tok/s step 10471/19560 | loss 3.363018 (-1.08z)| norm 0.2549 (-0.41z)| lr 3.87e-04 | 4168.74 ms | 32.4% bf16 MFU | 125607 tok/s step 10472/19560 | loss 3.375812 (-0.78z)| norm 0.2508 (-0.65z)| lr 3.87e-04 | 4169.45 ms | 32.4% bf16 MFU | 125614 tok/s step 10473/19560 | loss 3.413169 (+0.08z)| norm 0.2554 (-0.36z)| lr 3.87e-04 | 4203.44 ms | 32.1% bf16 MFU | 125570 tok/s step 10474/19560 | loss 3.386155 (-0.53z)| norm 0.2669 (+0.35z)| lr 3.87e-04 | 4150.50 ms | 32.5% bf16 MFU | 125607 tok/s step 10475/19560 | loss 3.419055 (+0.22z)| norm 0.2594 (-0.11z)| lr 3.87e-04 | 4166.08 ms | 32.4% bf16 MFU | 125619 tok/s step 10476/19560 | loss 3.412992 (+0.08z)| norm 0.2561 (-0.31z)| lr 3.87e-04 | 4152.62 ms | 32.5% bf16 MFU | 125651 tok/s step 10477/19560 | loss 3.360294 (-1.13z)| norm 0.2606 (-0.02z)| lr 3.87e-04 | 4159.11 ms | 32.5% bf16 MFU | 125671 tok/s step 10478/19560 | loss 3.428671 (+0.43z)| norm 0.2568 (-0.25z)| lr 3.87e-04 | 4164.88 ms | 32.4% bf16 MFU | 125682 tok/s step 10479/19560 | loss 3.380625 (-0.67z)| norm 0.2598 (-0.05z)| lr 3.87e-04 | 4162.60 ms | 32.4% bf16 MFU | 125696 tok/s step 10480/19560 | loss 3.362838 (-1.07z)| norm 0.2628 (+0.13z)| lr 3.87e-04 | 4170.90 ms | 32.4% bf16 MFU | 125696 tok/s step 10481/19560 | loss 3.401869 (-0.18z)| norm 0.2636 (+0.18z)| lr 3.86e-04 | 4156.42 ms | 32.5% bf16 MFU | 125718 tok/s step 10482/19560 | loss 3.368339 (-0.96z)| norm 0.2503 (-0.65z)| lr 3.86e-04 | 4163.43 ms | 32.4% bf16 MFU | 125728 tok/s step 10483/19560 | loss 3.363851 (-1.05z)| norm 0.2626 (+0.11z)| lr 3.86e-04 | 4162.48 ms | 32.4% bf16 MFU | 125740 tok/s step 10484/19560 | loss 3.398956 (-0.24z)| norm 0.2534 (-0.47z)| lr 3.86e-04 | 4173.19 ms | 32.4% bf16 MFU | 125734 tok/s step 10485/19560 | loss 3.501626 (+2.07z)| norm 0.2512 (-0.62z)| lr 3.86e-04 | 4162.42 ms | 32.4% bf16 MFU | 125746 tok/s step 10486/19560 | loss 3.411824 (+0.03z)| norm 0.2834 (+1.42z)| lr 3.86e-04 | 4168.81 ms | 32.4% bf16 MFU | 125747 tok/s step 10487/19560 
| loss 3.413996 (+0.07z)| norm 0.2532 (-0.50z)| lr 3.86e-04 | 4168.84 ms | 32.4% bf16 MFU | 125747 tok/s step 10488/19560 | loss 3.356007 (-1.26z)| norm 0.2785 (+1.11z)| lr 3.86e-04 | 4181.35 ms | 32.3% bf16 MFU | 125729 tok/s step 10489/19560 | loss 3.406055 (-0.12z)| norm 0.2677 (+0.42z)| lr 3.86e-04 | 4171.50 ms | 32.4% bf16 MFU | 125727 tok/s step 10490/19560 | loss 3.529540 (+2.62z)| norm 0.2966 (+2.21z)| lr 3.86e-04 | 4176.44 ms | 32.3% bf16 MFU | 125717 tok/s step 10491/19560 | loss 3.368469 (-0.98z)| norm 0.2534 (-0.52z)| lr 3.86e-04 | 4172.04 ms | 32.4% bf16 MFU | 125715 tok/s step 10492/19560 | loss 3.414911 (+0.05z)| norm 0.2628 (+0.06z)| lr 3.86e-04 | 4170.20 ms | 32.4% bf16 MFU | 125715 tok/s step 10493/19560 | loss 3.377809 (-0.77z)| norm 0.2723 (+0.66z)| lr 3.86e-04 | 4163.73 ms | 32.4% bf16 MFU | 125725 tok/s step 10494/19560 | loss 3.365510 (-1.05z)| norm 0.2645 (+0.17z)| lr 3.86e-04 | 4166.37 ms | 32.4% bf16 MFU | 125731 tok/s step 10495/19560 | loss 3.461439 (+1.10z)| norm 0.3061 (+2.73z)| lr 3.86e-04 | 4179.98 ms | 32.3% bf16 MFU | 125716 tok/s step 10496/19560 | loss 3.527879 (+2.53z)| norm 0.2548 (-0.46z)| lr 3.85e-04 | 4172.19 ms | 32.4% bf16 MFU | 125713 tok/s step 10497/19560 | loss 3.382684 (-0.65z)| norm 0.2987 (+2.22z)| lr 3.85e-04 | 4168.49 ms | 32.4% bf16 MFU | 125716 tok/s step 10498/19560 | loss 3.391752 (-0.46z)| norm 0.2465 (-0.98z)| lr 3.85e-04 | 4174.83 ms | 32.3% bf16 MFU | 125710 tok/s step 10499/19560 | loss 3.389456 (-0.50z)| norm 0.2619 (-0.05z)| lr 3.85e-04 | 4207.00 ms | 32.1% bf16 MFU | 125655 tok/s step 10500/19560 | loss 3.410615 (-0.03z)| norm 0.2425 (-1.23z)| lr 3.85e-04 | 4166.31 ms | 32.4% bf16 MFU | 125665 tok/s val loss 3.369168 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 
90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 
evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2950/10042 = 0.293766 step 10501/19560 | loss 3.305219 (-2.30z)| norm 0.2610 (-0.09z)| lr 3.85e-04 | 4174.67 ms | 32.3% bf16 MFU | 125661 tok/s step 10502/19560 | loss 3.386326 (-0.54z)| norm 0.2552 (-0.44z)| lr 3.85e-04 | 4171.49 ms | 32.4% bf16 MFU | 125662 tok/s step 10503/19560 | loss 3.338729 (-1.54z)| norm 0.2595 (-0.18z)| lr 3.85e-04 | 
4169.38 ms | 32.4% bf16 MFU | 125666 tok/s step 10504/19560 | loss 3.440752 (+0.63z)| norm 0.2507 (-0.71z)| lr 3.85e-04 | 4170.57 ms | 32.4% bf16 MFU | 125668 tok/s step 10505/19560 | loss 3.434800 (+0.49z)| norm 0.2411 (-1.29z)| lr 3.85e-04 | 5497.90 ms | 24.6% bf16 MFU | 124153 tok/s step 10506/19560 | loss 3.382537 (-0.64z)| norm 0.2425 (-1.19z)| lr 3.85e-04 | 4163.36 ms | 32.4% bf16 MFU | 124242 tok/s step 10507/19560 | loss 3.313735 (-2.07z)| norm 0.2498 (-0.73z)| lr 3.85e-04 | 4171.76 ms | 32.4% bf16 MFU | 124313 tok/s step 10508/19560 | loss 3.383738 (-0.58z)| norm 0.2411 (-1.26z)| lr 3.85e-04 | 4167.24 ms | 32.4% bf16 MFU | 124388 tok/s step 10509/19560 | loss 3.436879 (+0.55z)| norm 0.2411 (-1.25z)| lr 3.85e-04 | 4177.85 ms | 32.3% bf16 MFU | 124444 tok/s step 10510/19560 | loss 3.377583 (-0.72z)| norm 0.2420 (-1.18z)| lr 3.84e-04 | 4175.65 ms | 32.3% bf16 MFU | 124499 tok/s step 10511/19560 | loss 3.383386 (-0.60z)| norm 0.2514 (-0.56z)| lr 3.84e-04 | 4164.10 ms | 32.4% bf16 MFU | 124570 tok/s step 10512/19560 | loss 3.441223 (+0.63z)| norm 0.2443 (-1.01z)| lr 3.84e-04 | 4164.80 ms | 32.4% bf16 MFU | 124636 tok/s step 10513/19560 | loss 3.452806 (+0.88z)| norm 0.2506 (-0.60z)| lr 3.84e-04 | 4182.95 ms | 32.3% bf16 MFU | 124671 tok/s step 10514/19560 | loss 3.376057 (-0.76z)| norm 0.2574 (-0.17z)| lr 3.84e-04 | 4170.18 ms | 32.4% bf16 MFU | 124723 tok/s step 10515/19560 | loss 3.391613 (-0.43z)| norm 0.2600 (-0.01z)| lr 3.84e-04 | 4165.17 ms | 32.4% bf16 MFU | 124781 tok/s step 10516/19560 | loss 3.369103 (-0.90z)| norm 0.2558 (-0.28z)| lr 3.84e-04 | 4165.96 ms | 32.4% bf16 MFU | 124834 tok/s step 10517/19560 | loss 3.391059 (-0.44z)| norm 0.2551 (-0.33z)| lr 3.84e-04 | 4182.64 ms | 32.3% bf16 MFU | 124860 tok/s step 10518/19560 | loss 3.452113 (+0.86z)| norm 0.2340 (-1.66z)| lr 3.84e-04 | 4173.32 ms | 32.4% bf16 MFU | 124898 tok/s step 10519/19560 | loss 3.385247 (-0.57z)| norm 0.2451 (-0.92z)| lr 3.84e-04 | 4172.07 ms | 32.4% bf16 MFU | 124937 tok/s step 
10520/19560 | loss 3.367790 (-0.94z)| norm 0.2303 (-1.85z)| lr 3.84e-04 | 4177.87 ms | 32.3% bf16 MFU | 124965 tok/s step 10521/19560 | loss 3.474290 (+1.32z)| norm 0.2441 (-0.95z)| lr 3.84e-04 | 4176.83 ms | 32.3% bf16 MFU | 124993 tok/s step 10522/19560 | loss 3.387327 (-0.52z)| norm 0.2401 (-1.19z)| lr 3.84e-04 | 4167.41 ms | 32.4% bf16 MFU | 125033 tok/s step 10523/19560 | loss 3.410693 (-0.03z)| norm 0.2347 (-1.51z)| lr 3.84e-04 | 4176.45 ms | 32.3% bf16 MFU | 125058 tok/s step 10524/19560 | loss 3.509789 (+2.04z)| norm 0.2426 (-1.00z)| lr 3.84e-04 | 4171.17 ms | 32.4% bf16 MFU | 125090 tok/s step 10525/19560 | loss 3.543653 (+2.65z)| norm 0.2815 (+1.43z)| lr 3.83e-04 | 4162.47 ms | 32.4% bf16 MFU | 125133 tok/s step 10526/19560 | loss 3.409825 (-0.09z)| norm 0.2701 (+0.71z)| lr 3.83e-04 | 4172.43 ms | 32.4% bf16 MFU | 125159 tok/s step 10527/19560 | loss 3.548566 (+2.66z)| norm 0.2730 (+0.88z)| lr 3.83e-04 | 4170.87 ms | 32.4% bf16 MFU | 125187 tok/s step 10528/19560 | loss 3.348064 (-1.36z)| norm 0.2707 (+0.72z)| lr 3.83e-04 | 4170.30 ms | 32.4% bf16 MFU | 125213 tok/s step 10529/19560 | loss 3.375895 (-0.79z)| norm 0.2423 (-1.04z)| lr 3.83e-04 | 4189.49 ms | 32.2% bf16 MFU | 125210 tok/s step 10530/19560 | loss 3.419353 (+0.07z)| norm 0.2531 (-0.36z)| lr 3.83e-04 | 4174.83 ms | 32.3% bf16 MFU | 125228 tok/s step 10531/19560 | loss 3.425355 (+0.18z)| norm 0.2605 (+0.09z)| lr 3.83e-04 | 4174.23 ms | 32.3% bf16 MFU | 125247 tok/s step 10532/19560 | loss 3.401366 (-0.30z)| norm 0.2522 (-0.42z)| lr 3.83e-04 | 4168.87 ms | 32.4% bf16 MFU | 125273 tok/s step 10533/19560 | loss 3.454529 (+0.77z)| norm 0.2501 (-0.55z)| lr 3.83e-04 | 4171.84 ms | 32.4% bf16 MFU | 125293 tok/s step 10534/19560 | loss 3.362228 (-1.07z)| norm 0.2542 (-0.28z)| lr 3.83e-04 | 4173.62 ms | 32.4% bf16 MFU | 125309 tok/s step 10535/19560 | loss 3.363997 (-1.02z)| norm 0.2489 (-0.61z)| lr 3.83e-04 | 4170.48 ms | 32.4% bf16 MFU | 125329 tok/s step 10536/19560 | loss 3.430592 (+0.30z)| norm 
0.2520 (-0.42z)| lr 3.83e-04 | 4175.51 ms | 32.3% bf16 MFU | 125341 tok/s step 10537/19560 | loss 3.418272 (+0.06z)| norm 0.2592 (+0.05z)| lr 3.83e-04 | 4175.62 ms | 32.3% bf16 MFU | 125352 tok/s step 10538/19560 | loss 3.439661 (+0.49z)| norm 0.2317 (-1.68z)| lr 3.83e-04 | 4161.59 ms | 32.4% bf16 MFU | 125384 tok/s step 10539/19560 | loss 3.380881 (-0.71z)| norm 0.2647 (+0.40z)| lr 3.83e-04 | 4170.53 ms | 32.4% bf16 MFU | 125400 tok/s step 10540/19560 | loss 3.406800 (-0.18z)| norm 0.2537 (-0.28z)| lr 3.82e-04 | 4171.11 ms | 32.4% bf16 MFU | 125415 tok/s step 10541/19560 | loss 3.403728 (-0.25z)| norm 0.2310 (-1.71z)| lr 3.82e-04 | 4164.68 ms | 32.4% bf16 MFU | 125438 tok/s step 10542/19560 | loss 3.439842 (+0.49z)| norm 0.2714 (+0.83z)| lr 3.82e-04 | 4172.44 ms | 32.4% bf16 MFU | 125449 tok/s step 10543/19560 | loss 3.347133 (-1.39z)| norm 0.2618 (+0.22z)| lr 3.82e-04 | 4172.53 ms | 32.4% bf16 MFU | 125459 tok/s step 10544/19560 | loss 3.370406 (-0.91z)| norm 0.2436 (-0.91z)| lr 3.82e-04 | 4171.81 ms | 32.4% bf16 MFU | 125470 tok/s step 10545/19560 | loss 3.485731 (+1.43z)| norm 0.2503 (-0.48z)| lr 3.82e-04 | 4174.74 ms | 32.3% bf16 MFU | 125476 tok/s step 10546/19560 | loss 3.407948 (-0.16z)| norm 0.2435 (-0.90z)| lr 3.82e-04 | 4167.18 ms | 32.4% bf16 MFU | 125493 tok/s step 10547/19560 | loss 3.359102 (-1.14z)| norm 0.2766 (+1.18z)| lr 3.82e-04 | 4178.73 ms | 32.3% bf16 MFU | 125491 tok/s step 10548/19560 | loss 3.347905 (-1.35z)| norm 0.2424 (-0.96z)| lr 3.82e-04 | 4164.30 ms | 32.4% bf16 MFU | 125512 tok/s step 10549/19560 | loss 3.406092 (-0.17z)| norm 0.2401 (-1.10z)| lr 3.82e-04 | 4175.06 ms | 32.3% bf16 MFU | 125515 tok/s step 10550/19560 | loss 3.504116 (+1.79z)| norm 0.2502 (-0.45z)| lr 3.82e-04 | 4161.78 ms | 32.4% bf16 MFU | 125538 tok/s step 10551/19560 | loss 3.447260 (+0.64z)| norm 0.2632 (+0.38z)| lr 3.82e-04 | 4168.06 ms | 32.4% bf16 MFU | 125551 tok/s step 10552/19560 | loss 3.356668 (-1.15z)| norm 0.2410 (-1.06z)| lr 3.82e-04 | 4190.44 ms | 
32.2% bf16 MFU | 125529 tok/s step 10553/19560 | loss 3.408148 (-0.12z)| norm 0.2620 (+0.30z)| lr 3.82e-04 | 4165.42 ms | 32.4% bf16 MFU | 125546 tok/s step 10554/19560 | loss 3.478934 (+1.28z)| norm 0.2398 (-1.12z)| lr 3.82e-04 | 4188.52 ms | 32.2% bf16 MFU | 125527 tok/s step 10555/19560 | loss 3.462366 (+0.95z)| norm 0.2897 (+2.03z)| lr 3.81e-04 | 4169.55 ms | 32.4% bf16 MFU | 125538 tok/s step 10556/19560 | loss 3.460196 (+0.90z)| norm 0.2693 (+0.74z)| lr 3.81e-04 | 4167.64 ms | 32.4% bf16 MFU | 125551 tok/s step 10557/19560 | loss 3.313143 (-1.99z)| norm 0.2483 (-0.59z)| lr 3.81e-04 | 4163.90 ms | 32.4% bf16 MFU | 125569 tok/s step 10558/19560 | loss 3.371061 (-0.86z)| norm 0.2662 (+0.53z)| lr 3.81e-04 | 4165.33 ms | 32.4% bf16 MFU | 125584 tok/s step 10559/19560 | loss 3.458371 (+0.89z)| norm 0.2299 (-1.77z)| lr 3.81e-04 | 4164.22 ms | 32.4% bf16 MFU | 125600 tok/s step 10560/19560 | loss 3.436129 (+0.44z)| norm 0.2632 (+0.38z)| lr 3.81e-04 | 4165.01 ms | 32.4% bf16 MFU | 125614 tok/s step 10561/19560 | loss 3.395196 (-0.37z)| norm 0.2311 (-1.65z)| lr 3.81e-04 | 4199.03 ms | 32.2% bf16 MFU | 125576 tok/s step 10562/19560 | loss 3.352211 (-1.24z)| norm 0.2454 (-0.74z)| lr 3.81e-04 | 4169.81 ms | 32.4% bf16 MFU | 125584 tok/s step 10563/19560 | loss 3.373215 (-0.81z)| norm 0.2335 (-1.49z)| lr 3.81e-04 | 4174.23 ms | 32.3% bf16 MFU | 125585 tok/s step 10564/19560 | loss 3.404123 (-0.18z)| norm 0.2680 (+0.72z)| lr 3.81e-04 | 4172.24 ms | 32.4% bf16 MFU | 125589 tok/s step 10565/19560 | loss 3.484603 (+1.42z)| norm 0.2493 (-0.48z)| lr 3.81e-04 | 4173.53 ms | 32.4% bf16 MFU | 125590 tok/s step 10566/19560 | loss 3.356335 (-1.15z)| norm 0.2637 (+0.44z)| lr 3.81e-04 | 4172.31 ms | 32.4% bf16 MFU | 125594 tok/s step 10567/19560 | loss 3.423087 (+0.18z)| norm 0.2773 (+1.30z)| lr 3.81e-04 | 4166.10 ms | 32.4% bf16 MFU | 125607 tok/s step 10568/19560 | loss 3.398955 (-0.30z)| norm 0.2323 (-1.57z)| lr 3.81e-04 | 4171.19 ms | 32.4% bf16 MFU | 125611 tok/s step 10569/19560 
| loss 3.370160 (-0.86z)| norm 0.2653 (+0.53z)| lr 3.81e-04 | 4165.65 ms | 32.4% bf16 MFU | 125623 tok/s step 10570/19560 | loss 3.455414 (+0.86z)| norm 0.2484 (-0.55z)| lr 3.80e-04 | 4168.72 ms | 32.4% bf16 MFU | 125630 tok/s step 10571/19560 | loss 3.377671 (-0.71z)| norm 0.2421 (-0.93z)| lr 3.80e-04 | 4171.90 ms | 32.4% bf16 MFU | 125633 tok/s step 10572/19560 | loss 3.391486 (-0.41z)| norm 0.2514 (-0.33z)| lr 3.80e-04 | 4170.53 ms | 32.4% bf16 MFU | 125637 tok/s step 10573/19560 | loss 3.413094 (+0.04z)| norm 0.2410 (-0.98z)| lr 3.80e-04 | 4168.70 ms | 32.4% bf16 MFU | 125643 tok/s step 10574/19560 | loss 3.428366 (+0.38z)| norm 0.2491 (-0.45z)| lr 3.80e-04 | 4171.33 ms | 32.4% bf16 MFU | 125645 tok/s step 10575/19560 | loss 3.440424 (+0.63z)| norm 0.2406 (-1.00z)| lr 3.80e-04 | 4180.84 ms | 32.3% bf16 MFU | 125633 tok/s step 10576/19560 | loss 3.324995 (-1.77z)| norm 0.2498 (-0.39z)| lr 3.80e-04 | 4167.23 ms | 32.4% bf16 MFU | 125642 tok/s step 10577/19560 | loss 3.429544 (+0.40z)| norm 0.2828 (+1.71z)| lr 3.80e-04 | 4169.47 ms | 32.4% bf16 MFU | 125647 tok/s step 10578/19560 | loss 3.426708 (+0.34z)| norm 0.2469 (-0.60z)| lr 3.80e-04 | 4161.90 ms | 32.4% bf16 MFU | 125664 tok/s step 10579/19560 | loss 3.475009 (+1.34z)| norm 0.2652 (+0.57z)| lr 3.80e-04 | 4168.94 ms | 32.4% bf16 MFU | 125668 tok/s step 10580/19560 | loss 3.400625 (-0.21z)| norm 0.2686 (+0.78z)| lr 3.80e-04 | 4176.52 ms | 32.3% bf16 MFU | 125662 tok/s step 10581/19560 | loss 3.397914 (-0.27z)| norm 0.2563 (-0.02z)| lr 3.80e-04 | 4170.08 ms | 32.4% bf16 MFU | 125665 tok/s step 10582/19560 | loss 3.443364 (+0.68z)| norm 0.2425 (-0.89z)| lr 3.80e-04 | 4164.50 ms | 32.4% bf16 MFU | 125676 tok/s step 10583/19560 | loss 3.437904 (+0.59z)| norm 0.2694 (+0.86z)| lr 3.80e-04 | 4162.72 ms | 32.4% bf16 MFU | 125690 tok/s step 10584/19560 | loss 3.373801 (-0.75z)| norm 0.2460 (-0.67z)| lr 3.79e-04 | 4173.35 ms | 32.4% bf16 MFU | 125687 tok/s step 10585/19560 | loss 3.424227 (+0.33z)| norm 0.2503 (-0.37z)| 
lr 3.79e-04 | 4170.09 ms | 32.4% bf16 MFU | 125689 tok/s step 10586/19560 | loss 3.397693 (-0.24z)| norm 0.2775 (+1.40z)| lr 3.79e-04 | 4167.44 ms | 32.4% bf16 MFU | 125695 tok/s step 10587/19560 | loss 3.331798 (-1.63z)| norm 0.2436 (-0.82z)| lr 3.79e-04 | 4176.16 ms | 32.3% bf16 MFU | 125687 tok/s step 10588/19560 | loss 3.401924 (-0.13z)| norm 0.2801 (+1.62z)| lr 3.79e-04 | 4175.16 ms | 32.3% bf16 MFU | 125681 tok/s step 10589/19560 | loss 3.411409 (+0.08z)| norm 0.2437 (-0.80z)| lr 3.79e-04 | 4168.62 ms | 32.4% bf16 MFU | 125686 tok/s step 10590/19560 | loss 3.396898 (-0.22z)| norm 0.2723 (+1.15z)| lr 3.79e-04 | 4173.12 ms | 32.4% bf16 MFU | 125683 tok/s step 10591/19560 | loss 3.356092 (-1.09z)| norm 0.2420 (-0.91z)| lr 3.79e-04 | 4159.40 ms | 32.5% bf16 MFU | 125701 tok/s step 10592/19560 | loss 3.383192 (-0.49z)| norm 0.2669 (+0.78z)| lr 3.79e-04 | 4189.42 ms | 32.2% bf16 MFU | 125674 tok/s step 10593/19560 | loss 3.392995 (-0.28z)| norm 0.2526 (-0.19z)| lr 3.79e-04 | 4160.78 ms | 32.5% bf16 MFU | 125690 tok/s step 10594/19560 | loss 3.358666 (-1.02z)| norm 0.2775 (+1.48z)| lr 3.79e-04 | 4708.67 ms | 28.7% bf16 MFU | 124973 tok/s step 10595/19560 | loss 3.398136 (-0.16z)| norm 0.2431 (-0.83z)| lr 3.79e-04 | 4575.21 ms | 29.5% bf16 MFU | 124454 tok/s step 10596/19560 | loss 3.344520 (-1.31z)| norm 0.2759 (+1.37z)| lr 3.79e-04 | 4441.02 ms | 30.4% bf16 MFU | 124134 tok/s step 10597/19560 | loss 3.327568 (-1.65z)| norm 0.2461 (-0.65z)| lr 3.79e-04 | 4290.85 ms | 31.5% bf16 MFU | 124037 tok/s step 10598/19560 | loss 3.353496 (-1.08z)| norm 0.2968 (+2.70z)| lr 3.79e-04 | 4250.09 ms | 31.8% bf16 MFU | 124003 tok/s step 10599/19560 | loss 3.376189 (-0.60z)| norm 0.2497 (-0.41z)| lr 3.78e-04 | 4504.16 ms | 30.0% bf16 MFU | 123623 tok/s step 10600/19560 | loss 3.370672 (-0.72z)| norm 0.2492 (-0.45z)| lr 3.78e-04 | 4275.47 ms | 31.6% bf16 MFU | 123573 tok/s step 10601/19560 | loss 3.405361 (+0.02z)| norm 0.2458 (-0.67z)| lr 3.78e-04 | 4686.17 ms | 28.8% bf16 MFU | 
122988 tok/s step 10602/19560 | loss 3.323931 (-1.68z)| norm 0.2489 (-0.46z)| lr 3.78e-04 | 4294.18 ms | 31.4% bf16 MFU | 122944 tok/s step 10603/19560 | loss 3.328458 (-1.56z)| norm 0.2673 (+0.76z)| lr 3.78e-04 | 4211.81 ms | 32.1% bf16 MFU | 123020 tok/s step 10604/19560 | loss 3.400805 (-0.05z)| norm 0.2632 (+0.48z)| lr 3.78e-04 | 4250.65 ms | 31.8% bf16 MFU | 123037 tok/s step 10605/19560 | loss 3.374328 (-0.61z)| norm 0.2577 (+0.12z)| lr 3.78e-04 | 4211.87 ms | 32.1% bf16 MFU | 123109 tok/s step 10606/19560 | loss 3.402885 (-0.01z)| norm 0.2478 (-0.53z)| lr 3.78e-04 | 4183.32 ms | 32.3% bf16 MFU | 123220 tok/s step 10607/19560 | loss 3.405983 (+0.05z)| norm 0.2678 (+0.79z)| lr 3.78e-04 | 4170.46 ms | 32.4% bf16 MFU | 123344 tok/s step 10608/19560 | loss 3.399245 (-0.09z)| norm 0.2942 (+2.45z)| lr 3.78e-04 | 4195.39 ms | 32.2% bf16 MFU | 123426 tok/s step 10609/19560 | loss 3.377765 (-0.54z)| norm 0.2491 (-0.44z)| lr 3.78e-04 | 4167.08 ms | 32.4% bf16 MFU | 123545 tok/s step 10610/19560 | loss 3.345750 (-1.20z)| norm 0.2783 (+1.41z)| lr 3.78e-04 | 4212.54 ms | 32.1% bf16 MFU | 123591 tok/s step 10611/19560 | loss 3.341924 (-1.27z)| norm 0.2664 (+0.65z)| lr 3.78e-04 | 4175.55 ms | 32.3% bf16 MFU | 123689 tok/s step 10612/19560 | loss 3.387515 (-0.32z)| norm 0.2762 (+1.26z)| lr 3.78e-04 | 4230.89 ms | 31.9% bf16 MFU | 123701 tok/s step 10613/19560 | loss 3.311328 (-1.88z)| norm 0.2747 (+1.14z)| lr 3.78e-04 | 4286.49 ms | 31.5% bf16 MFU | 123631 tok/s step 10614/19560 | loss 3.363679 (-0.78z)| norm 0.2607 (+0.27z)| lr 3.77e-04 | 4171.67 ms | 32.4% bf16 MFU | 123734 tok/s step 10615/19560 | loss 3.304255 (-1.96z)| norm 0.2564 (-0.01z)| lr 3.77e-04 | 4170.56 ms | 32.4% bf16 MFU | 123833 tok/s step 10616/19560 | loss 3.364711 (-0.73z)| norm 0.2656 (+0.59z)| lr 3.77e-04 | 4269.86 ms | 31.6% bf16 MFU | 123780 tok/s step 10617/19560 | loss 3.398866 (-0.03z)| norm 0.2662 (+0.63z)| lr 3.77e-04 | 4180.10 ms | 32.3% bf16 MFU | 123863 tok/s step 10618/19560 | loss 3.383029 
(-0.34z)| norm 0.2517 (-0.28z)| lr 3.77e-04 | 4206.46 ms | 32.1% bf16 MFU | 123901 tok/s step 10619/19560 | loss 3.449550 (+1.04z)| norm 0.2753 (+1.26z)| lr 3.77e-04 | 4201.12 ms | 32.1% bf16 MFU | 123946 tok/s step 10620/19560 | loss 3.428919 (+0.61z)| norm 0.3135 (+3.55z)| lr 3.77e-04 | 4163.53 ms | 32.4% bf16 MFU | 124045 tok/s step 10621/19560 | loss 3.468189 (+1.41z)| norm 0.2762 (+1.22z)| lr 3.77e-04 | 4168.91 ms | 32.4% bf16 MFU | 124131 tok/s step 10622/19560 | loss 3.413171 (+0.25z)| norm 0.2456 (-0.68z)| lr 3.77e-04 | 4188.52 ms | 32.2% bf16 MFU | 124183 tok/s step 10623/19560 | loss 3.443536 (+0.89z)| norm 0.8111 (+10.74z)| lr 3.77e-04 | 4168.45 ms | 32.4% bf16 MFU | 124263 tok/s step 10624/19560 | loss 3.384632 (-0.33z)| norm 0.3014 (+0.79z)| lr 3.77e-04 | 4186.76 ms | 32.2% bf16 MFU | 124311 tok/s step 10625/19560 | loss 3.424201 (+0.52z)| norm 0.3073 (+0.90z)| lr 3.77e-04 | 4180.92 ms | 32.3% bf16 MFU | 124365 tok/s step 10626/19560 | loss 3.401797 (+0.04z)| norm 0.2712 (+0.20z)| lr 3.77e-04 | 4218.70 ms | 32.0% bf16 MFU | 124361 tok/s step 10627/19560 | loss 3.371652 (-0.61z)| norm 0.2468 (-0.27z)| lr 3.77e-04 | 4178.57 ms | 32.3% bf16 MFU | 124416 tok/s step 10628/19560 | loss 3.349489 (-1.07z)| norm 0.2718 (+0.21z)| lr 3.77e-04 | 4174.06 ms | 32.3% bf16 MFU | 124476 tok/s step 10629/19560 | loss 3.415111 (+0.32z)| norm 0.2470 (-0.27z)| lr 3.76e-04 | 4162.78 ms | 32.4% bf16 MFU | 124549 tok/s step 10630/19560 | loss 3.400111 (-0.01z)| norm 0.2493 (-0.23z)| lr 3.76e-04 | 4164.34 ms | 32.4% bf16 MFU | 124617 tok/s step 10631/19560 | loss 3.387546 (-0.29z)| norm 0.2450 (-0.31z)| lr 3.76e-04 | 4229.53 ms | 31.9% bf16 MFU | 124584 tok/s step 10632/19560 | loss 3.365505 (-0.76z)| norm 0.2261 (-0.67z)| lr 3.76e-04 | 4208.55 ms | 32.1% bf16 MFU | 124584 tok/s step 10633/19560 | loss 3.376860 (-0.50z)| norm 0.2728 (+0.23z)| lr 3.76e-04 | 4172.30 ms | 32.4% bf16 MFU | 124637 tok/s step 10634/19560 | loss 3.347917 (-1.13z)| norm 0.2318 (-0.56z)| lr 3.76e-04 | 
4171.03 ms | 32.4% bf16 MFU | 124690 tok/s step 10635/19560 | loss 3.407925 (+0.17z)| norm 0.2634 (+0.05z)| lr 3.76e-04 | 4171.89 ms | 32.4% bf16 MFU | 124739 tok/s step 10636/19560 | loss 3.357235 (-0.95z)| norm 0.2360 (-0.48z)| lr 3.76e-04 | 4216.58 ms | 32.0% bf16 MFU | 124719 tok/s step 10637/19560 | loss 3.446268 (+1.02z)| norm 0.2511 (-0.19z)| lr 3.76e-04 | 4174.22 ms | 32.3% bf16 MFU | 124764 tok/s step 10638/19560 | loss 3.385346 (-0.33z)| norm 0.2231 (-0.73z)| lr 3.76e-04 | 4173.17 ms | 32.4% bf16 MFU | 124807 tok/s step 10639/19560 | loss 3.450964 (+1.10z)| norm 0.2394 (-0.41z)| lr 3.76e-04 | 4165.67 ms | 32.4% bf16 MFU | 124860 tok/s step 10640/19560 | loss 3.390907 (-0.21z)| norm 0.2528 (-0.15z)| lr 3.76e-04 | 4167.68 ms | 32.4% bf16 MFU | 124907 tok/s step 10641/19560 | loss 3.351073 (-1.07z)| norm 0.2353 (-0.49z)| lr 3.76e-04 | 4177.67 ms | 32.3% bf16 MFU | 124936 tok/s step 10642/19560 | loss 3.456700 (+1.24z)| norm 0.2779 (+0.33z)| lr 3.76e-04 | 4336.14 ms | 31.1% bf16 MFU | 124735 tok/s step 10643/19560 | loss 3.391712 (-0.19z)| norm 0.2607 (-0.00z)| lr 3.76e-04 | 4173.83 ms | 32.3% bf16 MFU | 124779 tok/s step 10644/19560 | loss 3.346139 (-1.18z)| norm 0.2426 (-0.35z)| lr 3.75e-04 | 4178.19 ms | 32.3% bf16 MFU | 124814 tok/s step 10645/19560 | loss 3.327495 (-1.56z)| norm 0.2276 (-0.63z)| lr 3.75e-04 | 4224.83 ms | 32.0% bf16 MFU | 124778 tok/s step 10646/19560 | loss 3.394434 (-0.10z)| norm 0.2485 (-0.23z)| lr 3.75e-04 | 4198.83 ms | 32.2% bf16 MFU | 124782 tok/s step 10647/19560 | loss 3.415026 (+0.34z)| norm 0.2257 (-0.67z)| lr 3.75e-04 | 4281.55 ms | 31.5% bf16 MFU | 124666 tok/s step 10648/19560 | loss 3.416628 (+0.37z)| norm 0.2964 (+0.68z)| lr 3.75e-04 | 4177.64 ms | 32.3% bf16 MFU | 124708 tok/s step 10649/19560 | loss 3.447994 (+1.06z)| norm 0.2803 (+0.37z)| lr 3.75e-04 | 4184.52 ms | 32.3% bf16 MFU | 124737 tok/s step 10650/19560 | loss 3.347815 (-1.12z)| norm 0.2683 (+0.13z)| lr 3.75e-04 | 4167.21 ms | 32.4% bf16 MFU | 124791 tok/s step 
10651/19560 | loss 3.345184 (-1.16z)| norm 0.2521 (-0.18z)| lr 3.75e-04 | 4176.88 ms | 32.3% bf16 MFU | 124827 tok/s step 10652/19560 | loss 3.363838 (-0.75z)| norm 0.2416 (-0.39z)| lr 3.75e-04 | 4167.36 ms | 32.4% bf16 MFU | 124876 tok/s step 10653/19560 | loss 3.329077 (-1.53z)| norm 0.2511 (-0.20z)| lr 3.75e-04 | 4201.08 ms | 32.1% bf16 MFU | 124872 tok/s step 10654/19560 | loss 3.359816 (-0.82z)| norm 0.2611 (-0.00z)| lr 3.75e-04 | 4201.47 ms | 32.1% bf16 MFU | 124868 tok/s step 10655/19560 | loss 3.424639 (+0.72z)| norm 0.2464 (-0.28z)| lr 3.75e-04 | 4186.08 ms | 32.3% bf16 MFU | 124887 tok/s step 10656/19560 | loss 3.415308 (+0.49z)| norm 0.2573 (-0.07z)| lr 3.75e-04 | 4172.57 ms | 32.4% bf16 MFU | 124925 tok/s step 10657/19560 | loss 3.417758 (+0.54z)| norm 0.2272 (-0.65z)| lr 3.75e-04 | 4170.03 ms | 32.4% bf16 MFU | 124965 tok/s step 10658/19560 | loss 3.349988 (-1.08z)| norm 0.2527 (-0.16z)| lr 3.74e-04 | 4167.26 ms | 32.4% bf16 MFU | 125008 tok/s step 10659/19560 | loss 3.373631 (-0.50z)| norm 0.2282 (-0.62z)| lr 3.74e-04 | 4177.41 ms | 32.3% bf16 MFU | 125032 tok/s step 10660/19560 | loss 3.364994 (-0.70z)| norm 0.2603 (-0.01z)| lr 3.74e-04 | 4175.90 ms | 32.3% bf16 MFU | 125058 tok/s step 10661/19560 | loss 3.389012 (-0.11z)| norm 0.2379 (-0.44z)| lr 3.74e-04 | 4178.29 ms | 32.3% bf16 MFU | 125079 tok/s step 10662/19560 | loss 3.310554 (-1.98z)| norm 0.2653 (+0.09z)| lr 3.74e-04 | 4173.18 ms | 32.4% bf16 MFU | 125107 tok/s step 10663/19560 | loss 3.338929 (-1.29z)| norm 0.2535 (-0.14z)| lr 3.74e-04 | 4189.94 ms | 32.2% bf16 MFU | 125108 tok/s step 10664/19560 | loss 3.378746 (-0.33z)| norm 0.2353 (-0.49z)| lr 3.74e-04 | 4170.55 ms | 32.4% bf16 MFU | 125138 tok/s step 10665/19560 | loss 3.345617 (-1.10z)| norm 0.2537 (-0.13z)| lr 3.74e-04 | 4175.05 ms | 32.3% bf16 MFU | 125160 tok/s step 10666/19560 | loss 3.317199 (-1.74z)| norm 0.2549 (-0.11z)| lr 3.74e-04 | 4178.07 ms | 32.3% bf16 MFU | 125177 tok/s step 10667/19560 | loss 3.374720 (-0.38z)| norm 
0.2367 (-0.46z)| lr 3.74e-04 | 4169.14 ms | 32.4% bf16 MFU | 125205 tok/s step 10668/19560 | loss 3.344432 (-1.08z)| norm 0.2594 (-0.02z)| lr 3.74e-04 | 4178.21 ms | 32.3% bf16 MFU | 125219 tok/s step 10669/19560 | loss 3.407019 (+0.39z)| norm 0.2454 (-0.29z)| lr 3.74e-04 | 4177.55 ms | 32.3% bf16 MFU | 125233 tok/s step 10670/19560 | loss 3.432942 (+1.00z)| norm 0.2732 (+0.24z)| lr 3.74e-04 | 4169.46 ms | 32.4% bf16 MFU | 125259 tok/s step 10671/19560 | loss 3.457717 (+1.55z)| norm 0.2399 (-0.40z)| lr 3.74e-04 | 4173.09 ms | 32.4% bf16 MFU | 125278 tok/s step 10672/19560 | loss 3.329617 (-1.43z)| norm 0.2685 (+0.15z)| lr 3.74e-04 | 4172.15 ms | 32.4% bf16 MFU | 125297 tok/s step 10673/19560 | loss 3.391242 (+0.02z)| norm 0.2601 (-0.01z)| lr 3.73e-04 | 4216.38 ms | 32.0% bf16 MFU | 125249 tok/s step 10674/19560 | loss 3.402097 (+0.28z)| norm 0.2597 (-0.02z)| lr 3.73e-04 | 4175.75 ms | 32.3% bf16 MFU | 125265 tok/s step 10675/19560 | loss 3.371595 (-0.45z)| norm 0.2454 (-0.29z)| lr 3.73e-04 | 4174.84 ms | 32.3% bf16 MFU | 125281 tok/s step 10676/19560 | loss 3.409638 (+0.45z)| norm 0.2404 (-0.39z)| lr 3.73e-04 | 4178.44 ms | 32.3% bf16 MFU | 125290 tok/s step 10677/19560 | loss 3.348504 (-0.99z)| norm 0.2775 (+0.32z)| lr 3.73e-04 | 4183.36 ms | 32.3% bf16 MFU | 125292 tok/s step 10678/19560 | loss 3.381226 (-0.20z)| norm 0.2367 (-0.46z)| lr 3.73e-04 | 4177.00 ms | 32.3% bf16 MFU | 125303 tok/s step 10679/19560 | loss 3.518541 (+3.05z)| norm 0.2540 (-0.13z)| lr 3.73e-04 | 4172.02 ms | 32.4% bf16 MFU | 125322 tok/s step 10680/19560 | loss 3.424933 (+0.82z)| norm 0.2737 (+0.24z)| lr 3.73e-04 | 4161.65 ms | 32.4% bf16 MFU | 125355 tok/s step 10681/19560 | loss 3.347945 (-1.00z)| norm 0.2577 (-0.06z)| lr 3.73e-04 | 4172.26 ms | 32.4% bf16 MFU | 125370 tok/s step 10682/19560 | loss 3.420784 (+0.75z)| norm 0.2719 (+0.20z)| lr 3.73e-04 | 4167.99 ms | 32.4% bf16 MFU | 125391 tok/s step 10683/19560 | loss 3.351973 (-0.89z)| norm 0.2720 (+0.21z)| lr 3.73e-04 | 4166.23 ms | 
32.4% bf16 MFU | 125413 tok/s step 10684/19560 | loss 3.335549 (-1.27z)| norm 0.2579 (-0.06z)| lr 3.73e-04 | 4179.20 ms | 32.3% bf16 MFU | 125415 tok/s step 10685/19560 | loss 3.318527 (-1.69z)| norm 0.2474 (-0.26z)| lr 3.73e-04 | 4174.01 ms | 32.3% bf16 MFU | 125425 tok/s step 10686/19560 | loss 3.417761 (+0.72z)| norm 0.2425 (-0.35z)| lr 3.73e-04 | 4181.90 ms | 32.3% bf16 MFU | 125422 tok/s step 10687/19560 | loss 3.348909 (-0.94z)| norm 0.2435 (-0.34z)| lr 3.73e-04 | 4163.95 ms | 32.4% bf16 MFU | 125447 tok/s step 10688/19560 | loss 3.373722 (-0.32z)| norm 0.2576 (-0.06z)| lr 3.72e-04 | 4175.78 ms | 32.3% bf16 MFU | 125452 tok/s step 10689/19560 | loss 3.396577 (+0.24z)| norm 0.2436 (-0.33z)| lr 3.72e-04 | 4168.56 ms | 32.4% bf16 MFU | 125468 tok/s step 10690/19560 | loss 3.389806 (+0.07z)| norm 0.2774 (+0.31z)| lr 3.72e-04 | 4164.03 ms | 32.4% bf16 MFU | 125490 tok/s step 10691/19560 | loss 3.355703 (-0.77z)| norm 0.2455 (-0.31z)| lr 3.72e-04 | 4177.31 ms | 32.3% bf16 MFU | 125491 tok/s step 10692/19560 | loss 3.430715 (+1.07z)| norm 0.2577 (-0.07z)| lr 3.72e-04 | 4165.09 ms | 32.4% bf16 MFU | 125510 tok/s step 10693/19560 | loss 3.366322 (-0.50z)| norm 0.2660 (+0.09z)| lr 3.72e-04 | 4227.07 ms | 31.9% bf16 MFU | 125436 tok/s step 10694/19560 | loss 3.383384 (-0.08z)| norm 0.2400 (-0.41z)| lr 3.72e-04 | 4172.57 ms | 32.4% bf16 MFU | 125447 tok/s step 10695/19560 | loss 3.379201 (-0.18z)| norm 0.2453 (-0.30z)| lr 3.72e-04 | 4269.46 ms | 31.6% bf16 MFU | 125315 tok/s step 10696/19560 | loss 3.339910 (-1.15z)| norm 0.2454 (-0.30z)| lr 3.72e-04 | 4175.18 ms | 32.3% bf16 MFU | 125328 tok/s step 10697/19560 | loss 3.373138 (-0.32z)| norm 0.2429 (-0.35z)| lr 3.72e-04 | 4166.96 ms | 32.4% bf16 MFU | 125352 tok/s step 10698/19560 | loss 3.331023 (-1.36z)| norm 0.2371 (-0.46z)| lr 3.72e-04 | 4174.88 ms | 32.3% bf16 MFU | 125364 tok/s step 10699/19560 | loss 3.399830 (+0.38z)| norm 0.2318 (-0.56z)| lr 3.72e-04 | 4172.12 ms | 32.4% bf16 MFU | 125379 tok/s step 10700/19560 
| loss 3.340992 (-1.10z)| norm 0.2442 (-0.32z)| lr 3.72e-04 | 4181.70 ms | 32.3% bf16 MFU | 125379 tok/s step 10701/19560 | loss 3.396237 (+0.30z)| norm 0.2341 (-0.51z)| lr 3.72e-04 | 4173.07 ms | 32.4% bf16 MFU | 125392 tok/s step 10702/19560 | loss 3.399749 (+0.39z)| norm 0.2314 (-0.56z)| lr 3.72e-04 | 4172.84 ms | 32.4% bf16 MFU | 125404 tok/s step 10703/19560 | loss 3.399315 (+0.39z)| norm 0.2620 (+0.03z)| lr 3.71e-04 | 4169.77 ms | 32.4% bf16 MFU | 125421 tok/s step 10704/19560 | loss 3.375729 (-0.22z)| norm 0.2525 (-0.16z)| lr 3.71e-04 | 4179.43 ms | 32.3% bf16 MFU | 125422 tok/s step 10705/19560 | loss 3.416744 (+0.84z)| norm 0.2373 (-0.44z)| lr 3.71e-04 | 4166.86 ms | 32.4% bf16 MFU | 125442 tok/s step 10706/19560 | loss 3.409984 (+0.67z)| norm 0.2731 (+0.24z)| lr 3.71e-04 | 4171.43 ms | 32.4% bf16 MFU | 125454 tok/s step 10707/19560 | loss 3.355105 (-0.74z)| norm 0.5449 (+4.90z)| lr 3.71e-04 | 4178.62 ms | 32.3% bf16 MFU | 125455 tok/s step 10708/19560 | loss 3.419206 (+0.95z)| norm 0.2657 (+0.05z)| lr 3.71e-04 | 4173.79 ms | 32.3% bf16 MFU | 125463 tok/s step 10709/19560 | loss 3.425656 (+1.11z)| norm 0.2653 (+0.04z)| lr 3.71e-04 | 4168.26 ms | 32.4% bf16 MFU | 125479 tok/s step 10710/19560 | loss 3.371772 (-0.29z)| norm 0.2643 (+0.02z)| lr 3.71e-04 | 4176.61 ms | 32.3% bf16 MFU | 125481 tok/s step 10711/19560 | loss 3.415118 (+0.86z)| norm 0.2668 (+0.07z)| lr 3.71e-04 | 4195.61 ms | 32.2% bf16 MFU | 125455 tok/s step 10712/19560 | loss 3.368090 (-0.39z)| norm 0.2395 (-0.40z)| lr 3.71e-04 | 4169.56 ms | 32.4% bf16 MFU | 125470 tok/s step 10713/19560 | loss 3.358521 (-0.63z)| norm 0.2585 (-0.08z)| lr 3.71e-04 | 4187.84 ms | 32.2% bf16 MFU | 125456 tok/s step 10714/19560 | loss 3.326304 (-1.47z)| norm 0.2554 (-0.13z)| lr 3.71e-04 | 4174.43 ms | 32.3% bf16 MFU | 125463 tok/s step 10715/19560 | loss 3.393954 (+0.32z)| norm 0.2446 (-0.31z)| lr 3.71e-04 | 4171.25 ms | 32.4% bf16 MFU | 125474 tok/s step 10716/19560 | loss 3.423981 (+1.11z)| norm 0.2559 (-0.12z)| 
lr 3.71e-04 | 4172.91 ms | 32.4% bf16 MFU | 125482 tok/s step 10717/19560 | loss 3.349869 (-0.85z)| norm 0.2375 (-0.43z)| lr 3.71e-04 | 4165.43 ms | 32.4% bf16 MFU | 125502 tok/s step 10718/19560 | loss 3.422523 (+1.07z)| norm 0.2520 (-0.18z)| lr 3.70e-04 | 4166.81 ms | 32.4% bf16 MFU | 125518 tok/s step 10719/19560 | loss 3.370958 (-0.29z)| norm 0.2629 (+0.01z)| lr 3.70e-04 | 4164.71 ms | 32.4% bf16 MFU | 125536 tok/s step 10720/19560 | loss 3.410864 (+0.76z)| norm 0.2433 (-0.33z)| lr 3.70e-04 | 4168.67 ms | 32.4% bf16 MFU | 125548 tok/s step 10721/19560 | loss 3.478550 (+2.47z)| norm 0.2462 (-0.28z)| lr 3.70e-04 | 4166.85 ms | 32.4% bf16 MFU | 125562 tok/s step 10722/19560 | loss 3.360593 (-0.58z)| norm 0.2321 (-0.52z)| lr 3.70e-04 | 4168.27 ms | 32.4% bf16 MFU | 125573 tok/s step 10723/19560 | loss 3.396047 (+0.34z)| norm 0.2451 (-0.29z)| lr 3.70e-04 | 4171.04 ms | 32.4% bf16 MFU | 125579 tok/s step 10724/19560 | loss 3.364786 (-0.47z)| norm 0.2375 (-0.42z)| lr 3.70e-04 | 4168.62 ms | 32.4% bf16 MFU | 125588 tok/s step 10725/19560 | loss 3.358855 (-0.64z)| norm 0.2404 (-0.37z)| lr 3.70e-04 | 4168.46 ms | 32.4% bf16 MFU | 125598 tok/s step 10726/19560 | loss 3.380641 (-0.08z)| norm 0.2557 (-0.10z)| lr 3.70e-04 | 4176.33 ms | 32.3% bf16 MFU | 125595 tok/s step 10727/19560 | loss 3.353412 (-0.78z)| norm 0.2314 (-0.52z)| lr 3.70e-04 | 4174.39 ms | 32.3% bf16 MFU | 125595 tok/s step 10728/19560 | loss 3.414374 (+0.80z)| norm 0.2515 (-0.17z)| lr 3.70e-04 | 4171.91 ms | 32.4% bf16 MFU | 125599 tok/s step 10729/19560 | loss 3.390016 (+0.17z)| norm 0.2496 (-0.20z)| lr 3.70e-04 | 4170.20 ms | 32.4% bf16 MFU | 125605 tok/s step 10730/19560 | loss 3.336292 (-1.24z)| norm 0.2662 (+0.09z)| lr 3.70e-04 | 4171.83 ms | 32.4% bf16 MFU | 125608 tok/s step 10731/19560 | loss 3.357859 (-0.69z)| norm 0.2461 (-0.26z)| lr 3.70e-04 | 4166.36 ms | 32.4% bf16 MFU | 125620 tok/s step 10732/19560 | loss 3.396415 (+0.33z)| norm 0.2709 (+0.17z)| lr 3.69e-04 | 4164.88 ms | 32.4% bf16 MFU | 
125633 tok/s step 10733/19560 | loss 3.380314 (-0.10z)| norm 0.2503 (-0.19z)| lr 3.69e-04 | 4181.81 ms | 32.3% bf16 MFU | 125620 tok/s step 10734/19560 | loss 3.342739 (-1.07z)| norm 0.2538 (-0.13z)| lr 3.69e-04 | 4171.02 ms | 32.4% bf16 MFU | 125624 tok/s step 10735/19560 | loss 3.393651 (+0.27z)| norm 0.2560 (-0.09z)| lr 3.69e-04 | 4170.67 ms | 32.4% bf16 MFU | 125628 tok/s step 10736/19560 | loss 3.393502 (+0.27z)| norm 0.2560 (-0.08z)| lr 3.69e-04 | 4176.92 ms | 32.3% bf16 MFU | 125623 tok/s step 10737/19560 | loss 3.342768 (-1.05z)| norm 0.2555 (-0.09z)| lr 3.69e-04 | 4165.80 ms | 32.4% bf16 MFU | 125634 tok/s step 10738/19560 | loss 3.403583 (+0.53z)| norm 0.2207 (-0.69z)| lr 3.69e-04 | 4165.47 ms | 32.4% bf16 MFU | 125646 tok/s step 10739/19560 | loss 3.303795 (-2.06z)| norm 0.2709 (+0.18z)| lr 3.69e-04 | 4172.39 ms | 32.4% bf16 MFU | 125646 tok/s step 10740/19560 | loss 3.375227 (-0.20z)| norm 0.2511 (-0.16z)| lr 3.69e-04 | 4172.76 ms | 32.4% bf16 MFU | 125646 tok/s step 10741/19560 | loss 3.401294 (+0.46z)| norm 0.2616 (+0.03z)| lr 3.69e-04 | 4201.82 ms | 32.1% bf16 MFU | 125603 tok/s step 10742/19560 | loss 3.407647 (+0.62z)| norm 0.2649 (+0.08z)| lr 3.69e-04 | 4169.47 ms | 32.4% bf16 MFU | 125610 tok/s step 10743/19560 | loss 3.421939 (+0.98z)| norm 0.2589 (-0.02z)| lr 3.69e-04 | 4172.36 ms | 32.4% bf16 MFU | 125612 tok/s step 10744/19560 | loss 3.413989 (+0.76z)| norm 0.2584 (-0.03z)| lr 3.69e-04 | 4164.90 ms | 32.4% bf16 MFU | 125626 tok/s step 10745/19560 | loss 3.365692 (-0.52z)| norm 0.2556 (-0.08z)| lr 3.69e-04 | 4170.24 ms | 32.4% bf16 MFU | 125631 tok/s step 10746/19560 | loss 3.367285 (-0.47z)| norm 0.2417 (-0.32z)| lr 3.69e-04 | 4156.90 ms | 32.5% bf16 MFU | 125655 tok/s step 10747/19560 | loss 3.394063 (+0.25z)| norm 0.2448 (-0.26z)| lr 3.68e-04 | 4175.87 ms | 32.3% bf16 MFU | 125650 tok/s step 10748/19560 | loss 3.434189 (+1.33z)| norm 0.2335 (-0.45z)| lr 3.68e-04 | 4182.06 ms | 32.3% bf16 MFU | 125636 tok/s step 10749/19560 | loss 3.374575 
(-0.26z)| norm 0.2447 (-0.25z)| lr 3.68e-04 | 4160.40 ms | 32.5% bf16 MFU | 125655 tok/s step 10750/19560 | loss 3.413261 (+0.80z)| norm 0.2400 (-0.33z)| lr 3.68e-04 | 4166.88 ms | 32.4% bf16 MFU | 125663 tok/s val loss 3.364943
HellaSwag: 2937/10042 = 0.292472 step 10751/19560 | loss 3.417443 (+0.93z)| norm 0.2362 (-0.60z)| lr 3.68e-04 | 4167.80 ms | 32.4% bf16 MFU | 125670 tok/s step 10752/19560 | loss 3.362911 (-0.57z)| norm 0.2425 (-0.38z)| lr 3.68e-04 | 4174.37 ms | 32.3% bf16 MFU | 125666 tok/s step 10753/19560 | loss 3.306551 (-2.07z)| norm 0.2501 (-0.11z)| lr 3.68e-04 | 4201.61 ms | 32.1% bf16 MFU | 125622 tok/s step 10754/19560 | loss 3.385028 (+0.07z)| norm 0.2483 (-0.17z)| lr 3.68e-04 | 4167.83 ms | 32.4% bf16 MFU | 125631 tok/s step 10755/19560 | loss 3.337265 (-1.22z)| norm 0.2627 (+0.32z)| lr 3.68e-04 | 4159.93 ms | 32.5% bf16 MFU | 125651 tok/s step 10756/19560 | loss 3.423299 (+1.09z)| norm 0.2666 (+0.45z)| lr 3.68e-04 | 4207.55 ms | 32.1% bf16 MFU | 125599 tok/s step 10757/19560 | loss 3.512466 (+3.33z)| norm 0.2498 (-0.12z)| lr 3.68e-04 | 4234.61 ms | 31.9% bf16 MFU | 125509 tok/s step 10758/19560 | loss 3.306845 (-1.94z)| norm 0.2468 (-0.22z)| lr 3.68e-04 | 4331.40 ms | 31.2% bf16 MFU | 125286 tok/s step 10759/19560 | loss 3.427661 (+1.13z)| norm 0.2623 (+0.30z)| lr 3.68e-04 | 4162.17 ms | 32.4% bf16 MFU | 125320 tok/s step 10760/19560 | loss 3.368419 (-0.38z)| norm 0.2498 (-0.13z)| lr 3.68e-04 | 4272.42 ms | 31.6% bf16 MFU | 125190 tok/s step 10761/19560 | loss 3.336164 (-1.18z)| norm 0.2457 (-0.27z)| lr 3.68e-04 | 4164.18 ms | 32.4% bf16 MFU | 125225 tok/s step 10762/19560 | loss 3.361534 (-0.54z)| norm 0.2639 (+0.35z)| lr 3.67e-04 | 4167.75 ms | 32.4% bf16 MFU | 125254 tok/s step 10763/19560 | loss 3.382338 (-0.01z)| norm 0.2330 (-0.70z)| lr 3.67e-04 | 4168.65 ms | 32.4% bf16 MFU | 125280 tok/s step 10764/19560 | loss 3.342904 (-1.01z)| norm 0.2499 (-0.13z)| lr 3.67e-04 | 4176.58 ms | 32.3% bf16 MFU | 125292 tok/s step 10765/19560 | loss 3.331476 (-1.28z)| norm 0.2300 (-0.80z)| lr 3.67e-04 | 4233.48 ms | 31.9% 
bf16 MFU | 125220 tok/s step 10766/19560 | loss 3.371987 (-0.25z)| norm 0.2593 (+0.19z)| lr 3.67e-04 | 4159.29 ms | 32.5% bf16 MFU | 125261 tok/s step 10767/19560 | loss 3.458582 (+1.94z)| norm 0.2809 (+0.92z)| lr 3.67e-04 | 4203.11 ms | 32.1% bf16 MFU | 125235 tok/s step 10768/19560 | loss 3.406930 (+0.63z)| norm 0.2556 (+0.05z)| lr 3.67e-04 | 4209.46 ms | 32.1% bf16 MFU | 125201 tok/s step 10769/19560 | loss 3.413762 (+0.79z)| norm 0.2568 (+0.09z)| lr 3.67e-04 | 4165.52 ms | 32.4% bf16 MFU | 125234 tok/s step 10770/19560 | loss 3.311106 (-1.77z)| norm 0.2469 (-0.24z)| lr 3.67e-04 | 4340.61 ms | 31.1% bf16 MFU | 125012 tok/s step 10771/19560 | loss 3.379586 (-0.04z)| norm 0.2604 (+0.22z)| lr 3.67e-04 | 4164.43 ms | 32.4% bf16 MFU | 125056 tok/s step 10772/19560 | loss 3.400004 (+0.47z)| norm 0.2711 (+0.58z)| lr 3.67e-04 | 4161.30 ms | 32.4% bf16 MFU | 125103 tok/s step 10773/19560 | loss 3.384549 (+0.06z)| norm 0.2832 (+0.98z)| lr 3.67e-04 | 4168.90 ms | 32.4% bf16 MFU | 125136 tok/s step 10774/19560 | loss 3.373505 (-0.21z)| norm 0.2540 (-0.02z)| lr 3.67e-04 | 4160.77 ms | 32.5% bf16 MFU | 125179 tok/s step 10775/19560 | loss 3.416337 (+0.88z)| norm 0.2690 (+0.49z)| lr 3.67e-04 | 4171.28 ms | 32.4% bf16 MFU | 125205 tok/s step 10776/19560 | loss 3.362491 (-0.49z)| norm 0.2353 (-0.66z)| lr 3.67e-04 | 4162.69 ms | 32.4% bf16 MFU | 125242 tok/s step 10777/19560 | loss 3.376508 (-0.11z)| norm 0.2615 (+0.25z)| lr 3.66e-04 | 4175.98 ms | 32.3% bf16 MFU | 125257 tok/s step 10778/19560 | loss 3.395288 (+0.36z)| norm 0.2505 (-0.13z)| lr 3.66e-04 | 4162.48 ms | 32.4% bf16 MFU | 125292 tok/s step 10779/19560 | loss 3.410419 (+0.75z)| norm 0.2443 (-0.34z)| lr 3.66e-04 | 4165.55 ms | 32.4% bf16 MFU | 125321 tok/s step 10780/19560 | loss 3.337818 (-1.13z)| norm 0.2493 (-0.17z)| lr 3.66e-04 | 4171.27 ms | 32.4% bf16 MFU | 125339 tok/s step 10781/19560 | loss 3.422842 (+1.05z)| norm 0.2286 (-0.88z)| lr 3.66e-04 | 4219.65 ms | 32.0% bf16 MFU | 125285 tok/s step 10782/19560 | loss 
3.366702 (-0.41z)| norm 0.2704 (+0.57z)| lr 3.66e-04 | 4167.58 ms | 32.4% bf16 MFU | 125311 tok/s step 10783/19560 | loss 3.403721 (+0.56z)| norm 0.2457 (-0.29z)| lr 3.66e-04 | 4160.34 ms | 32.5% bf16 MFU | 125346 tok/s step 10784/19560 | loss 3.393401 (+0.30z)| norm 0.2610 (+0.24z)| lr 3.66e-04 | 4166.67 ms | 32.4% bf16 MFU | 125370 tok/s step 10785/19560 | loss 3.459952 (+2.00z)| norm 0.2617 (+0.25z)| lr 3.66e-04 | 4779.06 ms | 28.3% bf16 MFU | 124587 tok/s step 10786/19560 | loss 3.334835 (-1.22z)| norm 0.2678 (+0.46z)| lr 3.66e-04 | 4599.57 ms | 29.4% bf16 MFU | 124057 tok/s step 10787/19560 | loss 3.371120 (-0.29z)| norm 0.2611 (+0.22z)| lr 3.66e-04 | 4363.19 ms | 30.9% bf16 MFU | 123862 tok/s step 10788/19560 | loss 3.348449 (-0.86z)| norm 0.2527 (-0.07z)| lr 3.66e-04 | 4209.93 ms | 32.1% bf16 MFU | 123896 tok/s step 10789/19560 | loss 3.299858 (-2.06z)| norm 0.2497 (-0.18z)| lr 3.66e-04 | 4163.44 ms | 32.4% bf16 MFU | 123997 tok/s step 10790/19560 | loss 3.350647 (-0.79z)| norm 0.2410 (-0.47z)| lr 3.66e-04 | 4203.63 ms | 32.1% bf16 MFU | 124034 tok/s step 10791/19560 | loss 3.340826 (-1.04z)| norm 0.2336 (-0.73z)| lr 3.66e-04 | 4200.69 ms | 32.1% bf16 MFU | 124072 tok/s step 10792/19560 | loss 3.358241 (-0.59z)| norm 0.2601 (+0.19z)| lr 3.65e-04 | 4196.98 ms | 32.2% bf16 MFU | 124115 tok/s step 10793/19560 | loss 3.353056 (-0.73z)| norm 0.2252 (-1.02z)| lr 3.65e-04 | 4160.42 ms | 32.5% bf16 MFU | 124210 tok/s step 10794/19560 | loss 3.356468 (-0.65z)| norm 0.2664 (+0.41z)| lr 3.65e-04 | 4504.31 ms | 30.0% bf16 MFU | 123819 tok/s step 10795/19560 | loss 3.377055 (-0.13z)| norm 0.2546 (-0.00z)| lr 3.65e-04 | 4226.42 ms | 31.9% bf16 MFU | 123831 tok/s step 10796/19560 | loss 3.408561 (+0.67z)| norm 0.2643 (+0.33z)| lr 3.65e-04 | 4217.48 ms | 32.0% bf16 MFU | 123855 tok/s step 10797/19560 | loss 3.339740 (-1.08z)| norm 0.3557 (+3.33z)| lr 3.65e-04 | 4158.78 ms | 32.5% bf16 MFU | 123966 tok/s step 10798/19560 | loss 3.338498 (-1.10z)| norm 0.2792 (+0.78z)| lr 
3.65e-04 | 4170.20 ms | 32.4% bf16 MFU | 124054 tok/s step 10799/19560 | loss 3.338959 (-1.08z)| norm 0.2625 (+0.22z)| lr 3.65e-04 | 4170.58 ms | 32.4% bf16 MFU | 124136 tok/s step 10800/19560 | loss 3.348027 (-0.85z)| norm 0.2577 (+0.07z)| lr 3.65e-04 | 4160.92 ms | 32.4% bf16 MFU | 124230 tok/s step 10801/19560 | loss 3.323049 (-1.47z)| norm 0.2623 (+0.22z)| lr 3.65e-04 | 4162.11 ms | 32.4% bf16 MFU | 124317 tok/s step 10802/19560 | loss 3.340499 (-1.01z)| norm 0.2492 (-0.21z)| lr 3.65e-04 | 4162.00 ms | 32.4% bf16 MFU | 124399 tok/s step 10803/19560 | loss 3.324184 (-1.41z)| norm 0.2471 (-0.28z)| lr 3.65e-04 | 4165.81 ms | 32.4% bf16 MFU | 124472 tok/s step 10804/19560 | loss 3.311988 (-1.69z)| norm 0.2394 (-0.54z)| lr 3.65e-04 | 4154.38 ms | 32.5% bf16 MFU | 124559 tok/s step 10805/19560 | loss 3.442283 (+1.60z)| norm 0.2599 (+0.15z)| lr 3.65e-04 | 4159.58 ms | 32.5% bf16 MFU | 124633 tok/s step 10806/19560 | loss 3.360846 (-0.45z)| norm 0.2578 (+0.07z)| lr 3.65e-04 | 4167.17 ms | 32.4% bf16 MFU | 124692 tok/s step 10807/19560 | loss 3.345830 (-0.84z)| norm 0.2447 (-0.36z)| lr 3.64e-04 | 4160.69 ms | 32.5% bf16 MFU | 124758 tok/s step 10808/19560 | loss 3.345101 (-0.84z)| norm 0.2684 (+0.43z)| lr 3.64e-04 | 4166.05 ms | 32.4% bf16 MFU | 124812 tok/s step 10809/19560 | loss 3.363168 (-0.37z)| norm 0.2433 (-0.40z)| lr 3.64e-04 | 4159.76 ms | 32.5% bf16 MFU | 124874 tok/s step 10810/19560 | loss 3.364016 (-0.33z)| norm 0.2781 (+0.75z)| lr 3.64e-04 | 4161.55 ms | 32.4% bf16 MFU | 124929 tok/s step 10811/19560 | loss 3.371994 (-0.13z)| norm 0.2492 (-0.20z)| lr 3.64e-04 | 4168.28 ms | 32.4% bf16 MFU | 124972 tok/s step 10812/19560 | loss 3.343345 (-0.90z)| norm 0.2684 (+0.43z)| lr 3.64e-04 | 4164.29 ms | 32.4% bf16 MFU | 125018 tok/s step 10813/19560 | loss 3.381733 (+0.12z)| norm 0.2316 (-0.79z)| lr 3.64e-04 | 4165.58 ms | 32.4% bf16 MFU | 125060 tok/s step 10814/19560 | loss 3.355426 (-0.58z)| norm 0.2529 (-0.08z)| lr 3.64e-04 | 4163.84 ms | 32.4% bf16 MFU | 125103 
tok/s step 10815/19560 | loss 3.376785 (-0.00z)| norm 0.2448 (-0.35z)| lr 3.64e-04 | 4162.67 ms | 32.4% bf16 MFU | 125145 tok/s step 10816/19560 | loss 3.303170 (-1.97z)| norm 0.2521 (-0.11z)| lr 3.64e-04 | 4161.93 ms | 32.4% bf16 MFU | 125187 tok/s step 10817/19560 | loss 3.377382 (+0.03z)| norm 0.2538 (-0.05z)| lr 3.64e-04 | 4160.02 ms | 32.5% bf16 MFU | 125229 tok/s step 10818/19560 | loss 3.328523 (-1.26z)| norm 0.2468 (-0.28z)| lr 3.64e-04 | 4161.99 ms | 32.4% bf16 MFU | 125266 tok/s step 10819/19560 | loss 3.526611 (+3.77z)| norm 0.2617 (+0.21z)| lr 3.64e-04 | 4160.51 ms | 32.5% bf16 MFU | 125303 tok/s step 10820/19560 | loss 3.473543 (+2.39z)| norm 1.8762 (+11.03z)| lr 3.64e-04 | 4165.74 ms | 32.4% bf16 MFU | 125331 tok/s step 10821/19560 | loss 3.420951 (+1.06z)| norm 0.3292 (+0.42z)| lr 3.63e-04 | 4160.21 ms | 32.5% bf16 MFU | 125366 tok/s step 10822/19560 | loss 3.489354 (+2.66z)| norm 1.7304 (+7.45z)| lr 3.63e-04 | 4158.46 ms | 32.5% bf16 MFU | 125401 tok/s step 10823/19560 | loss 3.381794 (+0.07z)| norm 1.3260 (+4.82z)| lr 3.63e-04 | 4160.47 ms | 32.5% bf16 MFU | 125432 tok/s step 10824/19560 | loss 3.382005 (+0.07z)| norm 0.3085 (+0.09z)| lr 3.63e-04 | 4160.23 ms | 32.5% bf16 MFU | 125462 tok/s step 10825/19560 | loss 3.419628 (+0.97z)| norm 0.2995 (+0.05z)| lr 3.63e-04 | 4162.41 ms | 32.4% bf16 MFU | 125486 tok/s step 10826/19560 | loss 3.358406 (-0.51z)| norm 0.2823 (-0.03z)| lr 3.63e-04 | 4159.48 ms | 32.5% bf16 MFU | 125514 tok/s step 10827/19560 | loss 3.344905 (-0.83z)| norm 0.2729 (-0.08z)| lr 3.63e-04 | 4158.44 ms | 32.5% bf16 MFU | 125543 tok/s step 10828/19560 | loss 3.345517 (-0.81z)| norm 0.2769 (-0.06z)| lr 3.63e-04 | 4161.32 ms | 32.4% bf16 MFU | 125565 tok/s step 10829/19560 | loss 3.343830 (-0.84z)| norm 0.2451 (-0.21z)| lr 3.63e-04 | 4161.72 ms | 32.4% bf16 MFU | 125586 tok/s step 10830/19560 | loss 3.468380 (+2.10z)| norm 0.2663 (-0.11z)| lr 3.63e-04 | 4161.28 ms | 32.4% bf16 MFU | 125606 tok/s step 10831/19560 | loss 3.386290 
(+0.17z)| norm 0.2411 (-0.23z)| lr 3.63e-04 | 4159.35 ms | 32.5% bf16 MFU | 125628 tok/s step 10832/19560 | loss 3.432587 (+1.24z)| norm 0.2524 (-0.18z)| lr 3.63e-04 | 4157.26 ms | 32.5% bf16 MFU | 125653 tok/s step 10833/19560 | loss 3.393795 (+0.34z)| norm 0.2466 (-0.21z)| lr 3.63e-04 | 4161.71 ms | 32.4% bf16 MFU | 125669 tok/s step 10834/19560 | loss 3.421919 (+1.00z)| norm 0.2411 (-0.23z)| lr 3.63e-04 | 4160.83 ms | 32.4% bf16 MFU | 125686 tok/s step 10835/19560 | loss 3.392290 (+0.29z)| norm 0.2527 (-0.17z)| lr 3.63e-04 | 4161.47 ms | 32.4% bf16 MFU | 125701 tok/s step 10836/19560 | loss 3.328840 (-1.18z)| norm 0.2448 (-0.20z)| lr 3.62e-04 | 4165.36 ms | 32.4% bf16 MFU | 125709 tok/s step 10837/19560 | loss 3.343070 (-0.84z)| norm 0.2296 (-0.27z)| lr 3.62e-04 | 4159.47 ms | 32.5% bf16 MFU | 125726 tok/s step 10838/19560 | loss 3.381954 (+0.08z)| norm 0.2465 (-0.19z)| lr 3.62e-04 | 4165.20 ms | 32.4% bf16 MFU | 125733 tok/s step 10839/19560 | loss 3.378992 (+0.02z)| norm 0.2366 (-0.24z)| lr 3.62e-04 | 4163.68 ms | 32.4% bf16 MFU | 125743 tok/s step 10840/19560 | loss 3.417394 (+0.91z)| norm 0.2405 (-0.22z)| lr 3.62e-04 | 4176.85 ms | 32.3% bf16 MFU | 125732 tok/s step 10841/19560 | loss 3.431011 (+1.21z)| norm 0.2531 (-0.16z)| lr 3.62e-04 | 4160.85 ms | 32.4% bf16 MFU | 125745 tok/s step 10842/19560 | loss 3.350431 (-0.68z)| norm 0.2498 (-0.17z)| lr 3.62e-04 | 4161.27 ms | 32.4% bf16 MFU | 125758 tok/s step 10843/19560 | loss 3.371462 (-0.18z)| norm 0.2403 (-0.22z)| lr 3.62e-04 | 4163.75 ms | 32.4% bf16 MFU | 125766 tok/s step 10844/19560 | loss 3.382471 (+0.08z)| norm 0.2390 (-0.22z)| lr 3.62e-04 | 4159.89 ms | 32.5% bf16 MFU | 125779 tok/s step 10845/19560 | loss 3.342216 (-0.87z)| norm 0.2383 (-0.23z)| lr 3.62e-04 | 4169.98 ms | 32.4% bf16 MFU | 125777 tok/s step 10846/19560 | loss 3.340846 (-0.89z)| norm 0.2422 (-0.21z)| lr 3.62e-04 | 4160.86 ms | 32.4% bf16 MFU | 125788 tok/s step 10847/19560 | loss 3.317666 (-1.41z)| norm 0.2469 (-0.19z)| lr 3.62e-04 | 
4266.87 ms | 31.6% bf16 MFU | 125642 tok/s step 10848/19560 | loss 3.342587 (-0.82z)| norm 0.2343 (-0.25z)| lr 3.62e-04 | 4204.18 ms | 32.1% bf16 MFU | 125596 tok/s step 10849/19560 | loss 3.398101 (+0.51z)| norm 0.2630 (-0.11z)| lr 3.62e-04 | 4161.75 ms | 32.4% bf16 MFU | 125615 tok/s step 10850/19560 | loss 3.495621 (+2.74z)| norm 0.2611 (-0.12z)| lr 3.62e-04 | 4160.92 ms | 32.4% bf16 MFU | 125634 tok/s step 10851/19560 | loss 3.436957 (+1.36z)| norm 0.4436 (+0.72z)| lr 3.61e-04 | 4159.52 ms | 32.5% bf16 MFU | 125655 tok/s step 10852/19560 | loss 3.410424 (+0.74z)| norm 0.2600 (-0.13z)| lr 3.61e-04 | 4159.39 ms | 32.5% bf16 MFU | 125674 tok/s step 10853/19560 | loss 3.323184 (-1.26z)| norm 0.2638 (-0.12z)| lr 3.61e-04 | 4158.96 ms | 32.5% bf16 MFU | 125694 tok/s step 10854/19560 | loss 3.328672 (-1.12z)| norm 0.3161 (+0.12z)| lr 3.61e-04 | 4160.40 ms | 32.5% bf16 MFU | 125710 tok/s step 10855/19560 | loss 3.444469 (+1.50z)| norm 0.2885 (-0.01z)| lr 3.61e-04 | 4213.57 ms | 32.0% bf16 MFU | 125646 tok/s step 10856/19560 | loss 3.374551 (-0.08z)| norm 0.2801 (-0.05z)| lr 3.61e-04 | 4159.61 ms | 32.5% bf16 MFU | 125666 tok/s step 10857/19560 | loss 3.385635 (+0.17z)| norm 0.2663 (-0.11z)| lr 3.61e-04 | 4163.44 ms | 32.4% bf16 MFU | 125679 tok/s step 10858/19560 | loss 3.381897 (+0.08z)| norm 0.2506 (-0.18z)| lr 3.61e-04 | 4164.26 ms | 32.4% bf16 MFU | 125690 tok/s step 10859/19560 | loss 3.349880 (-0.65z)| norm 0.2451 (-0.21z)| lr 3.61e-04 | 4164.60 ms | 32.4% bf16 MFU | 125700 tok/s step 10860/19560 | loss 3.339365 (-0.88z)| norm 0.2480 (-0.20z)| lr 3.61e-04 | 4161.51 ms | 32.4% bf16 MFU | 125714 tok/s step 10861/19560 | loss 3.357302 (-0.47z)| norm 0.2341 (-0.26z)| lr 3.61e-04 | 4279.09 ms | 31.6% bf16 MFU | 125555 tok/s step 10862/19560 | loss 3.341553 (-0.82z)| norm 0.2538 (-0.17z)| lr 3.61e-04 | 4353.79 ms | 31.0% bf16 MFU | 125298 tok/s step 10863/19560 | loss 3.343290 (-0.77z)| norm 0.2468 (-0.20z)| lr 3.61e-04 | 4350.44 ms | 31.0% bf16 MFU | 125059 tok/s step 
10864/19560 | loss 3.353735 (-0.53z)| norm 0.2644 (-0.12z)| lr 3.61e-04 | 4275.18 ms | 31.6% bf16 MFU | 124938 tok/s step 10865/19560 | loss 3.354147 (-0.52z)| norm 0.2649 (-0.12z)| lr 3.61e-04 | 4174.80 ms | 32.3% bf16 MFU | 124970 tok/s step 10866/19560 | loss 3.339831 (-0.83z)| norm 0.2738 (-0.08z)| lr 3.60e-04 | 4161.62 ms | 32.4% bf16 MFU | 125021 tok/s step 10867/19560 | loss 3.327727 (-1.12z)| norm 0.2313 (-0.27z)| lr 3.60e-04 | 4178.53 ms | 32.3% bf16 MFU | 125043 tok/s step 10868/19560 | loss 3.364028 (-0.29z)| norm 0.7557 (+2.12z)| lr 3.60e-04 | 4160.95 ms | 32.4% bf16 MFU | 125091 tok/s step 10869/19560 | loss 3.377703 (+0.03z)| norm 0.2635 (-0.14z)| lr 3.60e-04 | 4242.27 ms | 31.8% bf16 MFU | 125016 tok/s step 10870/19560 | loss 3.395010 (+0.42z)| norm 0.2746 (-0.09z)| lr 3.60e-04 | 4158.00 ms | 32.5% bf16 MFU | 125070 tok/s step 10871/19560 | loss 3.366821 (-0.21z)| norm 0.2531 (-0.19z)| lr 3.60e-04 | 4250.30 ms | 31.8% bf16 MFU | 124984 tok/s step 10872/19560 | loss 3.389132 (+0.31z)| norm 0.2589 (-0.16z)| lr 3.60e-04 | 4161.56 ms | 32.4% bf16 MFU | 125034 tok/s step 10873/19560 | loss 3.313437 (-1.41z)| norm 0.2686 (-0.12z)| lr 3.60e-04 | 4160.42 ms | 32.5% bf16 MFU | 125083 tok/s step 10874/19560 | loss 3.354777 (-0.47z)| norm 0.2545 (-0.18z)| lr 3.60e-04 | 4163.40 ms | 32.4% bf16 MFU | 125125 tok/s step 10875/19560 | loss 3.365122 (-0.23z)| norm 0.2515 (-0.20z)| lr 3.60e-04 | 4161.85 ms | 32.4% bf16 MFU | 125168 tok/s step 10876/19560 | loss 3.305184 (-1.57z)| norm 0.2480 (-0.21z)| lr 3.60e-04 | 4162.58 ms | 32.4% bf16 MFU | 125207 tok/s step 10877/19560 | loss 3.313284 (-1.36z)| norm 0.2750 (-0.09z)| lr 3.60e-04 | 4160.90 ms | 32.4% bf16 MFU | 125247 tok/s step 10878/19560 | loss 3.356525 (-0.38z)| norm 0.2660 (-0.13z)| lr 3.60e-04 | 4159.88 ms | 32.5% bf16 MFU | 125286 tok/s step 10879/19560 | loss 3.405088 (+0.73z)| norm 0.2322 (-0.29z)| lr 3.60e-04 | 4208.75 ms | 32.1% bf16 MFU | 125250 tok/s step 10880/19560 | loss 3.341756 (-0.71z)| norm 
0.2597 (-0.16z)| lr 3.60e-04 | 4160.41 ms | 32.5% bf16 MFU | 125289 tok/s step 10881/19560 | loss 3.368263 (-0.12z)| norm 0.2504 (-0.20z)| lr 3.59e-04 | 4151.17 ms | 32.5% bf16 MFU | 125339 tok/s step 10882/19560 | loss 3.412855 (+0.89z)| norm 0.2276 (-0.31z)| lr 3.59e-04 | 4154.72 ms | 32.5% bf16 MFU | 125382 tok/s step 10883/19560 | loss 3.408835 (+0.79z)| norm 0.2687 (-0.12z)| lr 3.59e-04 | 4155.25 ms | 32.5% bf16 MFU | 125422 tok/s step 10884/19560 | loss 3.349455 (-0.55z)| norm 0.2550 (-0.18z)| lr 3.59e-04 | 4157.05 ms | 32.5% bf16 MFU | 125456 tok/s step 10885/19560 | loss 3.341950 (-0.72z)| norm 0.2368 (-0.27z)| lr 3.59e-04 | 4164.68 ms | 32.4% bf16 MFU | 125478 tok/s step 10886/19560 | loss 3.371388 (-0.03z)| norm 0.2648 (-0.14z)| lr 3.59e-04 | 4166.64 ms | 32.4% bf16 MFU | 125496 tok/s step 10887/19560 | loss 3.360650 (-0.28z)| norm 0.2383 (-0.26z)| lr 3.59e-04 | 4159.97 ms | 32.5% bf16 MFU | 125522 tok/s step 10888/19560 | loss 3.318165 (-1.29z)| norm 0.2544 (-0.18z)| lr 3.59e-04 | 4160.10 ms | 32.5% bf16 MFU | 125548 tok/s step 10889/19560 | loss 3.328979 (-1.03z)| norm 0.6460 (+1.58z)| lr 3.59e-04 | 4156.32 ms | 32.5% bf16 MFU | 125577 tok/s step 10890/19560 | loss 3.382824 (+0.26z)| norm 0.2530 (-0.20z)| lr 3.59e-04 | 4950.56 ms | 27.3% bf16 MFU | 124594 tok/s step 10891/19560 | loss 3.387049 (+0.36z)| norm 0.2704 (-0.13z)| lr 3.59e-04 | 4156.82 ms | 32.5% bf16 MFU | 124670 tok/s step 10892/19560 | loss 3.371110 (-0.03z)| norm 0.2615 (-0.17z)| lr 3.59e-04 | 4160.11 ms | 32.5% bf16 MFU | 124738 tok/s step 10893/19560 | loss 3.412912 (+0.96z)| norm 0.2579 (-0.18z)| lr 3.59e-04 | 4162.45 ms | 32.4% bf16 MFU | 124799 tok/s step 10894/19560 | loss 3.392093 (+0.46z)| norm 0.2757 (-0.10z)| lr 3.59e-04 | 4163.55 ms | 32.4% bf16 MFU | 124855 tok/s step 10895/19560 | loss 3.372379 (+0.00z)| norm 0.2680 (-0.14z)| lr 3.59e-04 | 4159.89 ms | 32.5% bf16 MFU | 124914 tok/s step 10896/19560 | loss 3.316216 (-1.35z)| norm 0.2613 (-0.17z)| lr 3.58e-04 | 4158.75 ms | 
32.5% bf16 MFU | 124972 tok/s step 10897/19560 | loss 3.319866 (-1.24z)| norm 0.2498 (-0.22z)| lr 3.58e-04 | 4163.51 ms | 32.4% bf16 MFU | 125020 tok/s step 10898/19560 | loss 3.462141 (+2.16z)| norm 0.2649 (-0.15z)| lr 3.58e-04 | 4161.75 ms | 32.4% bf16 MFU | 125068 tok/s step 10899/19560 | loss 3.341449 (-0.73z)| norm 0.2468 (-0.23z)| lr 3.58e-04 | 4163.52 ms | 32.4% bf16 MFU | 125110 tok/s step 10900/19560 | loss 3.385381 (+0.33z)| norm 0.2669 (-0.14z)| lr 3.58e-04 | 4164.98 ms | 32.4% bf16 MFU | 125149 tok/s step 10901/19560 | loss 3.391240 (+0.47z)| norm 0.2381 (-0.27z)| lr 3.58e-04 | 4161.26 ms | 32.4% bf16 MFU | 125191 tok/s step 10902/19560 | loss 3.407937 (+0.86z)| norm 0.2593 (-0.18z)| lr 3.58e-04 | 4159.77 ms | 32.5% bf16 MFU | 125233 tok/s step 10903/19560 | loss 3.382275 (+0.25z)| norm 0.2849 (-0.06z)| lr 3.58e-04 | 4159.09 ms | 32.5% bf16 MFU | 125275 tok/s step 10904/19560 | loss 3.357906 (-0.33z)| norm 0.2574 (-0.19z)| lr 3.58e-04 | 4160.19 ms | 32.5% bf16 MFU | 125312 tok/s step 10905/19560 | loss 3.386532 (+0.35z)| norm 0.2435 (-0.25z)| lr 3.58e-04 | 4165.27 ms | 32.4% bf16 MFU | 125340 tok/s step 10906/19560 | loss 3.343526 (-0.67z)| norm 0.2563 (-0.19z)| lr 3.58e-04 | 4162.02 ms | 32.4% bf16 MFU | 125372 tok/s step 10907/19560 | loss 3.443742 (+1.72z)| norm 0.2383 (-0.27z)| lr 3.58e-04 | 4160.28 ms | 32.5% bf16 MFU | 125404 tok/s step 10908/19560 | loss 3.408908 (+0.87z)| norm 0.2397 (-0.27z)| lr 3.58e-04 | 4165.26 ms | 32.4% bf16 MFU | 125428 tok/s step 10909/19560 | loss 3.382290 (+0.25z)| norm 0.2565 (-0.19z)| lr 3.58e-04 | 4164.66 ms | 32.4% bf16 MFU | 125451 tok/s step 10910/19560 | loss 3.389742 (+0.42z)| norm 0.2348 (-0.29z)| lr 3.57e-04 | 4163.78 ms | 32.4% bf16 MFU | 125474 tok/s step 10911/19560 | loss 3.385591 (+0.33z)| norm 0.2606 (-0.17z)| lr 3.57e-04 | 4164.03 ms | 32.4% bf16 MFU | 125496 tok/s step 10912/19560 | loss 3.413860 (+1.00z)| norm 0.2479 (-0.23z)| lr 3.57e-04 | 4167.66 ms | 32.4% bf16 MFU | 125511 tok/s step 10913/19560 
| loss 3.399727 (+0.68z)| norm 0.2453 (-0.24z)| lr 3.57e-04 | 4159.91 ms | 32.5% bf16 MFU | 125537 tok/s step 10914/19560 | loss 3.329858 (-1.01z)| norm 0.2551 (-0.19z)| lr 3.57e-04 | 4164.69 ms | 32.4% bf16 MFU | 125555 tok/s step 10915/19560 | loss 3.367077 (-0.11z)| norm 0.2430 (-0.25z)| lr 3.57e-04 | 4166.98 ms | 32.4% bf16 MFU | 125568 tok/s step 10916/19560 | loss 3.340084 (-0.76z)| norm 0.2644 (-0.15z)| lr 3.57e-04 | 4161.08 ms | 32.4% bf16 MFU | 125589 tok/s step 10917/19560 | loss 3.381126 (+0.22z)| norm 0.2500 (-0.22z)| lr 3.57e-04 | 4164.55 ms | 32.4% bf16 MFU | 125605 tok/s step 10918/19560 | loss 3.379273 (+0.17z)| norm 0.2622 (-0.16z)| lr 3.57e-04 | 4163.64 ms | 32.4% bf16 MFU | 125620 tok/s step 10919/19560 | loss 3.341677 (-0.75z)| norm 0.2507 (-0.22z)| lr 3.57e-04 | 4167.30 ms | 32.4% bf16 MFU | 125630 tok/s step 10920/19560 | loss 3.357363 (-0.37z)| norm 0.2553 (-0.19z)| lr 3.57e-04 | 4167.53 ms | 32.4% bf16 MFU | 125638 tok/s step 10921/19560 | loss 3.378925 (+0.16z)| norm 0.2527 (-0.21z)| lr 3.57e-04 | 4166.28 ms | 32.4% bf16 MFU | 125649 tok/s step 10922/19560 | loss 3.332355 (-0.98z)| norm 0.2491 (-0.22z)| lr 3.57e-04 | 4165.56 ms | 32.4% bf16 MFU | 125659 tok/s step 10923/19560 | loss 3.367709 (-0.11z)| norm 0.2506 (-0.22z)| lr 3.57e-04 | 4167.90 ms | 32.4% bf16 MFU | 125666 tok/s step 10924/19560 | loss 3.329506 (-1.03z)| norm 0.2682 (-0.14z)| lr 3.57e-04 | 4164.04 ms | 32.4% bf16 MFU | 125678 tok/s step 10925/19560 | loss 3.339912 (-0.78z)| norm 0.2583 (-0.18z)| lr 3.56e-04 | 4163.89 ms | 32.4% bf16 MFU | 125690 tok/s step 10926/19560 | loss 3.415327 (+1.05z)| norm 0.2840 (-0.06z)| lr 3.56e-04 | 4163.34 ms | 32.4% bf16 MFU | 125702 tok/s step 10927/19560 | loss 3.365743 (-0.16z)| norm 0.2663 (-0.14z)| lr 3.56e-04 | 4160.31 ms | 32.5% bf16 MFU | 125718 tok/s step 10928/19560 | loss 3.346607 (-0.63z)| norm 0.2596 (-0.17z)| lr 3.56e-04 | 4163.31 ms | 32.4% bf16 MFU | 125728 tok/s step 10929/19560 | loss 3.345503 (-0.67z)| norm 0.2544 (-0.20z)| 
lr 3.56e-04 | 4165.65 ms | 32.4% bf16 MFU | 125735 tok/s step 10930/19560 | loss 3.422781 (+1.21z)| norm 0.2445 (-0.24z)| lr 3.56e-04 | 4166.41 ms | 32.4% bf16 MFU | 125740 tok/s step 10931/19560 | loss 3.288830 (-2.04z)| norm 0.2663 (-0.14z)| lr 3.56e-04 | 4164.16 ms | 32.4% bf16 MFU | 125748 tok/s step 10932/19560 | loss 3.317519 (-1.35z)| norm 0.2362 (-0.28z)| lr 3.56e-04 | 4162.96 ms | 32.4% bf16 MFU | 125758 tok/s step 10933/19560 | loss 3.364871 (-0.19z)| norm 0.2372 (-0.27z)| lr 3.56e-04 | 4164.00 ms | 32.4% bf16 MFU | 125766 tok/s step 10934/19560 | loss 3.325563 (-1.14z)| norm 0.2512 (-0.21z)| lr 3.56e-04 | 4160.16 ms | 32.5% bf16 MFU | 125779 tok/s step 10935/19560 | loss 3.352760 (-0.47z)| norm 0.2537 (-0.20z)| lr 3.56e-04 | 4161.54 ms | 32.4% bf16 MFU | 125789 tok/s step 10936/19560 | loss 3.340105 (-0.78z)| norm 0.6082 (+1.39z)| lr 3.56e-04 | 4166.23 ms | 32.4% bf16 MFU | 125791 tok/s step 10937/19560 | loss 3.316028 (-1.35z)| norm 0.2462 (-0.24z)| lr 3.56e-04 | 4162.86 ms | 32.4% bf16 MFU | 125799 tok/s step 10938/19560 | loss 3.343991 (-0.67z)| norm 0.2882 (-0.05z)| lr 3.56e-04 | 4162.44 ms | 32.4% bf16 MFU | 125807 tok/s step 10939/19560 | loss 3.362921 (-0.21z)| norm 0.2593 (-0.18z)| lr 3.56e-04 | 4164.09 ms | 32.4% bf16 MFU | 125812 tok/s step 10940/19560 | loss 3.449842 (+1.85z)| norm 0.2789 (-0.10z)| lr 3.55e-04 | 4163.04 ms | 32.4% bf16 MFU | 125818 tok/s step 10941/19560 | loss 3.361511 (-0.26z)| norm 0.2659 (-0.16z)| lr 3.55e-04 | 4166.03 ms | 32.4% bf16 MFU | 125820 tok/s step 10942/19560 | loss 3.324810 (-1.12z)| norm 0.2549 (-0.21z)| lr 3.55e-04 | 4161.98 ms | 32.4% bf16 MFU | 125827 tok/s step 10943/19560 | loss 3.409519 (+0.88z)| norm 0.2784 (-0.10z)| lr 3.55e-04 | 4162.71 ms | 32.4% bf16 MFU | 125833 tok/s step 10944/19560 | loss 3.439456 (+1.57z)| norm 0.2448 (-0.25z)| lr 3.55e-04 | 4163.82 ms | 32.4% bf16 MFU | 125838 tok/s step 10945/19560 | loss 3.308957 (-1.50z)| norm 0.2706 (-0.14z)| lr 3.55e-04 | 4160.98 ms | 32.4% bf16 MFU | 
125846 tok/s step 10946/19560 | loss 3.344829 (-0.66z)| norm 0.2584 (-0.19z)| lr 3.55e-04 | 4161.15 ms | 32.4% bf16 MFU | 125853 tok/s step 10947/19560 | loss 3.355114 (-0.41z)| norm 0.2649 (-0.16z)| lr 3.55e-04 | 4195.00 ms | 32.2% bf16 MFU | 125810 tok/s step 10948/19560 | loss 3.331125 (-1.00z)| norm 0.2365 (-0.30z)| lr 3.55e-04 | 4149.39 ms | 32.5% bf16 MFU | 125837 tok/s step 10949/19560 | loss 3.387988 (+0.45z)| norm 0.2760 (-0.07z)| lr 3.55e-04 | 4152.81 ms | 32.5% bf16 MFU | 125857 tok/s step 10950/19560 | loss 3.443115 (+1.92z)| norm 1.4048 (+7.38z)| lr 3.55e-04 | 4154.74 ms | 32.5% bf16 MFU | 125874 tok/s step 10951/19560 | loss 3.390275 (+0.53z)| norm 0.3115 (+0.28z)| lr 3.55e-04 | 4152.76 ms | 32.5% bf16 MFU | 125893 tok/s step 10952/19560 | loss 3.359858 (-0.26z)| norm 0.2713 (-0.05z)| lr 3.55e-04 | 4156.12 ms | 32.5% bf16 MFU | 125906 tok/s step 10953/19560 | loss 3.382404 (+0.34z)| norm 0.2921 (+0.13z)| lr 3.55e-04 | 4155.97 ms | 32.5% bf16 MFU | 125918 tok/s step 10954/19560 | loss 3.342669 (-0.70z)| norm 0.2688 (-0.07z)| lr 3.55e-04 | 4145.43 ms | 32.6% bf16 MFU | 125946 tok/s step 10955/19560 | loss 3.322266 (-1.23z)| norm 0.2904 (+0.11z)| lr 3.54e-04 | 4151.90 ms | 32.5% bf16 MFU | 125962 tok/s step 10956/19560 | loss 3.356072 (-0.35z)| norm 0.2655 (-0.09z)| lr 3.54e-04 | 4155.91 ms | 32.5% bf16 MFU | 125972 tok/s step 10957/19560 | loss 3.406595 (+0.96z)| norm 0.2605 (-0.14z)| lr 3.54e-04 | 4158.31 ms | 32.5% bf16 MFU | 125977 tok/s step 10958/19560 | loss 3.417413 (+1.28z)| norm 0.2840 (+0.06z)| lr 3.54e-04 | 4152.71 ms | 32.5% bf16 MFU | 125991 tok/s step 10959/19560 | loss 3.383425 (+0.38z)| norm 1.8806 (+8.58z)| lr 3.54e-04 | 4155.56 ms | 32.5% bf16 MFU | 126000 tok/s step 10960/19560 | loss 3.366589 (-0.06z)| norm 0.3744 (+0.45z)| lr 3.54e-04 | 4154.33 ms | 32.5% bf16 MFU | 126010 tok/s step 10961/19560 | loss 3.394361 (+0.69z)| norm 0.3395 (+0.26z)| lr 3.54e-04 | 4157.34 ms | 32.5% bf16 MFU | 126015 tok/s step 10962/19560 | loss 3.373783 
(+0.15z)| norm 0.4136 (+0.65z)| lr 3.54e-04 | 4159.76 ms | 32.5% bf16 MFU | 126016 tok/s step 10963/19560 | loss 3.388752 (+0.55z)| norm 0.2834 (-0.05z)| lr 3.54e-04 | 4158.92 ms | 32.5% bf16 MFU | 126019 tok/s step 10964/19560 | loss 3.383411 (+0.40z)| norm 0.3709 (+0.41z)| lr 3.54e-04 | 4158.69 ms | 32.5% bf16 MFU | 126021 tok/s step 10965/19560 | loss 3.339880 (-0.79z)| norm 0.2774 (-0.09z)| lr 3.54e-04 | 4345.17 ms | 31.1% bf16 MFU | 125753 tok/s step 10966/19560 | loss 3.387606 (+0.51z)| norm 0.2985 (+0.02z)| lr 3.54e-04 | 4158.00 ms | 32.5% bf16 MFU | 125770 tok/s step 10967/19560 | loss 3.368040 (-0.02z)| norm 0.2657 (-0.16z)| lr 3.54e-04 | 4157.29 ms | 32.5% bf16 MFU | 125787 tok/s step 10968/19560 | loss 3.328747 (-1.08z)| norm 0.2736 (-0.12z)| lr 3.54e-04 | 4155.90 ms | 32.5% bf16 MFU | 125806 tok/s step 10969/19560 | loss 3.456293 (+2.38z)| norm 0.2610 (-0.19z)| lr 3.54e-04 | 4158.91 ms | 32.5% bf16 MFU | 125818 tok/s step 10970/19560 | loss 3.418468 (+1.34z)| norm 0.2658 (-0.16z)| lr 3.53e-04 | 4160.17 ms | 32.5% bf16 MFU | 125829 tok/s step 10971/19560 | loss 3.373075 (+0.11z)| norm 0.2510 (-0.24z)| lr 3.53e-04 | 4160.51 ms | 32.5% bf16 MFU | 125838 tok/s step 10972/19560 | loss 3.419162 (+1.34z)| norm 0.2620 (-0.18z)| lr 3.53e-04 | 4158.82 ms | 32.5% bf16 MFU | 125850 tok/s step 10973/19560 | loss 3.320960 (-1.28z)| norm 0.2566 (-0.21z)| lr 3.53e-04 | 4157.78 ms | 32.5% bf16 MFU | 125862 tok/s step 10974/19560 | loss 3.431777 (+1.64z)| norm 0.2555 (-0.22z)| lr 3.53e-04 | 4158.08 ms | 32.5% bf16 MFU | 125873 tok/s step 10975/19560 | loss 3.387100 (+0.45z)| norm 0.2550 (-0.22z)| lr 3.53e-04 | 4496.74 ms | 30.0% bf16 MFU | 125409 tok/s step 10976/19560 | loss 3.360107 (-0.27z)| norm 0.2843 (-0.07z)| lr 3.53e-04 | 6877.97 ms | 19.6% bf16 MFU | 122950 tok/s step 10977/19560 | loss 3.386606 (+0.44z)| norm 0.2631 (-0.18z)| lr 3.53e-04 | 4600.23 ms | 29.4% bf16 MFU | 122501 tok/s step 10978/19560 | loss 3.412215 (+1.18z)| norm 0.2663 (-0.16z)| lr 3.53e-04 | 
4319.25 ms | 31.3% bf16 MFU | 122445 tok/s step 10979/19560 | loss 3.388700 (+0.55z)| norm 0.2628 (-0.18z)| lr 3.53e-04 | 4260.74 ms | 31.7% bf16 MFU | 122476 tok/s step 10980/19560 | loss 3.332407 (-1.02z)| norm 0.2546 (-0.22z)| lr 3.53e-04 | 4306.72 ms | 31.4% bf16 MFU | 122439 tok/s step 10981/19560 | loss 3.328895 (-1.12z)| norm 0.2516 (-0.24z)| lr 3.53e-04 | 4208.62 ms | 32.1% bf16 MFU | 122545 tok/s step 10982/19560 | loss 3.406940 (+1.06z)| norm 0.2471 (-0.26z)| lr 3.53e-04 | 4305.42 ms | 31.4% bf16 MFU | 122507 tok/s step 10983/19560 | loss 3.420039 (+1.45z)| norm 0.2517 (-0.23z)| lr 3.53e-04 | 4212.84 ms | 32.0% bf16 MFU | 122604 tok/s step 10984/19560 | loss 3.384732 (+0.44z)| norm 0.2620 (-0.17z)| lr 3.53e-04 | 4168.31 ms | 32.4% bf16 MFU | 122763 tok/s step 10985/19560 | loss 3.411390 (+1.19z)| norm 0.2628 (-0.17z)| lr 3.52e-04 | 4306.19 ms | 31.4% bf16 MFU | 122712 tok/s step 10986/19560 | loss 3.380861 (+0.33z)| norm 0.2626 (-0.17z)| lr 3.52e-04 | 4193.65 ms | 32.2% bf16 MFU | 122828 tok/s step 10987/19560 | loss 3.375068 (+0.16z)| norm 0.2465 (-0.26z)| lr 3.52e-04 | 4159.15 ms | 32.5% bf16 MFU | 122989 tok/s step 10988/19560 | loss 3.419813 (+1.40z)| norm 0.2652 (-0.16z)| lr 3.52e-04 | 4258.27 ms | 31.7% bf16 MFU | 122996 tok/s step 10989/19560 | loss 3.435221 (+1.80z)| norm 0.2529 (-0.22z)| lr 3.52e-04 | 4207.58 ms | 32.1% bf16 MFU | 123076 tok/s step 10990/19560 | loss 3.408527 (+1.04z)| norm 0.2768 (-0.10z)| lr 3.52e-04 | 4158.58 ms | 32.5% bf16 MFU | 123226 tok/s step 10991/19560 | loss 3.380384 (+0.25z)| norm 0.2339 (-0.33z)| lr 3.52e-04 | 4167.14 ms | 32.4% bf16 MFU | 123356 tok/s step 10992/19560 | loss 3.484572 (+3.01z)| norm 0.2696 (-0.14z)| lr 3.52e-04 | 4195.08 ms | 32.2% bf16 MFU | 123437 tok/s step 10993/19560 | loss 3.407305 (+0.92z)| norm 0.2302 (-0.35z)| lr 3.52e-04 | 4207.17 ms | 32.1% bf16 MFU | 123496 tok/s step 10994/19560 | loss 3.528980 (+3.90z)| norm 0.2488 (-0.24z)| lr 3.52e-04 | 4170.87 ms | 32.4% bf16 MFU | 123606 tok/s step 
10995/19560 | loss 3.467247 (+2.28z)| norm 0.2485 (-0.25z)| lr 3.52e-04 | 4154.64 ms | 32.5% bf16 MFU | 123735 tok/s step 10996/19560 | loss 3.392617 (+0.42z)| norm 0.2597 (-0.17z)| lr 3.52e-04 | 4156.97 ms | 32.5% bf16 MFU | 123855 tok/s step 10997/19560 | loss 3.376880 (+0.03z)| norm 0.2564 (-0.19z)| lr 3.52e-04 | 4164.14 ms | 32.4% bf16 MFU | 123957 tok/s step 10998/19560 | loss 3.450922 (+1.83z)| norm 0.2786 (-0.07z)| lr 3.52e-04 | 4207.61 ms | 32.1% bf16 MFU | 123990 tok/s step 10999/19560 | loss 3.401302 (+0.61z)| norm 0.2870 (-0.02z)| lr 3.52e-04 | 4234.58 ms | 31.9% bf16 MFU | 123981 tok/s step 11000/19560 | loss 3.362094 (-0.35z)| norm 0.2379 (-0.29z)| lr 3.51e-04 | 4169.11 ms | 32.4% bf16 MFU | 124069 tok/s val loss 3.361578 
HellaSwag: 2915/10042 = 0.290281 step 11001/19560 | loss 3.415478 (+0.95z)| norm 0.2767 (-0.08z)| lr 3.51e-04 | 4186.47 ms | 32.3% bf16 MFU | 124128 tok/s step 11002/19560 | loss 3.353442 (-0.58z)| norm 0.2399 (-0.28z)| lr 3.51e-04 | 4167.01 ms | 32.4% bf16 MFU | 124212 tok/s step 11003/19560 | loss 3.368251 (-0.22z)| norm 0.2539 (-0.20z)| lr 3.51e-04 | 4167.26 ms | 32.4% bf16 MFU | 124292 tok/s step 11004/19560 | loss 3.396550 (+0.47z)| norm 0.2324 (-0.32z)| lr 3.51e-04 | 4175.52 ms | 32.3% bf16 MFU | 124356 tok/s step 11005/19560 | loss 3.338991 (-0.98z)| norm 0.2385 (-0.28z)| lr 3.51e-04 | 4234.41 ms | 31.9% bf16 MFU | 124329 tok/s step 11006/19560 | loss 3.417307 (+0.97z)| norm 0.2523 (-0.21z)| lr 3.51e-04 | 4214.01 ms | 32.0% bf16 MFU | 124333 tok/s step 11007/19560 | loss 3.403549 (+0.63z)| norm 0.2409 (-0.27z)| lr 3.51e-04 | 4180.25 ms | 32.3% bf16 MFU | 124387 tok/s step 11008/19560 | loss 3.422806 (+1.09z)| norm 0.2527 (-0.21z)| lr 3.51e-04 | 4159.54 ms | 32.5% bf16 MFU | 124470 tok/s step 11009/19560 | loss 3.377883 (-0.03z)| norm 0.2422 (-0.26z)| lr 3.51e-04 | 4197.22 ms | 32.2% bf16 MFU | 124492 tok/s step 11010/19560 | loss 3.373796 (-0.13z)| norm 0.2471 (-0.24z)| lr 3.51e-04 | 4176.00 ms | 32.3% bf16 MFU | 124545 tok/s step 11011/19560 | loss 3.533191 (+3.65z)| norm 0.2718 (-0.10z)| lr 
3.51e-04 | 4170.59 ms | 32.4% bf16 MFU | 124603 tok/s step 11012/19560 | loss 3.370097 (-0.24z)| norm 0.2394 (-0.28z)| lr 3.51e-04 | 4250.38 ms | 31.8% bf16 MFU | 124541 tok/s step 11013/19560 | loss 3.341153 (-0.93z)| norm 0.2449 (-0.25z)| lr 3.51e-04 | 4165.64 ms | 32.4% bf16 MFU | 124607 tok/s step 11014/19560 | loss 3.408273 (+0.67z)| norm 0.2537 (-0.20z)| lr 3.50e-04 | 4180.49 ms | 32.3% bf16 MFU | 124647 tok/s step 11015/19560 | loss 3.428466 (+1.13z)| norm 0.2331 (-0.31z)| lr 3.50e-04 | 4170.33 ms | 32.4% bf16 MFU | 124701 tok/s step 11016/19560 | loss 3.373126 (-0.19z)| norm 0.2391 (-0.28z)| lr 3.50e-04 | 4168.97 ms | 32.4% bf16 MFU | 124754 tok/s step 11017/19560 | loss 3.401439 (+0.47z)| norm 0.2334 (-0.30z)| lr 3.50e-04 | 4174.61 ms | 32.3% bf16 MFU | 124795 tok/s step 11018/19560 | loss 3.374250 (-0.18z)| norm 0.2500 (-0.20z)| lr 3.50e-04 | 4174.67 ms | 32.3% bf16 MFU | 124835 tok/s step 11019/19560 | loss 3.344427 (-0.89z)| norm 0.2406 (-0.26z)| lr 3.50e-04 | 4164.75 ms | 32.4% bf16 MFU | 124888 tok/s step 11020/19560 | loss 3.381660 (+0.00z)| norm 0.2436 (-0.24z)| lr 3.50e-04 | 4154.52 ms | 32.5% bf16 MFU | 124953 tok/s step 11021/19560 | loss 3.388982 (+0.18z)| norm 0.2404 (-0.26z)| lr 3.50e-04 | 4164.64 ms | 32.4% bf16 MFU | 125000 tok/s step 11022/19560 | loss 3.432923 (+1.23z)| norm 0.2503 (-0.20z)| lr 3.50e-04 | 4216.68 ms | 32.0% bf16 MFU | 124967 tok/s step 11023/19560 | loss 3.471120 (+2.08z)| norm 0.2664 (-0.11z)| lr 3.50e-04 | 4161.74 ms | 32.4% bf16 MFU | 125017 tok/s step 11024/19560 | loss 3.374776 (-0.19z)| norm 0.2347 (-0.29z)| lr 3.50e-04 | 4163.05 ms | 32.4% bf16 MFU | 125063 tok/s step 11025/19560 | loss 3.432020 (+1.15z)| norm 0.2697 (-0.09z)| lr 3.50e-04 | 4164.00 ms | 32.4% bf16 MFU | 125106 tok/s step 11026/19560 | loss 3.403843 (+0.50z)| norm 0.2741 (-0.07z)| lr 3.50e-04 | 4159.83 ms | 32.5% bf16 MFU | 125152 tok/s step 11027/19560 | loss 3.348637 (-0.84z)| norm 0.2730 (-0.07z)| lr 3.50e-04 | 4167.66 ms | 32.4% bf16 MFU | 125185 
tok/s step 11028/19560 | loss 3.387176 (+0.09z)| norm 0.2389 (-0.26z)| lr 3.50e-04 | 4168.70 ms | 32.4% bf16 MFU | 125214 tok/s step 11029/19560 | loss 3.457250 (+1.75z)| norm 0.2764 (-0.05z)| lr 3.49e-04 | 4162.60 ms | 32.4% bf16 MFU | 125251 tok/s step 11030/19560 | loss 3.443108 (+1.40z)| norm 0.2385 (-0.27z)| lr 3.49e-04 | 4171.20 ms | 32.4% bf16 MFU | 125273 tok/s step 11031/19560 | loss 3.440498 (+1.31z)| norm 0.2872 (+0.01z)| lr 3.49e-04 | 4178.77 ms | 32.3% bf16 MFU | 125282 tok/s step 11032/19560 | loss 3.395619 (+0.25z)| norm 0.2551 (-0.17z)| lr 3.49e-04 | 4165.73 ms | 32.4% bf16 MFU | 125311 tok/s step 11033/19560 | loss 3.370889 (-0.33z)| norm 0.2726 (-0.08z)| lr 3.49e-04 | 4168.06 ms | 32.4% bf16 MFU | 125335 tok/s step 11034/19560 | loss 3.467162 (+1.90z)| norm 0.2578 (-0.16z)| lr 3.49e-04 | 4165.05 ms | 32.4% bf16 MFU | 125362 tok/s step 11035/19560 | loss 3.456997 (+1.65z)| norm 0.2644 (-0.12z)| lr 3.49e-04 | 4198.51 ms | 32.2% bf16 MFU | 125338 tok/s step 11036/19560 | loss 3.415054 (+0.68z)| norm 0.2603 (-0.15z)| lr 3.49e-04 | 4154.58 ms | 32.5% bf16 MFU | 125381 tok/s step 11037/19560 | loss 3.416578 (+0.70z)| norm 0.2574 (-0.16z)| lr 3.49e-04 | 4154.76 ms | 32.5% bf16 MFU | 125421 tok/s step 11038/19560 | loss 3.453627 (+1.54z)| norm 0.2492 (-0.21z)| lr 3.49e-04 | 4159.24 ms | 32.5% bf16 MFU | 125453 tok/s step 11039/19560 | loss 3.409096 (+0.51z)| norm 0.2734 (-0.07z)| lr 3.49e-04 | 4170.82 ms | 32.4% bf16 MFU | 125465 tok/s step 11040/19560 | loss 3.439573 (+1.20z)| norm 0.2420 (-0.25z)| lr 3.49e-04 | 4165.71 ms | 32.4% bf16 MFU | 125485 tok/s step 11041/19560 | loss 3.413352 (+0.60z)| norm 0.2574 (-0.17z)| lr 3.49e-04 | 4167.79 ms | 32.4% bf16 MFU | 125500 tok/s step 11042/19560 | loss 3.410329 (+0.52z)| norm 0.2480 (-0.22z)| lr 3.49e-04 | 4160.49 ms | 32.5% bf16 MFU | 125526 tok/s step 11043/19560 | loss 3.352712 (-0.80z)| norm 0.2486 (-0.21z)| lr 3.49e-04 | 4160.19 ms | 32.5% bf16 MFU | 125551 tok/s step 11044/19560 | loss 3.360646 
(-0.62z)| norm 0.2513 (-0.20z)| lr 3.48e-04 | 4158.37 ms | 32.5% bf16 MFU | 125578 tok/s step 11045/19560 | loss 3.446088 (+1.32z)| norm 0.2579 (-0.16z)| lr 3.48e-04 | 4157.36 ms | 32.5% bf16 MFU | 125604 tok/s step 11046/19560 | loss 3.371427 (-0.38z)| norm 0.2552 (-0.18z)| lr 3.48e-04 | 4161.02 ms | 32.4% bf16 MFU | 125624 tok/s step 11047/19560 | loss 3.342368 (-1.05z)| norm 0.2469 (-0.22z)| lr 3.48e-04 | 4160.28 ms | 32.5% bf16 MFU | 125644 tok/s step 11048/19560 | loss 3.362085 (-0.60z)| norm 0.2615 (-0.14z)| lr 3.48e-04 | 4169.92 ms | 32.4% bf16 MFU | 125648 tok/s step 11049/19560 | loss 3.401161 (+0.29z)| norm 0.2348 (-0.29z)| lr 3.48e-04 | 4160.56 ms | 32.5% bf16 MFU | 125667 tok/s step 11050/19560 | loss 3.360769 (-0.64z)| norm 0.2833 (-0.02z)| lr 3.48e-04 | 4163.30 ms | 32.4% bf16 MFU | 125680 tok/s step 11051/19560 | loss 3.415792 (+0.61z)| norm 0.2454 (-0.23z)| lr 3.48e-04 | 4167.53 ms | 32.4% bf16 MFU | 125686 tok/s step 11052/19560 | loss 3.476911 (+1.97z)| norm 0.2669 (-0.11z)| lr 3.48e-04 | 4158.97 ms | 32.5% bf16 MFU | 125705 tok/s step 11053/19560 | loss 3.427178 (+0.83z)| norm 0.2717 (-0.08z)| lr 3.48e-04 | 4168.12 ms | 32.4% bf16 MFU | 125709 tok/s step 11054/19560 | loss 3.465243 (+1.67z)| norm 0.2733 (-0.08z)| lr 3.48e-04 | 4169.09 ms | 32.4% bf16 MFU | 125711 tok/s step 11055/19560 | loss 3.400325 (+0.20z)| norm 0.2788 (-0.05z)| lr 3.48e-04 | 4164.41 ms | 32.4% bf16 MFU | 125720 tok/s step 11056/19560 | loss 3.387497 (-0.10z)| norm 0.2687 (-0.10z)| lr 3.48e-04 | 4166.58 ms | 32.4% bf16 MFU | 125726 tok/s step 11057/19560 | loss 3.413466 (+0.48z)| norm 0.2604 (-0.15z)| lr 3.48e-04 | 4156.91 ms | 32.5% bf16 MFU | 125746 tok/s step 11058/19560 | loss 3.459735 (+1.51z)| norm 0.2419 (-0.25z)| lr 3.48e-04 | 4168.91 ms | 32.4% bf16 MFU | 125747 tok/s step 11059/19560 | loss 3.392204 (-0.03z)| norm 0.2570 (-0.17z)| lr 3.47e-04 | 4209.46 ms | 32.1% bf16 MFU | 125687 tok/s step 11060/19560 | loss 3.430746 (+0.85z)| norm 0.2625 (-0.14z)| lr 3.47e-04 | 
4164.06 ms | 32.4% bf16 MFU | 125698 tok/s step 11061/19560 | loss 3.375717 (-0.44z)| norm 0.2418 (-0.25z)| lr 3.47e-04 | 4164.76 ms | 32.4% bf16 MFU | 125707 tok/s step 11062/19560 | loss 3.383159 (-0.28z)| norm 0.2798 (-0.04z)| lr 3.47e-04 | 4168.39 ms | 32.4% bf16 MFU | 125711 tok/s step 11063/19560 | loss 3.428607 (+0.78z)| norm 0.2404 (-0.26z)| lr 3.47e-04 | 4161.45 ms | 32.4% bf16 MFU | 125725 tok/s step 11064/19560 | loss 3.382357 (-0.32z)| norm 0.2672 (-0.10z)| lr 3.47e-04 | 4170.32 ms | 32.4% bf16 MFU | 125724 tok/s step 11065/19560 | loss 3.383850 (-0.30z)| norm 0.2635 (-0.12z)| lr 3.47e-04 | 4159.53 ms | 32.5% bf16 MFU | 125740 tok/s step 11066/19560 | loss 3.415117 (+0.44z)| norm 0.2604 (-0.14z)| lr 3.47e-04 | 4194.81 ms | 32.2% bf16 MFU | 125703 tok/s step 11067/19560 | loss 3.442260 (+1.08z)| norm 0.2686 (-0.09z)| lr 3.47e-04 | 4165.06 ms | 32.4% bf16 MFU | 125711 tok/s step 11068/19560 | loss 3.334258 (-1.50z)| norm 0.2639 (-0.12z)| lr 3.47e-04 | 4170.80 ms | 32.4% bf16 MFU | 125711 tok/s step 11069/19560 | loss 3.398921 (+0.05z)| norm 0.2396 (-0.25z)| lr 3.47e-04 | 4159.96 ms | 32.5% bf16 MFU | 125727 tok/s step 11070/19560 | loss 3.427724 (+0.73z)| norm 0.2656 (-0.11z)| lr 3.47e-04 | 4165.05 ms | 32.4% bf16 MFU | 125735 tok/s step 11071/19560 | loss 3.388214 (-0.23z)| norm 0.2507 (-0.19z)| lr 3.47e-04 | 4168.66 ms | 32.4% bf16 MFU | 125736 tok/s step 11072/19560 | loss 3.388352 (-0.22z)| norm 0.2572 (-0.15z)| lr 3.47e-04 | 4164.85 ms | 32.4% bf16 MFU | 125744 tok/s step 11073/19560 | loss 3.396825 (-0.03z)| norm 0.2614 (-0.13z)| lr 3.47e-04 | 4169.05 ms | 32.4% bf16 MFU | 125744 tok/s step 11074/19560 | loss 3.404814 (+0.16z)| norm 0.2622 (-0.12z)| lr 3.46e-04 | 4168.18 ms | 32.4% bf16 MFU | 125746 tok/s step 11075/19560 | loss 3.405668 (+0.17z)| norm 0.2662 (-0.10z)| lr 3.46e-04 | 4201.48 ms | 32.1% bf16 MFU | 125698 tok/s step 11076/19560 | loss 3.385917 (-0.34z)| norm 0.2518 (-0.19z)| lr 3.46e-04 | 4164.15 ms | 32.4% bf16 MFU | 125709 tok/s step 
11077/19560 | loss 3.451086 (+1.31z)| norm 0.2636 (-0.12z)| lr 3.46e-04 | 4164.60 ms | 32.4% bf16 MFU | 125718 tok/s step 11078/19560 | loss 3.390244 (-0.23z)| norm 0.2447 (-0.21z)| lr 3.46e-04 | 4162.93 ms | 32.4% bf16 MFU | 125729 tok/s step 11079/19560 | loss 3.429890 (+0.77z)| norm 0.2360 (-0.27z)| lr 3.46e-04 | 4169.25 ms | 32.4% bf16 MFU | 125730 tok/s step 11080/19560 | loss 3.407419 (+0.19z)| norm 0.2464 (-0.19z)| lr 3.46e-04 | 4168.31 ms | 32.4% bf16 MFU | 125733 tok/s step 11081/19560 | loss 3.354967 (-1.14z)| norm 0.2357 (-0.26z)| lr 3.46e-04 | 4161.40 ms | 32.4% bf16 MFU | 125745 tok/s step 11082/19560 | loss 3.442679 (+1.08z)| norm 0.2646 (-0.06z)| lr 3.46e-04 | 4158.79 ms | 32.5% bf16 MFU | 125761 tok/s step 11083/19560 | loss 3.391036 (-0.26z)| norm 0.2383 (-0.24z)| lr 3.46e-04 | 4166.40 ms | 32.4% bf16 MFU | 125765 tok/s step 11084/19560 | loss 3.486734 (+2.17z)| norm 0.2599 (-0.09z)| lr 3.46e-04 | 4161.17 ms | 32.4% bf16 MFU | 125777 tok/s step 11085/19560 | loss 3.373026 (-0.74z)| norm 0.2619 (-0.08z)| lr 3.46e-04 | 4167.28 ms | 32.4% bf16 MFU | 125778 tok/s step 11086/19560 | loss 3.404438 (+0.07z)| norm 0.2650 (-0.06z)| lr 3.46e-04 | 4161.87 ms | 32.4% bf16 MFU | 125788 tok/s step 11087/19560 | loss 3.379138 (-0.58z)| norm 0.2246 (-1.41z)| lr 3.46e-04 | 4163.66 ms | 32.4% bf16 MFU | 125795 tok/s step 11088/19560 | loss 3.423574 (+0.55z)| norm 0.2547 (-0.20z)| lr 3.46e-04 | 4164.70 ms | 32.4% bf16 MFU | 125800 tok/s step 11089/19560 | loss 3.416183 (+0.36z)| norm 0.2383 (-0.92z)| lr 3.45e-04 | 4165.19 ms | 32.4% bf16 MFU | 125803 tok/s step 11090/19560 | loss 3.390455 (-0.31z)| norm 0.2452 (-0.69z)| lr 3.45e-04 | 4175.66 ms | 32.3% bf16 MFU | 125791 tok/s step 11091/19560 | loss 3.397788 (-0.12z)| norm 0.2658 (+0.50z)| lr 3.45e-04 | 4172.82 ms | 32.4% bf16 MFU | 125784 tok/s step 11092/19560 | loss 3.427187 (+0.62z)| norm 0.2792 (+1.60z)| lr 3.45e-04 | 4183.84 ms | 32.3% bf16 MFU | 125760 tok/s step 11093/19560 | loss 3.415428 (+0.31z)| norm 
0.2533 (-0.21z)| lr 3.45e-04 | 4173.94 ms | 32.3% bf16 MFU | 125753 tok/s step 11094/19560 | loss 3.410163 (+0.17z)| norm 0.2532 (-0.20z)| lr 3.45e-04 | 4161.51 ms | 32.4% bf16 MFU | 125764 tok/s step 11095/19560 | loss 3.385729 (-0.47z)| norm 0.2517 (-0.30z)| lr 3.45e-04 | 4157.94 ms | 32.5% bf16 MFU | 125781 tok/s step 11096/19560 | loss 3.358454 (-1.19z)| norm 0.2431 (-0.92z)| lr 3.45e-04 | 4166.17 ms | 32.4% bf16 MFU | 125784 tok/s step 11097/19560 | loss 3.459099 (+1.44z)| norm 0.2355 (-1.46z)| lr 3.45e-04 | 4161.80 ms | 32.4% bf16 MFU | 125793 tok/s step 11098/19560 | loss 3.435437 (+0.82z)| norm 0.2493 (-0.44z)| lr 3.45e-04 | 4159.23 ms | 32.5% bf16 MFU | 125806 tok/s step 11099/19560 | loss 3.512136 (+2.72z)| norm 0.2747 (+1.41z)| lr 3.45e-04 | 4169.86 ms | 32.4% bf16 MFU | 125803 tok/s step 11100/19560 | loss 3.361622 (-1.09z)| norm 0.2883 (+2.34z)| lr 3.45e-04 | 4164.83 ms | 32.4% bf16 MFU | 125807 tok/s step 11101/19560 | loss 3.410194 (+0.12z)| norm 0.2927 (+2.56z)| lr 3.45e-04 | 4165.43 ms | 32.4% bf16 MFU | 125810 tok/s step 11102/19560 | loss 3.383759 (-0.55z)| norm 0.2725 (+1.14z)| lr 3.45e-04 | 4163.19 ms | 32.4% bf16 MFU | 125816 tok/s step 11103/19560 | loss 3.404980 (-0.01z)| norm 0.2694 (+0.91z)| lr 3.45e-04 | 4165.71 ms | 32.4% bf16 MFU | 125818 tok/s step 11104/19560 | loss 3.354543 (-1.31z)| norm 0.2679 (+0.83z)| lr 3.44e-04 | 4209.33 ms | 32.1% bf16 MFU | 125755 tok/s step 11105/19560 | loss 3.371417 (-0.87z)| norm 0.2498 (-0.43z)| lr 3.44e-04 | 4177.34 ms | 32.3% bf16 MFU | 125743 tok/s step 11106/19560 | loss 3.457397 (+1.33z)| norm 0.2898 (+2.31z)| lr 3.44e-04 | 4164.77 ms | 32.4% bf16 MFU | 125750 tok/s step 11107/19560 | loss 3.379831 (-0.65z)| norm 0.2596 (+0.24z)| lr 3.44e-04 | 4164.22 ms | 32.4% bf16 MFU | 125757 tok/s step 11108/19560 | loss 3.405802 (-0.00z)| norm 0.2548 (-0.09z)| lr 3.44e-04 | 4164.12 ms | 32.4% bf16 MFU | 125765 tok/s step 11109/19560 | loss 3.372880 (-0.88z)| norm 0.2711 (+1.02z)| lr 3.44e-04 | 4254.44 ms | 
31.7% bf16 MFU | 125638 tok/s step 11110/19560 | loss 3.439715 (+0.87z)| norm 0.4126 (+7.73z)| lr 3.44e-04 | 4172.39 ms | 32.4% bf16 MFU | 125639 tok/s step 11111/19560 | loss 3.450211 (+1.13z)| norm 0.3143 (+2.72z)| lr 3.44e-04 | 4239.74 ms | 31.8% bf16 MFU | 125540 tok/s step 11112/19560 | loss 3.410097 (+0.08z)| norm 0.2580 (+0.00z)| lr 3.44e-04 | 4168.25 ms | 32.4% bf16 MFU | 125552 tok/s step 11113/19560 | loss 3.426330 (+0.50z)| norm 0.2944 (+1.73z)| lr 3.44e-04 | 4166.04 ms | 32.4% bf16 MFU | 125567 tok/s step 11114/19560 | loss 3.409755 (+0.06z)| norm 0.2713 (+0.62z)| lr 3.44e-04 | 4159.33 ms | 32.5% bf16 MFU | 125591 tok/s step 11115/19560 | loss 3.405023 (-0.07z)| norm 0.2861 (+1.31z)| lr 3.44e-04 | 4165.46 ms | 32.4% bf16 MFU | 125605 tok/s step 11116/19560 | loss 3.425534 (+0.47z)| norm 0.2466 (-0.56z)| lr 3.44e-04 | 4162.32 ms | 32.4% bf16 MFU | 125623 tok/s step 11117/19560 | loss 3.388746 (-0.48z)| norm 0.2994 (+1.90z)| lr 3.44e-04 | 4162.88 ms | 32.4% bf16 MFU | 125639 tok/s step 11118/19560 | loss 3.457877 (+1.31z)| norm 0.2419 (-0.78z)| lr 3.44e-04 | 4168.18 ms | 32.4% bf16 MFU | 125646 tok/s step 11119/19560 | loss 3.425323 (+0.45z)| norm 0.2493 (-0.44z)| lr 3.43e-04 | 4172.46 ms | 32.4% bf16 MFU | 125646 tok/s step 11120/19560 | loss 3.409642 (+0.06z)| norm 0.2687 (+0.47z)| lr 3.43e-04 | 4161.86 ms | 32.4% bf16 MFU | 125663 tok/s step 11121/19560 | loss 3.430208 (+0.60z)| norm 0.2393 (-0.92z)| lr 3.43e-04 | 4162.94 ms | 32.4% bf16 MFU | 125677 tok/s step 11122/19560 | loss 3.374249 (-0.88z)| norm 0.2510 (-0.37z)| lr 3.43e-04 | 4161.10 ms | 32.4% bf16 MFU | 125693 tok/s step 11123/19560 | loss 3.395960 (-0.27z)| norm 0.2368 (-1.03z)| lr 3.43e-04 | 4156.39 ms | 32.5% bf16 MFU | 125715 tok/s step 11124/19560 | loss 3.537001 (+3.44z)| norm 0.2519 (-0.32z)| lr 3.43e-04 | 4165.13 ms | 32.4% bf16 MFU | 125723 tok/s step 11125/19560 | loss 3.417015 (+0.26z)| norm 0.2396 (-0.89z)| lr 3.43e-04 | 4174.34 ms | 32.3% bf16 MFU | 125717 tok/s step 11126/19560 
| loss 3.384065 (-0.60z)| norm 0.2398 (-0.86z)| lr 3.43e-04 | 4167.22 ms | 32.4% bf16 MFU | 125722 tok/s step 11127/19560 | loss 3.393054 (-0.36z)| norm 0.2426 (-0.72z)| lr 3.43e-04 | 4156.40 ms | 32.5% bf16 MFU | 125743 tok/s step 11128/19560 | loss 3.419728 (+0.34z)| norm 0.2474 (-0.50z)| lr 3.43e-04 | 4162.93 ms | 32.4% bf16 MFU | 125753 tok/s step 11129/19560 | loss 3.390330 (-0.44z)| norm 0.2440 (-0.65z)| lr 3.43e-04 | 4170.64 ms | 32.4% bf16 MFU | 125750 tok/s step 11130/19560 | loss 3.390532 (-0.45z)| norm 0.2278 (-1.40z)| lr 3.43e-04 | 4163.84 ms | 32.4% bf16 MFU | 125759 tok/s step 11131/19560 | loss 3.407410 (-0.00z)| norm 0.2468 (-0.50z)| lr 3.43e-04 | 4170.40 ms | 32.4% bf16 MFU | 125757 tok/s step 11132/19560 | loss 3.391787 (-0.42z)| norm 0.2495 (-0.38z)| lr 3.43e-04 | 4190.89 ms | 32.2% bf16 MFU | 125724 tok/s step 11133/19560 | loss 3.438678 (+0.83z)| norm 0.2523 (-0.26z)| lr 3.43e-04 | 4147.34 ms | 32.6% bf16 MFU | 125758 tok/s step 11134/19560 | loss 3.397111 (-0.30z)| norm 0.2385 (-0.90z)| lr 3.42e-04 | 4156.84 ms | 32.5% bf16 MFU | 125777 tok/s step 11135/19560 | loss 3.445080 (+1.00z)| norm 0.2411 (-0.78z)| lr 3.42e-04 | 4146.09 ms | 32.6% bf16 MFU | 125811 tok/s step 11136/19560 | loss 3.409988 (+0.05z)| norm 0.2282 (-1.37z)| lr 3.42e-04 | 4149.90 ms | 32.5% bf16 MFU | 125837 tok/s step 11137/19560 | loss 3.424417 (+0.43z)| norm 0.2562 (-0.06z)| lr 3.42e-04 | 4154.88 ms | 32.5% bf16 MFU | 125854 tok/s step 11138/19560 | loss 3.318279 (-2.41z)| norm 0.2446 (-0.61z)| lr 3.42e-04 | 4151.60 ms | 32.5% bf16 MFU | 125876 tok/s step 11139/19560 | loss 3.397854 (-0.26z)| norm 0.2423 (-0.71z)| lr 3.42e-04 | 4154.32 ms | 32.5% bf16 MFU | 125892 tok/s step 11140/19560 | loss 3.409560 (+0.06z)| norm 0.2436 (-0.65z)| lr 3.42e-04 | 4160.00 ms | 32.5% bf16 MFU | 125899 tok/s step 11141/19560 | loss 3.387224 (-0.59z)| norm 0.2615 (+0.19z)| lr 3.42e-04 | 4159.99 ms | 32.5% bf16 MFU | 125906 tok/s step 11142/19560 | loss 3.400322 (-0.21z)| norm 0.2646 (+0.33z)| 
lr 3.42e-04 | 4159.41 ms | 32.5% bf16 MFU | 125913 tok/s step 11143/19560 | loss 3.392158 (-0.44z)| norm 0.2325 (-1.18z)| lr 3.42e-04 | 4171.72 ms | 32.4% bf16 MFU | 125901 tok/s step 11144/19560 | loss 3.392394 (-0.44z)| norm 0.2498 (-0.37z)| lr 3.42e-04 | 4204.31 ms | 32.1% bf16 MFU | 125841 tok/s step 11145/19560 | loss 3.333132 (-2.08z)| norm 0.2336 (-1.13z)| lr 3.42e-04 | 4147.64 ms | 32.6% bf16 MFU | 125869 tok/s step 11146/19560 | loss 3.403508 (-0.11z)| norm 0.2678 (+0.48z)| lr 3.42e-04 | 4160.39 ms | 32.5% bf16 MFU | 125877 tok/s step 11147/19560 | loss 3.419156 (+0.32z)| norm 0.2347 (-1.09z)| lr 3.42e-04 | 4156.02 ms | 32.5% bf16 MFU | 125891 tok/s step 11148/19560 | loss 3.420497 (+0.35z)| norm 0.2574 (-0.02z)| lr 3.42e-04 | 4157.13 ms | 32.5% bf16 MFU | 125902 tok/s step 11149/19560 | loss 3.373010 (-1.00z)| norm 0.2678 (+0.46z)| lr 3.41e-04 | 4164.16 ms | 32.4% bf16 MFU | 125902 tok/s step 11150/19560 | loss 3.391046 (-0.48z)| norm 0.2738 (+0.74z)| lr 3.41e-04 | 4149.44 ms | 32.5% bf16 MFU | 125925 tok/s step 11151/19560 | loss 3.454820 (+1.35z)| norm 0.2716 (+0.63z)| lr 3.41e-04 | 4153.73 ms | 32.5% bf16 MFU | 125939 tok/s step 11152/19560 | loss 3.380961 (-0.77z)| norm 0.2446 (-0.65z)| lr 3.41e-04 | 4155.20 ms | 32.5% bf16 MFU | 125951 tok/s step 11153/19560 | loss 3.387217 (-0.58z)| norm 0.2782 (+0.94z)| lr 3.41e-04 | 4164.46 ms | 32.4% bf16 MFU | 125949 tok/s step 11154/19560 | loss 3.335773 (-2.01z)| norm 0.2337 (-1.15z)| lr 3.41e-04 | 4148.19 ms | 32.5% bf16 MFU | 125971 tok/s step 11155/19560 | loss 3.402881 (-0.12z)| norm 0.2658 (+0.37z)| lr 3.41e-04 | 4160.13 ms | 32.5% bf16 MFU | 125973 tok/s step 11156/19560 | loss 3.445076 (+1.07z)| norm 0.2447 (-0.63z)| lr 3.41e-04 | 4159.95 ms | 32.5% bf16 MFU | 125976 tok/s step 11157/19560 | loss 3.392409 (-0.42z)| norm 0.2514 (-0.31z)| lr 3.41e-04 | 4152.33 ms | 32.5% bf16 MFU | 125991 tok/s step 11158/19560 | loss 3.378067 (-0.82z)| norm 0.2544 (-0.17z)| lr 3.41e-04 | 4165.00 ms | 32.4% bf16 MFU | 
125985 tok/s step 11159/19560 | loss 3.411846 (+0.16z)| norm 0.2413 (-0.78z)| lr 3.41e-04 | 4162.63 ms | 32.4% bf16 MFU | 125983 tok/s step 11160/19560 | loss 3.386163 (-0.58z)| norm 0.2467 (-0.52z)| lr 3.41e-04 | 4150.14 ms | 32.5% bf16 MFU | 126001 tok/s step 11161/19560 | loss 3.412817 (+0.18z)| norm 0.2374 (-0.95z)| lr 3.41e-04 | 4163.12 ms | 32.4% bf16 MFU | 125998 tok/s step 11162/19560 | loss 3.370300 (-1.04z)| norm 0.2566 (-0.03z)| lr 3.41e-04 | 4152.68 ms | 32.5% bf16 MFU | 126010 tok/s step 11163/19560 | loss 3.393217 (-0.36z)| norm 0.2550 (-0.11z)| lr 3.41e-04 | 4167.26 ms | 32.4% bf16 MFU | 126000 tok/s step 11164/19560 | loss 3.436643 (+0.91z)| norm 0.2560 (-0.06z)| lr 3.40e-04 | 4156.12 ms | 32.5% bf16 MFU | 126008 tok/s step 11165/19560 | loss 3.381981 (-0.68z)| norm 0.2654 (+0.39z)| lr 3.40e-04 | 4159.35 ms | 32.5% bf16 MFU | 126010 tok/s step 11166/19560 | loss 3.451957 (+1.37z)| norm 0.2536 (-0.18z)| lr 3.40e-04 | 4492.45 ms | 30.1% bf16 MFU | 125545 tok/s step 11167/19560 | loss 3.407281 (+0.06z)| norm 0.2531 (-0.19z)| lr 3.40e-04 | 4407.69 ms | 30.6% bf16 MFU | 125215 tok/s step 11168/19560 | loss 3.362069 (-1.25z)| norm 0.2523 (-0.24z)| lr 3.40e-04 | 4325.35 ms | 31.2% bf16 MFU | 125015 tok/s step 11169/19560 | loss 3.435816 (+0.90z)| norm 0.2594 (+0.10z)| lr 3.40e-04 | 4346.31 ms | 31.1% bf16 MFU | 124795 tok/s step 11170/19560 | loss 3.368082 (-1.06z)| norm 0.2713 (+0.66z)| lr 3.40e-04 | 4419.25 ms | 30.6% bf16 MFU | 124487 tok/s step 11171/19560 | loss 3.445333 (+1.17z)| norm 0.2590 (+0.07z)| lr 3.40e-04 | 4268.54 ms | 31.6% bf16 MFU | 124404 tok/s step 11172/19560 | loss 3.371134 (-1.00z)| norm 0.2646 (+0.34z)| lr 3.40e-04 | 4203.56 ms | 32.1% bf16 MFU | 124420 tok/s step 11173/19560 | loss 3.397748 (-0.21z)| norm 0.2529 (-0.22z)| lr 3.40e-04 | 4233.48 ms | 31.9% bf16 MFU | 124392 tok/s step 11174/19560 | loss 3.365700 (-1.15z)| norm 0.2518 (-0.27z)| lr 3.40e-04 | 4286.37 ms | 31.5% bf16 MFU | 124288 tok/s step 11175/19560 | loss 3.391869 
(-0.40z)| norm 0.2435 (-0.67z)| lr 3.40e-04 | 4221.67 ms | 32.0% bf16 MFU | 124283 tok/s step 11176/19560 | loss 3.438891 (+0.98z)| norm 0.2590 (+0.07z)| lr 3.40e-04 | 4204.96 ms | 32.1% bf16 MFU | 124303 tok/s step 11177/19560 | loss 3.380216 (-0.76z)| norm 0.2509 (-0.32z)| lr 3.40e-04 | 4306.34 ms | 31.4% bf16 MFU | 124175 tok/s step 11178/19560 | loss 3.476959 (+2.07z)| norm 0.2462 (-0.54z)| lr 3.40e-04 | 4149.39 ms | 32.5% bf16 MFU | 124284 tok/s step 11179/19560 | loss 3.371524 (-1.02z)| norm 0.2639 (+0.31z)| lr 3.39e-04 | 4160.98 ms | 32.4% bf16 MFU | 124370 tok/s step 11180/19560 | loss 3.349410 (-1.65z)| norm 0.2498 (-0.36z)| lr 3.39e-04 | 4235.71 ms | 31.9% bf16 MFU | 124340 tok/s step 11181/19560 | loss 3.362007 (-1.26z)| norm 0.2730 (+0.76z)| lr 3.39e-04 | 4150.98 ms | 32.5% bf16 MFU | 124438 tok/s step 11182/19560 | loss 3.412668 (+0.25z)| norm 0.2456 (-0.55z)| lr 3.39e-04 | 4191.16 ms | 32.2% bf16 MFU | 124471 tok/s step 11183/19560 | loss 3.447758 (+1.27z)| norm 0.2743 (+0.83z)| lr 3.39e-04 | 4147.95 ms | 32.6% bf16 MFU | 124568 tok/s step 11184/19560 | loss 3.431120 (+0.77z)| norm 0.2787 (+1.04z)| lr 3.39e-04 | 4168.32 ms | 32.4% bf16 MFU | 124628 tok/s step 11185/19560 | loss 3.340424 (-1.87z)| norm 0.2609 (+0.18z)| lr 3.39e-04 | 4176.11 ms | 32.3% bf16 MFU | 124674 tok/s step 11186/19560 | loss 3.309689 (-2.68z)| norm 0.2723 (+0.72z)| lr 3.39e-04 | 4174.30 ms | 32.3% bf16 MFU | 124720 tok/s step 11187/19560 | loss 3.386307 (-0.49z)| norm 0.2537 (-0.18z)| lr 3.39e-04 | 4150.02 ms | 32.5% bf16 MFU | 124801 tok/s step 11188/19560 | loss 3.454914 (+1.46z)| norm 0.2562 (-0.06z)| lr 3.39e-04 | 4151.24 ms | 32.5% bf16 MFU | 124876 tok/s step 11189/19560 | loss 3.341845 (-1.73z)| norm 0.2652 (+0.37z)| lr 3.39e-04 | 4214.07 ms | 32.0% bf16 MFU | 124853 tok/s step 11190/19560 | loss 3.353939 (-1.37z)| norm 0.2531 (-0.21z)| lr 3.39e-04 | 4196.30 ms | 32.2% bf16 MFU | 124857 tok/s step 11191/19560 | loss 3.355724 (-1.30z)| norm 0.2617 (+0.21z)| lr 3.39e-04 | 
4152.74 ms | 32.5% bf16 MFU | 124927 tok/s step 11192/19560 | loss 3.357405 (-1.24z)| norm 0.2263 (-1.49z)| lr 3.39e-04 | 4253.04 ms | 31.7% bf16 MFU | 124844 tok/s step 11193/19560 | loss 3.311357 (-2.44z)| norm 0.2655 (+0.40z)| lr 3.38e-04 | 4153.14 ms | 32.5% bf16 MFU | 124914 tok/s step 11194/19560 | loss 3.398818 (-0.07z)| norm 0.2303 (-1.28z)| lr 3.38e-04 | 4151.36 ms | 32.5% bf16 MFU | 124983 tok/s step 11195/19560 | loss 3.498160 (+2.55z)| norm 0.2885 (+1.50z)| lr 3.38e-04 | 4167.42 ms | 32.4% bf16 MFU | 125024 tok/s step 11196/19560 | loss 3.372487 (-0.80z)| norm 0.2611 (+0.19z)| lr 3.38e-04 | 4153.89 ms | 32.5% bf16 MFU | 125084 tok/s step 11197/19560 | loss 3.336555 (-1.73z)| norm 0.2960 (+1.81z)| lr 3.38e-04 | 4189.59 ms | 32.2% bf16 MFU | 125086 tok/s step 11198/19560 | loss 3.417389 (+0.41z)| norm 0.2603 (+0.13z)| lr 3.38e-04 | 4188.79 ms | 32.2% bf16 MFU | 125090 tok/s step 11199/19560 | loss 3.447089 (+1.18z)| norm 0.3040 (+2.14z)| lr 3.38e-04 | 4151.83 ms | 32.5% bf16 MFU | 125150 tok/s step 11200/19560 | loss 3.385038 (-0.45z)| norm 0.2878 (+1.36z)| lr 3.38e-04 | 4149.35 ms | 32.5% bf16 MFU | 125210 tok/s step 11201/19560 | loss 3.430066 (+0.73z)| norm 0.2749 (+0.77z)| lr 3.38e-04 | 4150.66 ms | 32.5% bf16 MFU | 125265 tok/s step 11202/19560 | loss 3.410265 (+0.21z)| norm 0.2963 (+1.71z)| lr 3.38e-04 | 4152.15 ms | 32.5% bf16 MFU | 125315 tok/s step 11203/19560 | loss 3.350481 (-1.34z)| norm 0.2433 (-0.68z)| lr 3.38e-04 | 4149.95 ms | 32.5% bf16 MFU | 125366 tok/s step 11204/19560 | loss 3.348206 (-1.39z)| norm 0.2810 (+1.01z)| lr 3.38e-04 | 4154.55 ms | 32.5% bf16 MFU | 125408 tok/s step 11205/19560 | loss 3.392145 (-0.24z)| norm 0.2577 (-0.04z)| lr 3.38e-04 | 4149.19 ms | 32.5% bf16 MFU | 125455 tok/s step 11206/19560 | loss 3.416421 (+0.39z)| norm 0.2537 (-0.22z)| lr 3.38e-04 | 4151.75 ms | 32.5% bf16 MFU | 125497 tok/s step 11207/19560 | loss 3.358794 (-1.10z)| norm 0.2660 (+0.33z)| lr 3.38e-04 | 4157.86 ms | 32.5% bf16 MFU | 125527 tok/s step 
11208/19560 | loss 3.417265 (+0.42z)| norm 0.2521 (-0.31z)| lr 3.37e-04 | 4149.31 ms | 32.5% bf16 MFU | 125568 tok/s step 11209/19560 | loss 3.370994 (-0.79z)| norm 0.2540 (-0.23z)| lr 3.37e-04 | 4152.02 ms | 32.5% bf16 MFU | 125603 tok/s step 11210/19560 | loss 3.410349 (+0.25z)| norm 0.2482 (-0.49z)| lr 3.37e-04 | 4150.99 ms | 32.5% bf16 MFU | 125638 tok/s step 11211/19560 | loss 3.396029 (-0.13z)| norm 0.2674 (+0.38z)| lr 3.37e-04 | 4151.24 ms | 32.5% bf16 MFU | 125671 tok/s step 11212/19560 | loss 3.363563 (-0.97z)| norm 0.2592 (+0.01z)| lr 3.37e-04 | 4149.77 ms | 32.5% bf16 MFU | 125705 tok/s step 11213/19560 | loss 3.372630 (-0.73z)| norm 0.2765 (+0.79z)| lr 3.37e-04 | 4154.74 ms | 32.5% bf16 MFU | 125729 tok/s step 11214/19560 | loss 3.417564 (+0.47z)| norm 0.2574 (-0.08z)| lr 3.37e-04 | 4155.82 ms | 32.5% bf16 MFU | 125750 tok/s step 11215/19560 | loss 3.332075 (-1.78z)| norm 0.3123 (+2.36z)| lr 3.37e-04 | 4152.64 ms | 32.5% bf16 MFU | 125776 tok/s step 11216/19560 | loss 3.362922 (-0.95z)| norm 0.2663 (+0.29z)| lr 3.37e-04 | 4148.97 ms | 32.5% bf16 MFU | 125805 tok/s step 11217/19560 | loss 3.386621 (-0.32z)| norm 0.2853 (+1.12z)| lr 3.37e-04 | 4173.61 ms | 32.4% bf16 MFU | 125796 tok/s step 11218/19560 | loss 3.390121 (-0.23z)| norm 0.2823 (+0.97z)| lr 3.37e-04 | 4170.06 ms | 32.4% bf16 MFU | 125792 tok/s step 11219/19560 | loss 3.383984 (-0.39z)| norm 0.2667 (+0.27z)| lr 3.37e-04 | 4149.92 ms | 32.5% bf16 MFU | 125820 tok/s step 11220/19560 | loss 3.389898 (-0.23z)| norm 0.2695 (+0.40z)| lr 3.37e-04 | 4151.29 ms | 32.5% bf16 MFU | 125843 tok/s step 11221/19560 | loss 3.337096 (-1.59z)| norm 0.2628 (+0.10z)| lr 3.37e-04 | 4151.25 ms | 32.5% bf16 MFU | 125866 tok/s step 11222/19560 | loss 3.421476 (+0.61z)| norm 0.2752 (+0.65z)| lr 3.37e-04 | 4150.64 ms | 32.5% bf16 MFU | 125889 tok/s step 11223/19560 | loss 3.368088 (-0.77z)| norm 0.2601 (-0.03z)| lr 3.36e-04 | 4156.11 ms | 32.5% bf16 MFU | 125902 tok/s step 11224/19560 | loss 3.409122 (+0.28z)| norm 
0.2517 (-0.41z)| lr 3.36e-04 | 4150.13 ms | 32.5% bf16 MFU | 125923 tok/s step 11225/19560 | loss 3.422307 (+0.64z)| norm 0.2865 (+1.14z)| lr 3.36e-04 | 4182.63 ms | 32.3% bf16 MFU | 125894 tok/s step 11226/19560 | loss 3.352379 (-1.18z)| norm 0.2422 (-0.86z)| lr 3.36e-04 | 4152.34 ms | 32.5% bf16 MFU | 125913 tok/s step 11227/19560 | loss 3.360728 (-0.96z)| norm 0.2909 (+1.32z)| lr 3.36e-04 | 4235.60 ms | 31.9% bf16 MFU | 125806 tok/s step 11228/19560 | loss 3.419455 (+0.62z)| norm 0.2719 (+0.48z)| lr 3.36e-04 | 4153.74 ms | 32.5% bf16 MFU | 125827 tok/s step 11229/19560 | loss 3.601398 (+4.97z)| norm 0.2792 (+0.82z)| lr 3.36e-04 | 4144.71 ms | 32.6% bf16 MFU | 125860 tok/s step 11230/19560 | loss 3.365898 (-0.78z)| norm 0.2765 (+0.69z)| lr 3.36e-04 | 4155.06 ms | 32.5% bf16 MFU | 125876 tok/s step 11231/19560 | loss 3.519871 (+2.85z)| norm 0.3360 (+3.22z)| lr 3.36e-04 | 4149.01 ms | 32.5% bf16 MFU | 125901 tok/s step 11232/19560 | loss 3.402412 (+0.07z)| norm 0.2763 (+0.63z)| lr 3.36e-04 | 4151.17 ms | 32.5% bf16 MFU | 125921 tok/s step 11233/19560 | loss 3.370684 (-0.68z)| norm 0.2787 (+0.72z)| lr 3.36e-04 | 4152.00 ms | 32.5% bf16 MFU | 125938 tok/s step 11234/19560 | loss 3.348021 (-1.20z)| norm 0.2784 (+0.72z)| lr 3.36e-04 | 4150.30 ms | 32.5% bf16 MFU | 125958 tok/s step 11235/19560 | loss 3.315747 (-1.92z)| norm 0.2409 (-0.90z)| lr 3.36e-04 | 4152.29 ms | 32.5% bf16 MFU | 125973 tok/s step 11236/19560 | loss 3.396045 (-0.04z)| norm 0.2701 (+0.36z)| lr 3.36e-04 | 4211.08 ms | 32.1% bf16 MFU | 125899 tok/s step 11237/19560 | loss 3.313720 (-1.93z)| norm 0.2603 (-0.06z)| lr 3.36e-04 | 4150.42 ms | 32.5% bf16 MFU | 125921 tok/s step 11238/19560 | loss 3.351180 (-1.05z)| norm 0.2549 (-0.30z)| lr 3.35e-04 | 4149.94 ms | 32.5% bf16 MFU | 125941 tok/s step 11239/19560 | loss 3.313080 (-1.89z)| norm 0.2445 (-0.84z)| lr 3.35e-04 | 4150.86 ms | 32.5% bf16 MFU | 125960 tok/s step 11240/19560 | loss 3.317787 (-1.74z)| norm 0.2518 (-0.44z)| lr 3.35e-04 | 4148.75 ms | 
32.5% bf16 MFU | 125980 tok/s step 11241/19560 | loss 3.374090 (-0.46z)| norm 0.2501 (-0.52z)| lr 3.35e-04 | 4152.34 ms | 32.5% bf16 MFU | 125994 tok/s step 11242/19560 | loss 3.369980 (-0.55z)| norm 0.2420 (-0.95z)| lr 3.35e-04 | 4153.31 ms | 32.5% bf16 MFU | 126006 tok/s step 11243/19560 | loss 3.400756 (+0.15z)| norm 0.2282 (-1.69z)| lr 3.35e-04 | 4150.98 ms | 32.5% bf16 MFU | 126021 tok/s step 11244/19560 | loss 3.394294 (+0.01z)| norm 0.2473 (-0.64z)| lr 3.35e-04 | 4154.45 ms | 32.5% bf16 MFU | 126030 tok/s step 11245/19560 | loss 3.370296 (-0.53z)| norm 0.2262 (-1.78z)| lr 3.35e-04 | 4146.54 ms | 32.6% bf16 MFU | 126051 tok/s step 11246/19560 | loss 3.360012 (-0.75z)| norm 0.2366 (-1.20z)| lr 3.35e-04 | 4150.66 ms | 32.5% bf16 MFU | 126064 tok/s step 11247/19560 | loss 3.367819 (-0.56z)| norm 0.2685 (+0.56z)| lr 3.35e-04 | 4155.75 ms | 32.5% bf16 MFU | 126069 tok/s step 11248/19560 | loss 3.450649 (+1.31z)| norm 0.2548 (-0.20z)| lr 3.35e-04 | 4147.50 ms | 32.6% bf16 MFU | 126086 tok/s step 11249/19560 | loss 3.413240 (+0.47z)| norm 0.2525 (-0.33z)| lr 3.35e-04 | 4153.69 ms | 32.5% bf16 MFU | 126093 tok/s step 11250/19560 | loss 3.400712 (+0.18z)| norm 0.2520 (-0.36z)| lr 3.35e-04 | 4152.11 ms | 32.5% bf16 MFU | 126101 tok/s val loss 3.354897 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 
230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 
evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2918/10042 = 0.290580 step 11251/19560 | loss 3.380911 (-0.27z)| norm 0.2368 (-1.20z)| lr 3.35e-04 | 4153.05 ms | 32.5% bf16 MFU | 126108 tok/s step 11252/19560 | loss 3.375617 (-0.38z)| norm 0.2583 (-0.01z)| lr 3.35e-04 | 4155.08 ms | 32.5% bf16 MFU | 126112 tok/s step 11253/19560 | loss 3.382246 (-0.21z)| norm 0.2384 (-1.11z)| lr 3.34e-04 | 4146.74 ms | 32.6% bf16 MFU | 126128 tok/s step 11254/19560 | loss 3.369868 (-0.51z)| norm 0.2500 (-0.48z)| lr 3.34e-04 | 4152.59 ms | 32.5% bf16 MFU | 126135 tok/s step 11255/19560 | loss 3.401904 (+0.25z)| norm 0.2437 (-0.83z)| lr 3.34e-04 | 4148.93 ms | 32.5% bf16 MFU | 126146 tok/s step 11256/19560 | loss 3.350691 (-0.95z)| norm 0.2637 (+0.28z)| lr 3.34e-04 | 4152.28 ms | 32.5% bf16 MFU | 126152 tok/s step 11257/19560 | loss 
3.405213 (+0.34z)| norm 0.2557 (-0.17z)| lr 3.34e-04 | 4150.85 ms | 32.5% bf16 MFU | 126160 tok/s step 11258/19560 | loss 3.509514 (+2.71z)| norm 0.2387 (-1.14z)| lr 3.34e-04 | 4152.00 ms | 32.5% bf16 MFU | 126166 tok/s step 11259/19560 | loss 3.332801 (-1.33z)| norm 0.2403 (-1.04z)| lr 3.34e-04 | 4151.80 ms | 32.5% bf16 MFU | 126171 tok/s step 11260/19560 | loss 3.330466 (-1.37z)| norm 0.2505 (-0.47z)| lr 3.34e-04 | 4151.45 ms | 32.5% bf16 MFU | 126177 tok/s step 11261/19560 | loss 3.349183 (-0.93z)| norm 0.2376 (-1.18z)| lr 3.34e-04 | 4150.51 ms | 32.5% bf16 MFU | 126184 tok/s step 11262/19560 | loss 3.368354 (-0.49z)| norm 0.2569 (-0.11z)| lr 3.34e-04 | 4155.33 ms | 32.5% bf16 MFU | 126184 tok/s step 11263/19560 | loss 3.494276 (+2.33z)| norm 0.2708 (+0.66z)| lr 3.34e-04 | 4154.07 ms | 32.5% bf16 MFU | 126185 tok/s step 11264/19560 | loss 3.384208 (-0.13z)| norm 0.2643 (+0.28z)| lr 3.34e-04 | 4151.21 ms | 32.5% bf16 MFU | 126191 tok/s step 11265/19560 | loss 3.411088 (+0.48z)| norm 0.2781 (+1.05z)| lr 3.34e-04 | 4155.16 ms | 32.5% bf16 MFU | 126190 tok/s step 11266/19560 | loss 3.348689 (-0.93z)| norm 0.2616 (+0.11z)| lr 3.34e-04 | 4206.23 ms | 32.1% bf16 MFU | 126113 tok/s step 11267/19560 | loss 3.379149 (-0.24z)| norm 0.2486 (-0.64z)| lr 3.34e-04 | 4144.67 ms | 32.6% bf16 MFU | 126132 tok/s step 11268/19560 | loss 3.469810 (+1.77z)| norm 0.2656 (+0.32z)| lr 3.33e-04 | 4146.58 ms | 32.6% bf16 MFU | 126147 tok/s step 11269/19560 | loss 3.392010 (+0.04z)| norm 0.2466 (-0.75z)| lr 3.33e-04 | 4145.41 ms | 32.6% bf16 MFU | 126164 tok/s step 11270/19560 | loss 3.340072 (-1.11z)| norm 0.2593 (-0.03z)| lr 3.33e-04 | 4150.09 ms | 32.5% bf16 MFU | 126172 tok/s step 11271/19560 | loss 3.334614 (-1.21z)| norm 0.2376 (-1.27z)| lr 3.33e-04 | 4146.78 ms | 32.6% bf16 MFU | 126185 tok/s step 11272/19560 | loss 3.376130 (-0.29z)| norm 0.2438 (-0.91z)| lr 3.33e-04 | 4148.17 ms | 32.5% bf16 MFU | 126195 tok/s step 11273/19560 | loss 3.343163 (-1.02z)| norm 0.2384 (-1.23z)| lr 
3.33e-04 | 4148.65 ms | 32.5% bf16 MFU | 126204 tok/s step 11274/19560 | loss 3.381944 (-0.16z)| norm 0.2594 (-0.02z)| lr 3.33e-04 | 4151.65 ms | 32.5% bf16 MFU | 126208 tok/s step 11275/19560 | loss 3.377736 (-0.25z)| norm 0.2351 (-1.41z)| lr 3.33e-04 | 4150.66 ms | 32.5% bf16 MFU | 126214 tok/s step 11276/19560 | loss 3.369447 (-0.42z)| norm 0.2554 (-0.25z)| lr 3.33e-04 | 4150.50 ms | 32.5% bf16 MFU | 126219 tok/s step 11277/19560 | loss 3.364061 (-0.54z)| norm 0.2487 (-0.63z)| lr 3.33e-04 | 4149.56 ms | 32.5% bf16 MFU | 126225 tok/s step 11278/19560 | loss 3.346542 (-0.92z)| norm 0.2555 (-0.23z)| lr 3.33e-04 | 4152.26 ms | 32.5% bf16 MFU | 126227 tok/s step 11279/19560 | loss 3.374634 (-0.29z)| norm 0.2704 (+0.63z)| lr 3.33e-04 | 4151.86 ms | 32.5% bf16 MFU | 126230 tok/s step 11280/19560 | loss 3.350328 (-0.82z)| norm 0.2361 (-1.33z)| lr 3.33e-04 | 4151.32 ms | 32.5% bf16 MFU | 126233 tok/s step 11281/19560 | loss 3.356337 (-0.68z)| norm 0.2347 (-1.39z)| lr 3.33e-04 | 4150.83 ms | 32.5% bf16 MFU | 126237 tok/s step 11282/19560 | loss 3.366521 (-0.46z)| norm 0.2476 (-0.66z)| lr 3.33e-04 | 4151.00 ms | 32.5% bf16 MFU | 126240 tok/s step 11283/19560 | loss 3.377134 (-0.22z)| norm 0.2456 (-0.77z)| lr 3.32e-04 | 4154.08 ms | 32.5% bf16 MFU | 126239 tok/s step 11284/19560 | loss 3.364361 (-0.49z)| norm 0.2582 (-0.05z)| lr 3.32e-04 | 4152.44 ms | 32.5% bf16 MFU | 126240 tok/s step 11285/19560 | loss 3.357689 (-0.64z)| norm 0.2270 (-1.81z)| lr 3.32e-04 | 4156.90 ms | 32.5% bf16 MFU | 126234 tok/s step 11286/19560 | loss 3.391175 (+0.11z)| norm 0.2433 (-0.87z)| lr 3.32e-04 | 4149.59 ms | 32.5% bf16 MFU | 126240 tok/s step 11287/19560 | loss 3.481045 (+2.07z)| norm 0.2347 (-1.36z)| lr 3.32e-04 | 4155.96 ms | 32.5% bf16 MFU | 126235 tok/s step 11288/19560 | loss 3.389371 (+0.06z)| norm 0.2320 (-1.49z)| lr 3.32e-04 | 4152.61 ms | 32.5% bf16 MFU | 126236 tok/s step 11289/19560 | loss 3.449791 (+1.37z)| norm 0.2479 (-0.61z)| lr 3.32e-04 | 4152.36 ms | 32.5% bf16 MFU | 126238 
tok/s step 11290/19560 | loss 3.348620 (-0.84z)| norm 0.2357 (-1.27z)| lr 3.32e-04 | 4150.30 ms | 32.5% bf16 MFU | 126242 tok/s step 11291/19560 | loss 3.340221 (-1.01z)| norm 0.2350 (-1.30z)| lr 3.32e-04 | 4153.77 ms | 32.5% bf16 MFU | 126241 tok/s step 11292/19560 | loss 3.382226 (-0.09z)| norm 0.2302 (-1.54z)| lr 3.32e-04 | 4154.56 ms | 32.5% bf16 MFU | 126239 tok/s step 11293/19560 | loss 3.307044 (-1.69z)| norm 0.2524 (-0.31z)| lr 3.32e-04 | 4152.12 ms | 32.5% bf16 MFU | 126240 tok/s step 11294/19560 | loss 3.439577 (+1.17z)| norm 0.2264 (-1.71z)| lr 3.32e-04 | 4154.10 ms | 32.5% bf16 MFU | 126239 tok/s step 11295/19560 | loss 3.421930 (+0.79z)| norm 0.2427 (-0.82z)| lr 3.32e-04 | 4185.98 ms | 32.3% bf16 MFU | 126189 tok/s step 11296/19560 | loss 3.369383 (-0.35z)| norm 0.2319 (-1.39z)| lr 3.32e-04 | 4155.46 ms | 32.5% bf16 MFU | 126188 tok/s step 11297/19560 | loss 3.382577 (-0.06z)| norm 0.2399 (-0.94z)| lr 3.32e-04 | 4153.45 ms | 32.5% bf16 MFU | 126190 tok/s step 11298/19560 | loss 3.391016 (+0.12z)| norm 0.2687 (+0.60z)| lr 3.31e-04 | 4156.07 ms | 32.5% bf16 MFU | 126188 tok/s step 11299/19560 | loss 3.303960 (-1.74z)| norm 0.2442 (-0.70z)| lr 3.31e-04 | 4153.18 ms | 32.5% bf16 MFU | 126191 tok/s step 11300/19560 | loss 3.395537 (+0.24z)| norm 0.2383 (-1.01z)| lr 3.31e-04 | 4154.78 ms | 32.5% bf16 MFU | 126191 tok/s step 11301/19560 | loss 3.331454 (-1.13z)| norm 0.2431 (-0.75z)| lr 3.31e-04 | 4154.14 ms | 32.5% bf16 MFU | 126192 tok/s step 11302/19560 | loss 3.384134 (+0.00z)| norm 0.2409 (-0.86z)| lr 3.31e-04 | 4154.41 ms | 32.5% bf16 MFU | 126192 tok/s step 11303/19560 | loss 3.348970 (-0.75z)| norm 0.2583 (+0.06z)| lr 3.31e-04 | 4153.92 ms | 32.5% bf16 MFU | 126193 tok/s step 11304/19560 | loss 3.316138 (-1.43z)| norm 0.2341 (-1.21z)| lr 3.31e-04 | 4156.49 ms | 32.5% bf16 MFU | 126190 tok/s step 11305/19560 | loss 3.363411 (-0.41z)| norm 0.2540 (-0.16z)| lr 3.31e-04 | 4156.13 ms | 32.5% bf16 MFU | 126188 tok/s step 11306/19560 | loss 3.379137 
(-0.06z)| norm 0.2436 (-0.70z)| lr 3.31e-04 | 4151.84 ms | 32.5% bf16 MFU | 126193 tok/s step 11307/19560 | loss 3.357497 (-0.53z)| norm 0.2599 (+0.16z)| lr 3.31e-04 | 4150.41 ms | 32.5% bf16 MFU | 126199 tok/s step 11308/19560 | loss 3.558221 (+3.61z)| norm 0.2717 (+0.78z)| lr 3.31e-04 | 4150.62 ms | 32.5% bf16 MFU | 126205 tok/s step 11309/19560 | loss 3.385383 (+0.04z)| norm 0.2579 (+0.05z)| lr 3.31e-04 | 4154.59 ms | 32.5% bf16 MFU | 126204 tok/s step 11310/19560 | loss 3.395293 (+0.24z)| norm 0.2638 (+0.36z)| lr 3.31e-04 | 4154.74 ms | 32.5% bf16 MFU | 126204 tok/s step 11311/19560 | loss 3.329195 (-1.11z)| norm 0.2524 (-0.24z)| lr 3.31e-04 | 4163.96 ms | 32.4% bf16 MFU | 126189 tok/s step 11312/19560 | loss 3.353743 (-0.59z)| norm 0.2449 (-0.63z)| lr 3.31e-04 | 4158.55 ms | 32.5% bf16 MFU | 126183 tok/s step 11313/19560 | loss 3.393047 (+0.22z)| norm 0.2464 (-0.54z)| lr 3.30e-04 | 4155.51 ms | 32.5% bf16 MFU | 126183 tok/s step 11314/19560 | loss 3.328740 (-1.12z)| norm 0.2582 (+0.10z)| lr 3.30e-04 | 4155.89 ms | 32.5% bf16 MFU | 126181 tok/s step 11315/19560 | loss 3.400667 (+0.38z)| norm 0.2647 (+0.44z)| lr 3.30e-04 | 4150.48 ms | 32.5% bf16 MFU | 126188 tok/s step 11316/19560 | loss 3.380268 (-0.04z)| norm 0.2302 (-1.38z)| lr 3.30e-04 | 4152.75 ms | 32.5% bf16 MFU | 126191 tok/s step 11317/19560 | loss 3.441958 (+1.24z)| norm 0.2662 (+0.53z)| lr 3.30e-04 | 4281.80 ms | 31.5% bf16 MFU | 126004 tok/s step 11318/19560 | loss 3.405198 (+0.46z)| norm 0.2657 (+0.49z)| lr 3.30e-04 | 4280.96 ms | 31.5% bf16 MFU | 125827 tok/s step 11319/19560 | loss 3.495261 (+2.29z)| norm 0.2657 (+0.49z)| lr 3.30e-04 | 4232.30 ms | 31.9% bf16 MFU | 125730 tok/s step 11320/19560 | loss 3.364149 (-0.42z)| norm 0.2399 (-0.89z)| lr 3.30e-04 | 4255.84 ms | 31.7% bf16 MFU | 125603 tok/s step 11321/19560 | loss 3.379273 (-0.12z)| norm 0.2860 (+1.56z)| lr 3.30e-04 | 4159.18 ms | 32.5% bf16 MFU | 125626 tok/s step 11322/19560 | loss 3.411466 (+0.55z)| norm 0.2488 (-0.43z)| lr 3.30e-04 | 
4257.83 ms | 31.7% bf16 MFU | 125501 tok/s step 11323/19560 | loss 3.403682 (+0.41z)| norm 0.2512 (-0.29z)| lr 3.30e-04 | 4212.98 ms | 32.0% bf16 MFU | 125448 tok/s step 11324/19560 | loss 3.359959 (-0.51z)| norm 0.2440 (-0.67z)| lr 3.30e-04 | 4160.36 ms | 32.5% bf16 MFU | 125477 tok/s step 11325/19560 | loss 3.444300 (+1.26z)| norm 0.2667 (+0.58z)| lr 3.30e-04 | 4157.60 ms | 32.5% bf16 MFU | 125508 tok/s step 11326/19560 | loss 3.380283 (-0.09z)| norm 0.2501 (-0.33z)| lr 3.30e-04 | 4156.63 ms | 32.5% bf16 MFU | 125539 tok/s step 11327/19560 | loss 3.403827 (+0.42z)| norm 0.2526 (-0.17z)| lr 3.30e-04 | 4155.37 ms | 32.5% bf16 MFU | 125571 tok/s step 11328/19560 | loss 3.326527 (-1.22z)| norm 0.2533 (-0.12z)| lr 3.29e-04 | 4155.30 ms | 32.5% bf16 MFU | 125601 tok/s step 11329/19560 | loss 3.409691 (+0.55z)| norm 0.2543 (-0.06z)| lr 3.29e-04 | 4155.13 ms | 32.5% bf16 MFU | 125630 tok/s step 11330/19560 | loss 3.400996 (+0.37z)| norm 0.2453 (-0.56z)| lr 3.29e-04 | 4152.94 ms | 32.5% bf16 MFU | 125661 tok/s step 11331/19560 | loss 3.371338 (-0.27z)| norm 0.2549 (-0.00z)| lr 3.29e-04 | 4205.15 ms | 32.1% bf16 MFU | 125612 tok/s step 11332/19560 | loss 3.367392 (-0.35z)| norm 0.2496 (-0.30z)| lr 3.29e-04 | 4197.44 ms | 32.2% bf16 MFU | 125576 tok/s step 11333/19560 | loss 3.356162 (-0.59z)| norm 0.2547 (-0.00z)| lr 3.29e-04 | 4157.24 ms | 32.5% bf16 MFU | 125603 tok/s step 11334/19560 | loss 3.379415 (-0.09z)| norm 0.2542 (-0.03z)| lr 3.29e-04 | 4152.97 ms | 32.5% bf16 MFU | 125635 tok/s step 11335/19560 | loss 3.411776 (+0.60z)| norm 0.2560 (+0.08z)| lr 3.29e-04 | 4151.67 ms | 32.5% bf16 MFU | 125668 tok/s step 11336/19560 | loss 3.347108 (-0.77z)| norm 0.2479 (-0.40z)| lr 3.29e-04 | 4156.29 ms | 32.5% bf16 MFU | 125691 tok/s step 11337/19560 | loss 3.333409 (-1.06z)| norm 0.2675 (+0.76z)| lr 3.29e-04 | 4155.39 ms | 32.5% bf16 MFU | 125715 tok/s step 11338/19560 | loss 3.335456 (-1.00z)| norm 0.2351 (-1.15z)| lr 3.29e-04 | 4156.39 ms | 32.5% bf16 MFU | 125737 tok/s step 
11339/19560 | loss 3.589007 (+4.07z)| norm 0.2692 (+0.86z)| lr 3.29e-04 | 4156.47 ms | 32.5% bf16 MFU | 125757 tok/s step 11340/19560 | loss 3.385501 (+0.03z)| norm 0.2469 (-0.45z)| lr 3.29e-04 | 4157.96 ms | 32.5% bf16 MFU | 125774 tok/s step 11341/19560 | loss 3.391294 (+0.14z)| norm 0.2454 (-0.52z)| lr 3.29e-04 | 4154.94 ms | 32.5% bf16 MFU | 125794 tok/s step 11342/19560 | loss 3.405029 (+0.41z)| norm 0.2651 (+0.64z)| lr 3.29e-04 | 4153.71 ms | 32.5% bf16 MFU | 125815 tok/s step 11343/19560 | loss 3.356440 (-0.56z)| norm 0.2410 (-0.79z)| lr 3.28e-04 | 4155.42 ms | 32.5% bf16 MFU | 125833 tok/s step 11344/19560 | loss 3.318407 (-1.30z)| norm 0.2496 (-0.25z)| lr 3.28e-04 | 4149.46 ms | 32.5% bf16 MFU | 125859 tok/s step 11345/19560 | loss 3.356239 (-0.54z)| norm 0.2415 (-0.74z)| lr 3.28e-04 | 4155.53 ms | 32.5% bf16 MFU | 125874 tok/s step 11346/19560 | loss 3.420163 (+0.71z)| norm 0.2509 (-0.14z)| lr 3.28e-04 | 4154.74 ms | 32.5% bf16 MFU | 125890 tok/s step 11347/19560 | loss 3.372460 (-0.23z)| norm 0.2247 (-1.77z)| lr 3.28e-04 | 4155.77 ms | 32.5% bf16 MFU | 125904 tok/s step 11348/19560 | loss 3.434930 (+1.00z)| norm 0.2459 (-0.42z)| lr 3.28e-04 | 4155.86 ms | 32.5% bf16 MFU | 125916 tok/s step 11349/19560 | loss 3.339610 (-0.88z)| norm 0.2568 (+0.27z)| lr 3.28e-04 | 4152.09 ms | 32.5% bf16 MFU | 125934 tok/s step 11350/19560 | loss 3.389470 (+0.11z)| norm 0.2642 (+0.76z)| lr 3.28e-04 | 4156.99 ms | 32.5% bf16 MFU | 125943 tok/s step 11351/19560 | loss 3.353872 (-0.59z)| norm 0.2506 (-0.11z)| lr 3.28e-04 | 4153.24 ms | 32.5% bf16 MFU | 125958 tok/s step 11352/19560 | loss 3.336243 (-0.92z)| norm 0.2673 (+0.95z)| lr 3.28e-04 | 4155.37 ms | 32.5% bf16 MFU | 125969 tok/s step 11353/19560 | loss 3.444357 (+1.19z)| norm 0.2749 (+1.45z)| lr 3.28e-04 | 4142.86 ms | 32.6% bf16 MFU | 125998 tok/s step 11354/19560 | loss 3.426678 (+0.83z)| norm 0.2931 (+2.53z)| lr 3.28e-04 | 4149.84 ms | 32.5% bf16 MFU | 126015 tok/s step 11355/19560 | loss 3.347857 (-0.71z)| norm 
0.2686 (+1.02z)| lr 3.28e-04 | 4167.44 ms | 32.4% bf16 MFU | 126004 tok/s step 11356/19560 | loss 3.283538 (-1.92z)| norm 0.2534 (+0.06z)| lr 3.28e-04 | 4151.95 ms | 32.5% bf16 MFU | 126018 tok/s step 11357/19560 | loss 3.364646 (-0.34z)| norm 0.2542 (+0.12z)| lr 3.28e-04 | 4705.53 ms | 28.7% bf16 MFU | 125288 tok/s step 11358/19560 | loss 3.480755 (+2.02z)| norm 0.2722 (+1.31z)| lr 3.27e-04 | 4535.16 ms | 29.8% bf16 MFU | 124804 tok/s step 11359/19560 | loss 3.421638 (+0.85z)| norm 0.2475 (-0.30z)| lr 3.27e-04 | 4379.79 ms | 30.8% bf16 MFU | 124549 tok/s step 11360/19560 | loss 3.361969 (-0.40z)| norm 0.2587 (+0.56z)| lr 3.27e-04 | 4602.18 ms | 29.3% bf16 MFU | 124018 tok/s step 11361/19560 | loss 3.431871 (+1.06z)| norm 0.2603 (+0.70z)| lr 3.27e-04 | 4246.43 ms | 31.8% bf16 MFU | 123990 tok/s step 11362/19560 | loss 3.366680 (-0.31z)| norm 0.2848 (+2.55z)| lr 3.27e-04 | 4183.72 ms | 32.3% bf16 MFU | 124056 tok/s step 11363/19560 | loss 3.346563 (-0.74z)| norm 0.3021 (+3.63z)| lr 3.27e-04 | 4379.06 ms | 30.8% bf16 MFU | 123840 tok/s step 11364/19560 | loss 3.369245 (-0.26z)| norm 0.2633 (+0.84z)| lr 3.27e-04 | 4318.90 ms | 31.3% bf16 MFU | 123718 tok/s step 11365/19560 | loss 3.402127 (+0.42z)| norm 0.2644 (+0.91z)| lr 3.27e-04 | 4300.75 ms | 31.4% bf16 MFU | 123627 tok/s step 11366/19560 | loss 3.387445 (+0.11z)| norm 0.2674 (+1.12z)| lr 3.27e-04 | 4298.59 ms | 31.4% bf16 MFU | 123544 tok/s step 11367/19560 | loss 3.370818 (-0.26z)| norm 0.2629 (+0.78z)| lr 3.27e-04 | 4198.22 ms | 32.2% bf16 MFU | 123611 tok/s step 11368/19560 | loss 3.350024 (-0.72z)| norm 0.2741 (+1.56z)| lr 3.27e-04 | 4196.14 ms | 32.2% bf16 MFU | 123678 tok/s step 11369/19560 | loss 3.364799 (-0.40z)| norm 0.2523 (+0.01z)| lr 3.27e-04 | 4163.73 ms | 32.4% bf16 MFU | 123790 tok/s step 11370/19560 | loss 3.383435 (+0.00z)| norm 0.2729 (+1.45z)| lr 3.27e-04 | 4165.27 ms | 32.4% bf16 MFU | 123894 tok/s step 11371/19560 | loss 3.369462 (-0.29z)| norm 0.2668 (+1.00z)| lr 3.27e-04 | 4162.99 ms | 
32.4% bf16 MFU | 123996 tok/s step 11372/19560 | loss 3.458158 (+1.60z)| norm 0.2546 (+0.13z)| lr 3.27e-04 | 4166.40 ms | 32.4% bf16 MFU | 124088 tok/s step 11373/19560 | loss 3.365540 (-0.38z)| norm 0.2595 (+0.47z)| lr 3.26e-04 | 4229.92 ms | 31.9% bf16 MFU | 124081 tok/s step 11374/19560 | loss 3.385907 (+0.05z)| norm 0.2565 (+0.24z)| lr 3.26e-04 | 4183.52 ms | 32.3% bf16 MFU | 124143 tok/s step 11375/19560 | loss 3.344235 (-0.84z)| norm 0.2532 (+0.01z)| lr 3.26e-04 | 4169.42 ms | 32.4% bf16 MFU | 124223 tok/s step 11376/19560 | loss 3.358437 (-0.52z)| norm 0.2519 (-0.08z)| lr 3.26e-04 | 4191.50 ms | 32.2% bf16 MFU | 124266 tok/s step 11377/19560 | loss 3.531550 (+3.07z)| norm 0.2681 (+1.08z)| lr 3.26e-04 | 4171.33 ms | 32.4% bf16 MFU | 124337 tok/s step 11378/19560 | loss 3.438890 (+1.13z)| norm 0.2454 (-0.56z)| lr 3.26e-04 | 4162.01 ms | 32.4% bf16 MFU | 124419 tok/s step 11379/19560 | loss 3.353661 (-0.62z)| norm 0.2730 (+1.42z)| lr 3.26e-04 | 4173.50 ms | 32.4% bf16 MFU | 124479 tok/s step 11380/19560 | loss 3.374841 (-0.18z)| norm 0.2445 (-0.64z)| lr 3.26e-04 | 4171.16 ms | 32.4% bf16 MFU | 124540 tok/s step 11381/19560 | loss 3.335917 (-0.98z)| norm 0.2817 (+2.01z)| lr 3.26e-04 | 4188.02 ms | 32.2% bf16 MFU | 124572 tok/s step 11382/19560 | loss 3.492408 (+2.18z)| norm 0.2663 (+0.89z)| lr 3.26e-04 | 4170.60 ms | 32.4% bf16 MFU | 124629 tok/s step 11383/19560 | loss 3.384051 (-0.00z)| norm 0.2868 (+2.29z)| lr 3.26e-04 | 4177.84 ms | 32.3% bf16 MFU | 124672 tok/s step 11384/19560 | loss 3.360644 (-0.48z)| norm 0.2698 (+1.10z)| lr 3.26e-04 | 4160.25 ms | 32.5% bf16 MFU | 124740 tok/s step 11385/19560 | loss 3.495454 (+2.19z)| norm 0.2503 (-0.27z)| lr 3.26e-04 | 4187.20 ms | 32.2% bf16 MFU | 124764 tok/s step 11386/19560 | loss 3.390418 (+0.13z)| norm 0.2651 (+0.75z)| lr 3.26e-04 | 4212.19 ms | 32.1% bf16 MFU | 124749 tok/s step 11387/19560 | loss 3.429060 (+0.90z)| norm 0.2533 (-0.08z)| lr 3.26e-04 | 4168.81 ms | 32.4% bf16 MFU | 124800 tok/s step 11388/19560 
| loss 3.418787 (+0.68z)| norm 0.2567 (+0.16z)| lr 3.25e-04 | 4161.98 ms | 32.4% bf16 MFU | 124858 tok/s step 11389/19560 | loss 3.404217 (+0.37z)| norm 0.2461 (-0.59z)| lr 3.25e-04 | 4157.83 ms | 32.5% bf16 MFU | 124920 tok/s step 11390/19560 | loss 3.358709 (-0.55z)| norm 0.2641 (+0.67z)| lr 3.25e-04 | 4172.08 ms | 32.4% bf16 MFU | 124957 tok/s step 11391/19560 | loss 3.349033 (-0.74z)| norm 0.2381 (-1.14z)| lr 3.25e-04 | 4211.42 ms | 32.1% bf16 MFU | 124934 tok/s step 11392/19560 | loss 3.355076 (-0.61z)| norm 0.2425 (-0.82z)| lr 3.25e-04 | 4266.44 ms | 31.6% bf16 MFU | 124832 tok/s step 11393/19560 | loss 3.404228 (+0.41z)| norm 0.2447 (-0.65z)| lr 3.25e-04 | 4266.83 ms | 31.6% bf16 MFU | 124734 tok/s step 11394/19560 | loss 3.365235 (-0.40z)| norm 0.2495 (-0.31z)| lr 3.25e-04 | 4201.86 ms | 32.1% bf16 MFU | 124736 tok/s step 11395/19560 | loss 3.358252 (-0.54z)| norm 0.2499 (-0.28z)| lr 3.25e-04 | 4206.87 ms | 32.1% bf16 MFU | 124731 tok/s step 11396/19560 | loss 3.332567 (-1.06z)| norm 0.2388 (-1.05z)| lr 3.25e-04 | 4163.68 ms | 32.4% bf16 MFU | 124790 tok/s step 11397/19560 | loss 3.402982 (+0.41z)| norm 0.2320 (-1.51z)| lr 3.25e-04 | 4226.54 ms | 31.9% bf16 MFU | 124753 tok/s step 11398/19560 | loss 3.381423 (-0.05z)| norm 0.2524 (-0.07z)| lr 3.25e-04 | 4224.90 ms | 32.0% bf16 MFU | 124720 tok/s step 11399/19560 | loss 3.398032 (+0.29z)| norm 0.2432 (-0.73z)| lr 3.25e-04 | 4163.66 ms | 32.4% bf16 MFU | 124780 tok/s step 11400/19560 | loss 3.374271 (-0.21z)| norm 0.2440 (-0.67z)| lr 3.25e-04 | 4170.40 ms | 32.4% bf16 MFU | 124827 tok/s step 11401/19560 | loss 3.336090 (-1.01z)| norm 0.2777 (+1.68z)| lr 3.25e-04 | 4167.08 ms | 32.4% bf16 MFU | 124876 tok/s step 11402/19560 | loss 3.404971 (+0.43z)| norm 0.2413 (-0.87z)| lr 3.25e-04 | 4171.52 ms | 32.4% bf16 MFU | 124917 tok/s step 11403/19560 | loss 3.334107 (-1.05z)| norm 0.2558 (+0.14z)| lr 3.24e-04 | 4171.79 ms | 32.4% bf16 MFU | 124954 tok/s step 11404/19560 | loss 3.304110 (-1.65z)| norm 0.2363 (-1.21z)| 
lr 3.24e-04 | 4172.84 ms | 32.4% bf16 MFU | 124989 tok/s step 11405/19560 | loss 3.364653 (-0.39z)| norm 0.2380 (-1.09z)| lr 3.24e-04 | 4161.96 ms | 32.4% bf16 MFU | 125038 tok/s step 11406/19560 | loss 3.379766 (-0.08z)| norm 0.2336 (-1.37z)| lr 3.24e-04 | 4165.09 ms | 32.4% bf16 MFU | 125080 tok/s step 11407/19560 | loss 3.398240 (+0.30z)| norm 0.2485 (-0.33z)| lr 3.24e-04 | 4177.79 ms | 32.3% bf16 MFU | 125101 tok/s step 11408/19560 | loss 3.287726 (-1.96z)| norm 0.2702 (+1.16z)| lr 3.24e-04 | 4211.86 ms | 32.1% bf16 MFU | 125070 tok/s step 11409/19560 | loss 3.329227 (-1.10z)| norm 0.2477 (-0.41z)| lr 3.24e-04 | 4212.74 ms | 32.0% bf16 MFU | 125039 tok/s step 11410/19560 | loss 3.396509 (+0.27z)| norm 0.2501 (-0.24z)| lr 3.24e-04 | 4167.45 ms | 32.4% bf16 MFU | 125077 tok/s step 11411/19560 | loss 3.376790 (-0.14z)| norm 0.2431 (-0.74z)| lr 3.24e-04 | 4173.34 ms | 32.4% bf16 MFU | 125105 tok/s step 11412/19560 | loss 3.367554 (-0.33z)| norm 0.2405 (-0.90z)| lr 3.24e-04 | 4167.10 ms | 32.4% bf16 MFU | 125140 tok/s step 11413/19560 | loss 3.406305 (+0.46z)| norm 0.2518 (-0.13z)| lr 3.24e-04 | 4165.58 ms | 32.4% bf16 MFU | 125176 tok/s step 11414/19560 | loss 3.406397 (+0.46z)| norm 0.2497 (-0.28z)| lr 3.24e-04 | 4166.35 ms | 32.4% bf16 MFU | 125209 tok/s step 11415/19560 | loss 3.402138 (+0.39z)| norm 0.2485 (-0.38z)| lr 3.24e-04 | 4170.01 ms | 32.4% bf16 MFU | 125235 tok/s step 11416/19560 | loss 3.461418 (+1.59z)| norm 0.2535 (-0.03z)| lr 3.24e-04 | 4168.65 ms | 32.4% bf16 MFU | 125262 tok/s step 11417/19560 | loss 3.328827 (-1.11z)| norm 0.2712 (+1.22z)| lr 3.24e-04 | 4180.41 ms | 32.3% bf16 MFU | 125270 tok/s step 11418/19560 | loss 3.360564 (-0.46z)| norm 0.2763 (+1.56z)| lr 3.23e-04 | 4179.16 ms | 32.3% bf16 MFU | 125279 tok/s step 11419/19560 | loss 3.382011 (-0.03z)| norm 0.2813 (+1.88z)| lr 3.23e-04 | 4177.15 ms | 32.3% bf16 MFU | 125291 tok/s step 11420/19560 | loss 3.467253 (+1.70z)| norm 0.2726 (+1.24z)| lr 3.23e-04 | 4174.76 ms | 32.3% bf16 MFU | 
125305 tok/s step 11421/19560 | loss 3.410738 (+0.53z)| norm 0.2887 (+2.32z)| lr 3.23e-04 | 4194.77 ms | 32.2% bf16 MFU | 125289 tok/s step 11422/19560 | loss 3.483984 (+2.01z)| norm 0.3101 (+3.63z)| lr 3.23e-04 | 4173.81 ms | 32.3% bf16 MFU | 125306 tok/s step 11423/19560 | loss 3.372996 (-0.24z)| norm 0.2941 (+2.47z)| lr 3.23e-04 | 4175.24 ms | 32.3% bf16 MFU | 125319 tok/s step 11424/19560 | loss 3.323107 (-1.25z)| norm 0.2824 (+1.68z)| lr 3.23e-04 | 4180.19 ms | 32.3% bf16 MFU | 125324 tok/s step 11425/19560 | loss 3.482107 (+1.94z)| norm 0.2822 (+1.63z)| lr 3.23e-04 | 4174.84 ms | 32.3% bf16 MFU | 125337 tok/s step 11426/19560 | loss 3.432415 (+0.93z)| norm 0.2843 (+1.75z)| lr 3.23e-04 | 4165.21 ms | 32.4% bf16 MFU | 125364 tok/s step 11427/19560 | loss 3.306408 (-1.58z)| norm 0.2662 (+0.56z)| lr 3.23e-04 | 4193.52 ms | 32.2% bf16 MFU | 125347 tok/s step 11428/19560 | loss 3.379005 (-0.13z)| norm 0.2902 (+2.07z)| lr 3.23e-04 | 4434.89 ms | 30.4% bf16 MFU | 124990 tok/s step 11429/19560 | loss 3.465065 (+1.56z)| norm 0.2759 (+1.14z)| lr 3.23e-04 | 4166.50 ms | 32.4% bf16 MFU | 125032 tok/s step 11430/19560 | loss 3.386381 (-0.00z)| norm 0.2735 (+0.97z)| lr 3.23e-04 | 4170.99 ms | 32.4% bf16 MFU | 125066 tok/s step 11431/19560 | loss 3.384646 (-0.04z)| norm 0.2923 (+2.11z)| lr 3.23e-04 | 4174.93 ms | 32.3% bf16 MFU | 125092 tok/s step 11432/19560 | loss 3.416440 (+0.58z)| norm 0.2578 (-0.07z)| lr 3.23e-04 | 4180.50 ms | 32.3% bf16 MFU | 125108 tok/s step 11433/19560 | loss 3.339645 (-0.95z)| norm 0.2648 (+0.37z)| lr 3.22e-04 | 4173.15 ms | 32.4% bf16 MFU | 125134 tok/s step 11434/19560 | loss 3.327536 (-1.18z)| norm 0.2426 (-1.04z)| lr 3.22e-04 | 4175.49 ms | 32.3% bf16 MFU | 125155 tok/s step 11435/19560 | loss 3.313823 (-1.44z)| norm 0.2528 (-0.39z)| lr 3.22e-04 | 4178.79 ms | 32.3% bf16 MFU | 125171 tok/s step 11436/19560 | loss 3.337562 (-0.98z)| norm 0.2682 (+0.59z)| lr 3.22e-04 | 4191.52 ms | 32.2% bf16 MFU | 125166 tok/s step 11437/19560 | loss 3.496897 
(+2.24z)| norm 0.2526 (-0.40z)| lr 3.22e-04 | 4165.27 ms | 32.4% bf16 MFU | 125202 tok/s step 11438/19560 | loss 3.324488 (-1.22z)| norm 0.2535 (-0.34z)| lr 3.22e-04 | 4182.94 ms | 32.3% bf16 MFU | 125209 tok/s step 11439/19560 | loss 3.360831 (-0.50z)| norm 0.2641 (+0.33z)| lr 3.22e-04 | 4181.22 ms | 32.3% bf16 MFU | 125218 tok/s step 11440/19560 | loss 3.394906 (+0.18z)| norm 0.2565 (-0.16z)| lr 3.22e-04 | 4180.84 ms | 32.3% bf16 MFU | 125227 tok/s step 11441/19560 | loss 3.390618 (+0.10z)| norm 0.2877 (+1.79z)| lr 3.22e-04 | 4186.69 ms | 32.2% bf16 MFU | 125227 tok/s step 11442/19560 | loss 3.357118 (-0.59z)| norm 0.2622 (+0.18z)| lr 3.22e-04 | 4169.58 ms | 32.4% bf16 MFU | 125253 tok/s step 11443/19560 | loss 3.382658 (-0.07z)| norm 0.2753 (+1.00z)| lr 3.22e-04 | 4174.84 ms | 32.3% bf16 MFU | 125269 tok/s step 11444/19560 | loss 3.410665 (+0.50z)| norm 0.2717 (+0.76z)| lr 3.22e-04 | 4744.17 ms | 28.5% bf16 MFU | 124531 tok/s step 11445/19560 | loss 3.342480 (-0.87z)| norm 0.2672 (+0.47z)| lr 3.22e-04 | 4169.14 ms | 32.4% bf16 MFU | 124592 tok/s step 11446/19560 | loss 3.392701 (+0.15z)| norm 0.2607 (+0.07z)| lr 3.22e-04 | 4173.62 ms | 32.4% bf16 MFU | 124644 tok/s step 11447/19560 | loss 3.363390 (-0.43z)| norm 0.2478 (-0.75z)| lr 3.22e-04 | 4167.79 ms | 32.4% bf16 MFU | 124701 tok/s step 11448/19560 | loss 3.395191 (+0.22z)| norm 0.2663 (+0.42z)| lr 3.21e-04 | 4165.46 ms | 32.4% bf16 MFU | 124760 tok/s step 11449/19560 | loss 3.433684 (+1.01z)| norm 0.2525 (-0.45z)| lr 3.21e-04 | 4172.53 ms | 32.4% bf16 MFU | 124804 tok/s step 11450/19560 | loss 3.331929 (-1.08z)| norm 0.2516 (-0.51z)| lr 3.21e-04 | 4163.16 ms | 32.4% bf16 MFU | 124861 tok/s step 11451/19560 | loss 3.370035 (-0.29z)| norm 0.2515 (-0.52z)| lr 3.21e-04 | 4169.91 ms | 32.4% bf16 MFU | 124904 tok/s step 11452/19560 | loss 3.351603 (-0.66z)| norm 0.2566 (-0.20z)| lr 3.21e-04 | 4172.91 ms | 32.4% bf16 MFU | 124941 tok/s step 11453/19560 | loss 3.373161 (-0.21z)| norm 0.2685 (+0.57z)| lr 3.21e-04 | 
4165.16 ms | 32.4% bf16 MFU | 124988 tok/s step 11454/19560 | loss 3.427275 (+0.90z)| norm 0.2489 (-0.69z)| lr 3.21e-04 | 4168.05 ms | 32.4% bf16 MFU | 125028 tok/s step 11455/19560 | loss 3.343971 (-0.81z)| norm 0.2547 (-0.32z)| lr 3.21e-04 | 4168.25 ms | 32.4% bf16 MFU | 125065 tok/s step 11456/19560 | loss 3.404805 (+0.43z)| norm 0.2633 (+0.23z)| lr 3.21e-04 | 4183.16 ms | 32.3% bf16 MFU | 125079 tok/s step 11457/19560 | loss 3.341159 (-0.87z)| norm 0.2459 (-0.89z)| lr 3.21e-04 | 4164.02 ms | 32.4% bf16 MFU | 125120 tok/s step 11458/19560 | loss 3.435664 (+1.07z)| norm 0.2766 (+1.08z)| lr 3.21e-04 | 4178.92 ms | 32.3% bf16 MFU | 125137 tok/s step 11459/19560 | loss 3.352036 (-0.65z)| norm 0.2376 (-1.42z)| lr 3.21e-04 | 4170.32 ms | 32.4% bf16 MFU | 125166 tok/s step 11460/19560 | loss 3.383306 (-0.01z)| norm 0.2523 (-0.48z)| lr 3.21e-04 | 4174.02 ms | 32.3% bf16 MFU | 125188 tok/s step 11461/19560 | loss 3.336887 (-0.95z)| norm 0.2470 (-0.81z)| lr 3.21e-04 | 4250.37 ms | 31.8% bf16 MFU | 125097 tok/s step 11462/19560 | loss 3.317342 (-1.33z)| norm 0.2433 (-1.04z)| lr 3.21e-04 | 4222.88 ms | 32.0% bf16 MFU | 125049 tok/s step 11463/19560 | loss 3.389683 (+0.14z)| norm 0.2470 (-0.80z)| lr 3.21e-04 | 4393.96 ms | 30.7% bf16 MFU | 124763 tok/s step 11464/19560 | loss 3.342184 (-0.83z)| norm 0.2401 (-1.23z)| lr 3.20e-04 | 4247.33 ms | 31.8% bf16 MFU | 124697 tok/s step 11465/19560 | loss 3.297323 (-1.72z)| norm 0.2408 (-1.17z)| lr 3.20e-04 | 4232.62 ms | 31.9% bf16 MFU | 124655 tok/s step 11466/19560 | loss 3.357921 (-0.50z)| norm 0.2498 (-0.61z)| lr 3.20e-04 | 4164.30 ms | 32.4% bf16 MFU | 124718 tok/s step 11467/19560 | loss 3.349246 (-0.69z)| norm 0.2395 (-1.25z)| lr 3.20e-04 | 4168.32 ms | 32.4% bf16 MFU | 124771 tok/s step 11468/19560 | loss 3.526176 (+3.02z)| norm 0.2565 (-0.18z)| lr 3.20e-04 | 4218.87 ms | 32.0% bf16 MFU | 124746 tok/s step 11469/19560 | loss 3.325944 (-1.16z)| norm 0.2407 (-1.17z)| lr 3.20e-04 | 4180.66 ms | 32.3% bf16 MFU | 124779 tok/s step 
11470/19560 | loss 3.396742 (+0.32z)| norm 0.2289 (-1.88z)| lr 3.20e-04 | 4255.64 ms | 31.7% bf16 MFU | 124700 tok/s step 11471/19560 | loss 3.402343 (+0.43z)| norm 0.2935 (+2.11z)| lr 3.20e-04 | 4171.06 ms | 32.4% bf16 MFU | 124750 tok/s step 11472/19560 | loss 3.353887 (-0.59z)| norm 0.2486 (-0.66z)| lr 3.20e-04 | 4174.95 ms | 32.3% bf16 MFU | 124791 tok/s step 11473/19560 | loss 3.355348 (-0.56z)| norm 0.2633 (+0.24z)| lr 3.20e-04 | 4225.63 ms | 32.0% bf16 MFU | 124755 tok/s step 11474/19560 | loss 3.391041 (+0.20z)| norm 0.2343 (-1.54z)| lr 3.20e-04 | 4194.58 ms | 32.2% bf16 MFU | 124767 tok/s step 11475/19560 | loss 3.385252 (+0.07z)| norm 0.2348 (-1.52z)| lr 3.20e-04 | 4161.14 ms | 32.4% bf16 MFU | 124829 tok/s step 11476/19560 | loss 3.365203 (-0.34z)| norm 0.2395 (-1.23z)| lr 3.20e-04 | 4240.89 ms | 31.8% bf16 MFU | 124769 tok/s step 11477/19560 | loss 3.334713 (-0.98z)| norm 0.2590 (-0.03z)| lr 3.20e-04 | 4172.66 ms | 32.4% bf16 MFU | 124813 tok/s step 11478/19560 | loss 3.382153 (+0.02z)| norm 0.2436 (-0.96z)| lr 3.20e-04 | 4176.37 ms | 32.3% bf16 MFU | 124849 tok/s step 11479/19560 | loss 3.404418 (+0.48z)| norm 0.2695 (+0.62z)| lr 3.19e-04 | 4173.07 ms | 32.4% bf16 MFU | 124888 tok/s step 11480/19560 | loss 3.328331 (-1.12z)| norm 0.2723 (+0.79z)| lr 3.19e-04 | 4195.60 ms | 32.2% bf16 MFU | 124892 tok/s step 11481/19560 | loss 3.347535 (-0.70z)| norm 0.2430 (-0.99z)| lr 3.19e-04 | 4170.42 ms | 32.4% bf16 MFU | 124933 tok/s step 11482/19560 | loss 3.404967 (+0.52z)| norm 0.2752 (+1.00z)| lr 3.19e-04 | 4166.49 ms | 32.4% bf16 MFU | 124978 tok/s step 11483/19560 | loss 3.423318 (+0.89z)| norm 0.2396 (-1.19z)| lr 3.19e-04 | 4162.99 ms | 32.4% bf16 MFU | 125026 tok/s step 11484/19560 | loss 3.403738 (+0.47z)| norm 0.2759 (+1.04z)| lr 3.19e-04 | 4251.06 ms | 31.8% bf16 MFU | 124941 tok/s step 11485/19560 | loss 3.387097 (+0.10z)| norm 0.2399 (-1.17z)| lr 3.19e-04 | 4168.73 ms | 32.4% bf16 MFU | 124983 tok/s step 11486/19560 | loss 3.362442 (-0.42z)| norm 
0.2925 (+2.03z)| lr 3.19e-04 | 4162.53 ms | 32.4% bf16 MFU | 125031 tok/s step 11487/19560 | loss 3.302313 (-1.70z)| norm 0.2583 (-0.05z)| lr 3.19e-04 | 4167.86 ms | 32.4% bf16 MFU | 125069 tok/s step 11488/19560 | loss 3.368681 (-0.26z)| norm 0.2615 (+0.14z)| lr 3.19e-04 | 4160.41 ms | 32.5% bf16 MFU | 125117 tok/s step 11489/19560 | loss 3.370784 (-0.20z)| norm 0.2558 (-0.20z)| lr 3.19e-04 | 4211.27 ms | 32.1% bf16 MFU | 125086 tok/s step 11490/19560 | loss 3.374143 (-0.13z)| norm 0.2544 (-0.27z)| lr 3.19e-04 | 4165.94 ms | 32.4% bf16 MFU | 125124 tok/s step 11491/19560 | loss 3.305118 (-1.62z)| norm 0.2638 (+0.33z)| lr 3.19e-04 | 4164.30 ms | 32.4% bf16 MFU | 125163 tok/s step 11492/19560 | loss 3.367942 (-0.26z)| norm 0.2682 (+0.60z)| lr 3.19e-04 | 4226.92 ms | 31.9% bf16 MFU | 125106 tok/s step 11493/19560 | loss 3.375610 (-0.09z)| norm 0.2579 (-0.04z)| lr 3.19e-04 | 4163.10 ms | 32.4% bf16 MFU | 125148 tok/s step 11494/19560 | loss 3.386185 (+0.14z)| norm 0.2615 (+0.19z)| lr 3.18e-04 | 4245.32 ms | 31.8% bf16 MFU | 125065 tok/s step 11495/19560 | loss 3.384577 (+0.11z)| norm 0.2575 (-0.06z)| lr 3.18e-04 | 4190.71 ms | 32.2% bf16 MFU | 125068 tok/s step 11496/19560 | loss 3.386724 (+0.15z)| norm 0.2401 (-1.14z)| lr 3.18e-04 | 4164.88 ms | 32.4% bf16 MFU | 125108 tok/s step 11497/19560 | loss 3.365015 (-0.32z)| norm 0.2533 (-0.31z)| lr 3.18e-04 | 4207.17 ms | 32.1% bf16 MFU | 125084 tok/s step 11498/19560 | loss 3.313498 (-1.42z)| norm 0.2537 (-0.27z)| lr 3.18e-04 | 4166.15 ms | 32.4% bf16 MFU | 125122 tok/s step 11499/19560 | loss 3.387111 (+0.16z)| norm 0.2394 (-1.16z)| lr 3.18e-04 | 4157.95 ms | 32.5% bf16 MFU | 125170 tok/s step 11500/19560 | loss 3.431308 (+1.13z)| norm 0.2476 (-0.64z)| lr 3.18e-04 | 4160.97 ms | 32.4% bf16 MFU | 125212 tok/s val loss 3.348819 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating 
HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 
evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2923/10042 = 0.291077 step 11501/19560 | loss 3.378698 (-0.02z)| norm 0.2385 (-1.20z)| lr 3.18e-04 | 4175.35 ms | 32.3% bf16 MFU | 125230 tok/s step 11502/19560 | loss 3.308089 (-1.52z)| norm 0.2463 (-0.70z)| lr 3.18e-04 | 4171.90 ms | 32.4% bf16 MFU | 
125252 tok/s step 11503/19560 | loss 3.341945 (-0.79z)| norm 0.2294 (-1.73z)| lr 3.18e-04 | 4243.67 ms | 31.8% bf16 MFU | 125166 tok/s step 11504/19560 | loss 3.370966 (-0.17z)| norm 0.2359 (-1.31z)| lr 3.18e-04 | 4310.49 ms | 31.3% bf16 MFU | 124990 tok/s step 11505/19560 | loss 3.389263 (+0.26z)| norm 0.2379 (-1.17z)| lr 3.18e-04 | 4209.27 ms | 32.1% bf16 MFU | 124968 tok/s step 11506/19560 | loss 3.336005 (-0.93z)| norm 0.2537 (-0.21z)| lr 3.18e-04 | 4238.67 ms | 31.9% bf16 MFU | 124904 tok/s step 11507/19560 | loss 3.371499 (-0.13z)| norm 0.2323 (-1.49z)| lr 3.18e-04 | 4242.68 ms | 31.8% bf16 MFU | 124838 tok/s step 11508/19560 | loss 3.318830 (-1.30z)| norm 0.2506 (-0.37z)| lr 3.18e-04 | 4166.64 ms | 32.4% bf16 MFU | 124887 tok/s step 11509/19560 | loss 3.385873 (+0.20z)| norm 0.2611 (+0.28z)| lr 3.17e-04 | 4181.64 ms | 32.3% bf16 MFU | 124912 tok/s step 11510/19560 | loss 3.339740 (-0.84z)| norm 0.2713 (+0.90z)| lr 3.17e-04 | 4172.17 ms | 32.4% bf16 MFU | 124949 tok/s step 11511/19560 | loss 3.405860 (+0.69z)| norm 0.2420 (-0.89z)| lr 3.17e-04 | 4170.56 ms | 32.4% bf16 MFU | 124988 tok/s step 11512/19560 | loss 3.332948 (-0.99z)| norm 0.2358 (-1.25z)| lr 3.17e-04 | 4168.46 ms | 32.4% bf16 MFU | 125027 tok/s step 11513/19560 | loss 3.367786 (-0.17z)| norm 0.2589 (+0.17z)| lr 3.17e-04 | 4165.01 ms | 32.4% bf16 MFU | 125070 tok/s step 11514/19560 | loss 3.393282 (+0.44z)| norm 0.2317 (-1.48z)| lr 3.17e-04 | 4170.43 ms | 32.4% bf16 MFU | 125102 tok/s step 11515/19560 | loss 3.394807 (+0.48z)| norm 0.2469 (-0.55z)| lr 3.17e-04 | 4167.97 ms | 32.4% bf16 MFU | 125136 tok/s step 11516/19560 | loss 3.381953 (+0.18z)| norm 0.2576 (+0.11z)| lr 3.17e-04 | 4171.64 ms | 32.4% bf16 MFU | 125163 tok/s step 11517/19560 | loss 3.390219 (+0.38z)| norm 0.2591 (+0.20z)| lr 3.17e-04 | 4171.74 ms | 32.4% bf16 MFU | 125189 tok/s step 11518/19560 | loss 3.401405 (+0.64z)| norm 0.2451 (-0.65z)| lr 3.17e-04 | 4176.85 ms | 32.3% bf16 MFU | 125206 tok/s step 11519/19560 | loss 3.352819 
(-0.52z)| norm 0.2609 (+0.30z)| lr 3.17e-04 | 4195.53 ms | 32.2% bf16 MFU | 125194 tok/s step 11520/19560 | loss 3.338986 (-0.85z)| norm 0.2380 (-1.10z)| lr 3.17e-04 | 4163.59 ms | 32.4% bf16 MFU | 125230 tok/s step 11521/19560 | loss 3.395594 (+0.51z)| norm 0.2690 (+0.79z)| lr 3.17e-04 | 4163.12 ms | 32.4% bf16 MFU | 125265 tok/s step 11522/19560 | loss 3.332130 (-1.00z)| norm 0.2463 (-0.60z)| lr 3.17e-04 | 4177.50 ms | 32.3% bf16 MFU | 125277 tok/s step 11523/19560 | loss 3.381382 (+0.17z)| norm 0.2497 (-0.39z)| lr 3.17e-04 | 4195.50 ms | 32.2% bf16 MFU | 125262 tok/s step 11524/19560 | loss 3.341139 (-0.79z)| norm 0.2397 (-1.01z)| lr 3.16e-04 | 4164.46 ms | 32.4% bf16 MFU | 125293 tok/s step 11525/19560 | loss 3.303765 (-1.65z)| norm 0.2562 (-0.01z)| lr 3.16e-04 | 4262.11 ms | 31.7% bf16 MFU | 125179 tok/s step 11526/19560 | loss 3.371049 (-0.06z)| norm 0.2488 (-0.46z)| lr 3.16e-04 | 4161.82 ms | 32.4% bf16 MFU | 125219 tok/s step 11527/19560 | loss 3.393648 (+0.48z)| norm 0.2640 (+0.47z)| lr 3.16e-04 | 4178.98 ms | 32.3% bf16 MFU | 125231 tok/s step 11528/19560 | loss 3.388074 (+0.34z)| norm 0.2681 (+0.72z)| lr 3.16e-04 | 4173.81 ms | 32.3% bf16 MFU | 125250 tok/s step 11529/19560 | loss 3.387519 (+0.32z)| norm 0.2518 (-0.28z)| lr 3.16e-04 | 4169.28 ms | 32.4% bf16 MFU | 125275 tok/s step 11530/19560 | loss 3.357028 (-0.39z)| norm 0.2447 (-0.73z)| lr 3.16e-04 | 4170.06 ms | 32.4% bf16 MFU | 125298 tok/s step 11531/19560 | loss 3.379238 (+0.13z)| norm 0.2515 (-0.30z)| lr 3.16e-04 | 4177.88 ms | 32.3% bf16 MFU | 125307 tok/s step 11532/19560 | loss 3.361525 (-0.31z)| norm 0.2600 (+0.22z)| lr 3.16e-04 | 4162.65 ms | 32.4% bf16 MFU | 125340 tok/s step 11533/19560 | loss 3.347960 (-0.63z)| norm 0.2567 (-0.00z)| lr 3.16e-04 | 4176.11 ms | 32.3% bf16 MFU | 125350 tok/s step 11534/19560 | loss 3.401422 (+0.65z)| norm 0.2437 (-0.83z)| lr 3.16e-04 | 4167.15 ms | 32.4% bf16 MFU | 125373 tok/s step 11535/19560 | loss 3.339573 (-0.83z)| norm 0.2397 (-1.08z)| lr 3.16e-04 | 
4170.11 ms | 32.4% bf16 MFU | 125391 tok/s step 11536/19560 | loss 3.385776 (+0.27z)| norm 0.2396 (-1.07z)| lr 3.16e-04 | 4168.63 ms | 32.4% bf16 MFU | 125410 tok/s step 11537/19560 | loss 3.342556 (-0.79z)| norm 0.2628 (+0.39z)| lr 3.16e-04 | 4164.93 ms | 32.4% bf16 MFU | 125433 tok/s step 11538/19560 | loss 3.372403 (-0.05z)| norm 0.2281 (-1.77z)| lr 3.16e-04 | 4177.49 ms | 32.3% bf16 MFU | 125437 tok/s step 11539/19560 | loss 3.326685 (-1.16z)| norm 0.2592 (+0.17z)| lr 3.15e-04 | 4172.54 ms | 32.4% bf16 MFU | 125447 tok/s step 11540/19560 | loss 3.428411 (+1.30z)| norm 0.2348 (-1.36z)| lr 3.15e-04 | 4170.45 ms | 32.4% bf16 MFU | 125461 tok/s step 11541/19560 | loss 3.364314 (-0.24z)| norm 0.2639 (+0.45z)| lr 3.15e-04 | 4163.82 ms | 32.4% bf16 MFU | 125484 tok/s step 11542/19560 | loss 3.337464 (-0.88z)| norm 0.2440 (-0.78z)| lr 3.15e-04 | 4165.98 ms | 32.4% bf16 MFU | 125502 tok/s step 11543/19560 | loss 3.400318 (+0.64z)| norm 0.2584 (+0.11z)| lr 3.15e-04 | 4173.53 ms | 32.4% bf16 MFU | 125508 tok/s step 11544/19560 | loss 3.355271 (-0.44z)| norm 0.2651 (+0.52z)| lr 3.15e-04 | 4161.55 ms | 32.4% bf16 MFU | 125532 tok/s step 11545/19560 | loss 3.425847 (+1.28z)| norm 0.2589 (+0.14z)| lr 3.15e-04 | 4162.56 ms | 32.4% bf16 MFU | 125553 tok/s step 11546/19560 | loss 3.338842 (-0.85z)| norm 0.2857 (+1.80z)| lr 3.15e-04 | 4167.55 ms | 32.4% bf16 MFU | 125565 tok/s step 11547/19560 | loss 3.373248 (-0.01z)| norm 0.2305 (-1.60z)| lr 3.15e-04 | 4389.37 ms | 30.8% bf16 MFU | 125259 tok/s step 11548/19560 | loss 3.350808 (-0.55z)| norm 0.2864 (+1.84z)| lr 3.15e-04 | 4876.08 ms | 27.7% bf16 MFU | 124372 tok/s step 11549/19560 | loss 3.364297 (-0.20z)| norm 0.2283 (-1.71z)| lr 3.15e-04 | 4427.94 ms | 30.5% bf16 MFU | 124074 tok/s step 11550/19560 | loss 3.390636 (+0.50z)| norm 0.2738 (+1.16z)| lr 3.15e-04 | 4262.04 ms | 31.7% bf16 MFU | 124021 tok/s step 11551/19560 | loss 3.313927 (-1.47z)| norm 0.2342 (-1.38z)| lr 3.15e-04 | 4287.22 ms | 31.5% bf16 MFU | 123934 tok/s step 
11552/19560 | loss 3.316029 (-1.41z)| norm 0.2665 (+0.76z)| lr 3.15e-04 | 4217.32 ms | 32.0% bf16 MFU | 123954 tok/s step 11553/19560 | loss 3.310966 (-1.54z)| norm 0.2612 (+0.42z)| lr 3.15e-04 | 4165.79 ms | 32.4% bf16 MFU | 124049 tok/s step 11554/19560 | loss 3.359324 (-0.26z)| norm 0.2403 (-0.97z)| lr 3.14e-04 | 4249.95 ms | 31.8% bf16 MFU | 124014 tok/s step 11555/19560 | loss 3.376767 (+0.19z)| norm 0.2750 (+1.37z)| lr 3.14e-04 | 4296.59 ms | 31.4% bf16 MFU | 123915 tok/s step 11556/19560 | loss 3.376568 (+0.19z)| norm 0.2289 (-1.72z)| lr 3.14e-04 | 4189.57 ms | 32.2% bf16 MFU | 123976 tok/s step 11557/19560 | loss 3.362556 (-0.17z)| norm 0.2727 (+1.27z)| lr 3.14e-04 | 4170.01 ms | 32.4% bf16 MFU | 124064 tok/s step 11558/19560 | loss 3.389445 (+0.57z)| norm 0.2465 (-0.51z)| lr 3.14e-04 | 4157.88 ms | 32.5% bf16 MFU | 124165 tok/s step 11559/19560 | loss 3.352615 (-0.44z)| norm 0.2576 (+0.28z)| lr 3.14e-04 | 4301.80 ms | 31.4% bf16 MFU | 124051 tok/s step 11560/19560 | loss 3.290725 (-2.10z)| norm 0.2454 (-0.58z)| lr 3.14e-04 | 4212.84 ms | 32.0% bf16 MFU | 124071 tok/s step 11561/19560 | loss 3.351201 (-0.45z)| norm 0.2546 (+0.08z)| lr 3.14e-04 | 4191.05 ms | 32.2% bf16 MFU | 124122 tok/s step 11562/19560 | loss 3.357644 (-0.28z)| norm 0.2488 (-0.33z)| lr 3.14e-04 | 4176.05 ms | 32.3% bf16 MFU | 124193 tok/s step 11563/19560 | loss 3.358635 (-0.27z)| norm 0.2352 (-1.28z)| lr 3.14e-04 | 4170.81 ms | 32.4% bf16 MFU | 124269 tok/s step 11564/19560 | loss 3.386465 (+0.49z)| norm 0.2559 (+0.19z)| lr 3.14e-04 | 4159.15 ms | 32.5% bf16 MFU | 124358 tok/s step 11565/19560 | loss 3.313250 (-1.56z)| norm 0.2390 (-1.00z)| lr 3.14e-04 | 4285.48 ms | 31.5% bf16 MFU | 124257 tok/s step 11566/19560 | loss 3.368514 (+0.03z)| norm 0.2572 (+0.28z)| lr 3.14e-04 | 4239.31 ms | 31.8% bf16 MFU | 124228 tok/s step 11567/19560 | loss 3.341711 (-0.75z)| norm 0.2402 (-0.90z)| lr 3.14e-04 | 4178.12 ms | 32.3% bf16 MFU | 124291 tok/s step 11568/19560 | loss 3.353485 (-0.40z)| norm 
0.2380 (-1.04z)| lr 3.14e-04 | 4393.34 ms | 30.7% bf16 MFU | 124043 tok/s step 11569/19560 | loss 3.377582 (+0.31z)| norm 0.2505 (-0.15z)| lr 3.13e-04 | 4239.96 ms | 31.8% bf16 MFU | 124024 tok/s step 11570/19560 | loss 3.337548 (-0.85z)| norm 0.2402 (-0.88z)| lr 3.13e-04 | 4167.06 ms | 32.4% bf16 MFU | 124114 tok/s step 11571/19560 | loss 3.380622 (+0.40z)| norm 0.2460 (-0.45z)| lr 3.13e-04 | 4175.70 ms | 32.3% bf16 MFU | 124186 tok/s step 11572/19560 | loss 3.422403 (+1.61z)| norm 0.2579 (+0.43z)| lr 3.13e-04 | 4240.69 ms | 31.8% bf16 MFU | 124158 tok/s step 11573/19560 | loss 3.321465 (-1.31z)| norm 0.2675 (+1.12z)| lr 3.13e-04 | 4165.09 ms | 32.4% bf16 MFU | 124244 tok/s step 11574/19560 | loss 3.347242 (-0.55z)| norm 0.2596 (+0.55z)| lr 3.13e-04 | 4292.95 ms | 31.5% bf16 MFU | 124138 tok/s step 11575/19560 | loss 3.410754 (+1.26z)| norm 0.3110 (+3.99z)| lr 3.13e-04 | 4176.87 ms | 32.3% bf16 MFU | 124207 tok/s step 11576/19560 | loss 3.375634 (+0.26z)| norm 0.2886 (+2.39z)| lr 3.13e-04 | 4173.84 ms | 32.3% bf16 MFU | 124278 tok/s step 11577/19560 | loss 3.401525 (+1.02z)| norm 0.2689 (+1.07z)| lr 3.13e-04 | 4215.34 ms | 32.0% bf16 MFU | 124283 tok/s step 11578/19560 | loss 3.383736 (+0.49z)| norm 0.2844 (+2.04z)| lr 3.13e-04 | 4212.65 ms | 32.1% bf16 MFU | 124291 tok/s step 11579/19560 | loss 3.327040 (-1.14z)| norm 0.2672 (+0.91z)| lr 3.13e-04 | 4203.68 ms | 32.1% bf16 MFU | 124313 tok/s step 11580/19560 | loss 3.352135 (-0.42z)| norm 0.2555 (+0.15z)| lr 3.13e-04 | 4172.23 ms | 32.4% bf16 MFU | 124380 tok/s step 11581/19560 | loss 3.344745 (-0.62z)| norm 0.2638 (+0.69z)| lr 3.13e-04 | 4257.88 ms | 31.7% bf16 MFU | 124318 tok/s step 11582/19560 | loss 3.336885 (-0.84z)| norm 0.2401 (-0.85z)| lr 3.13e-04 | 4174.89 ms | 32.3% bf16 MFU | 124381 tok/s step 11583/19560 | loss 3.342256 (-0.68z)| norm 0.2643 (+0.72z)| lr 3.13e-04 | 4166.80 ms | 32.4% bf16 MFU | 124453 tok/s step 11584/19560 | loss 3.375288 (+0.29z)| norm 0.2661 (+0.84z)| lr 3.12e-04 | 4215.40 ms | 
32.0% bf16 MFU | 124449 tok/s step 11585/19560 | loss 3.320216 (-1.31z)| norm 0.2646 (+0.73z)| lr 3.12e-04 | 4162.69 ms | 32.4% bf16 MFU | 124524 tok/s step 11586/19560 | loss 3.339102 (-0.75z)| norm 0.2588 (+0.36z)| lr 3.12e-04 | 4182.12 ms | 32.3% bf16 MFU | 124566 tok/s step 11587/19560 | loss 3.410763 (+1.35z)| norm 0.2721 (+1.21z)| lr 3.12e-04 | 4177.58 ms | 32.3% bf16 MFU | 124613 tok/s step 11588/19560 | loss 3.341581 (-0.68z)| norm 0.2570 (+0.22z)| lr 3.12e-04 | 4173.05 ms | 32.4% bf16 MFU | 124664 tok/s step 11589/19560 | loss 3.350462 (-0.42z)| norm 0.2845 (+1.97z)| lr 3.12e-04 | 4179.64 ms | 32.3% bf16 MFU | 124703 tok/s step 11590/19560 | loss 3.491667 (+3.55z)| norm 0.2988 (+2.79z)| lr 3.12e-04 | 4191.36 ms | 32.2% bf16 MFU | 124722 tok/s step 11591/19560 | loss 3.304305 (-1.71z)| norm 0.2646 (+0.64z)| lr 3.12e-04 | 4204.80 ms | 32.1% bf16 MFU | 124720 tok/s step 11592/19560 | loss 3.356556 (-0.25z)| norm 0.2783 (+1.46z)| lr 3.12e-04 | 4175.71 ms | 32.3% bf16 MFU | 124762 tok/s step 11593/19560 | loss 3.314045 (-1.45z)| norm 0.2489 (-0.37z)| lr 3.12e-04 | 4228.87 ms | 31.9% bf16 MFU | 124723 tok/s step 11594/19560 | loss 3.359083 (-0.18z)| norm 0.2791 (+1.49z)| lr 3.12e-04 | 4180.09 ms | 32.3% bf16 MFU | 124758 tok/s step 11595/19560 | loss 3.341233 (-0.68z)| norm 0.2453 (-0.61z)| lr 3.12e-04 | 4179.53 ms | 32.3% bf16 MFU | 124792 tok/s step 11596/19560 | loss 3.354037 (-0.31z)| norm 0.2674 (+0.76z)| lr 3.12e-04 | 4177.17 ms | 32.3% bf16 MFU | 124828 tok/s step 11597/19560 | loss 3.396232 (+0.97z)| norm 0.2357 (-1.19z)| lr 3.12e-04 | 4180.35 ms | 32.3% bf16 MFU | 124858 tok/s step 11598/19560 | loss 3.351725 (-0.39z)| norm 0.2500 (-0.33z)| lr 3.12e-04 | 4169.97 ms | 32.4% bf16 MFU | 124901 tok/s step 11599/19560 | loss 3.303463 (-1.84z)| norm 0.2558 (+0.05z)| lr 3.12e-04 | 4181.32 ms | 32.3% bf16 MFU | 124926 tok/s step 11600/19560 | loss 3.402650 (+1.18z)| norm 0.2635 (+0.54z)| lr 3.11e-04 | 4187.92 ms | 32.2% bf16 MFU | 124939 tok/s step 11601/19560 
| loss 3.302064 (-1.85z)| norm 0.2546 (-0.02z)| lr 3.11e-04 | 4177.87 ms | 32.3% bf16 MFU | 124967 tok/s step 11602/19560 | loss 3.372368 (+0.27z)| norm 0.2620 (+0.43z)| lr 3.11e-04 | 4175.08 ms | 32.3% bf16 MFU | 124997 tok/s step 11603/19560 | loss 3.313553 (-1.48z)| norm 0.2372 (-1.16z)| lr 3.11e-04 | 4179.41 ms | 32.3% bf16 MFU | 125019 tok/s step 11604/19560 | loss 3.352675 (-0.30z)| norm 0.2511 (-0.27z)| lr 3.11e-04 | 4181.55 ms | 32.3% bf16 MFU | 125037 tok/s step 11605/19560 | loss 3.363563 (+0.02z)| norm 0.2413 (-0.89z)| lr 3.11e-04 | 4185.44 ms | 32.3% bf16 MFU | 125049 tok/s step 11606/19560 | loss 3.344038 (-0.56z)| norm 0.2772 (+1.39z)| lr 3.11e-04 | 4182.22 ms | 32.3% bf16 MFU | 125064 tok/s step 11607/19560 | loss 3.381160 (+0.56z)| norm 0.2475 (-0.50z)| lr 3.11e-04 | 4175.83 ms | 32.3% bf16 MFU | 125089 tok/s step 11608/19560 | loss 3.354802 (-0.24z)| norm 0.2496 (-0.35z)| lr 3.11e-04 | 4192.34 ms | 32.2% bf16 MFU | 125087 tok/s step 11609/19560 | loss 3.375484 (+0.38z)| norm 0.2402 (-0.95z)| lr 3.11e-04 | 4201.19 ms | 32.1% bf16 MFU | 125073 tok/s step 11610/19560 | loss 3.373800 (+0.34z)| norm 0.2454 (-0.61z)| lr 3.11e-04 | 4203.24 ms | 32.1% bf16 MFU | 125056 tok/s step 11611/19560 | loss 3.384838 (+0.69z)| norm 0.2464 (-0.55z)| lr 3.11e-04 | 4182.98 ms | 32.3% bf16 MFU | 125070 tok/s step 11612/19560 | loss 3.362564 (+0.02z)| norm 0.2628 (+0.52z)| lr 3.11e-04 | 4185.86 ms | 32.3% bf16 MFU | 125079 tok/s step 11613/19560 | loss 3.371026 (+0.28z)| norm 0.2387 (-1.05z)| lr 3.11e-04 | 4171.14 ms | 32.4% bf16 MFU | 125110 tok/s step 11614/19560 | loss 3.304903 (-1.74z)| norm 0.2504 (-0.27z)| lr 3.11e-04 | 4168.33 ms | 32.4% bf16 MFU | 125143 tok/s step 11615/19560 | loss 3.364140 (+0.07z)| norm 0.2470 (-0.49z)| lr 3.10e-04 | 4224.06 ms | 32.0% bf16 MFU | 125092 tok/s step 11616/19560 | loss 3.328969 (-1.01z)| norm 0.2453 (-0.59z)| lr 3.10e-04 | 4189.07 ms | 32.2% bf16 MFU | 125095 tok/s step 11617/19560 | loss 3.379341 (+0.55z)| norm 0.2497 (-0.30z)| 
lr 3.10e-04 | 4160.13 ms | 32.5% bf16 MFU | 125142 tok/s step 11618/19560 | loss 3.431825 (+2.12z)| norm 0.2355 (-1.23z)| lr 3.10e-04 | 4190.94 ms | 32.2% bf16 MFU | 125140 tok/s step 11619/19560 | loss 3.367078 (+0.14z)| norm 0.2504 (-0.24z)| lr 3.10e-04 | 4172.94 ms | 32.4% bf16 MFU | 125165 tok/s step 11620/19560 | loss 3.343324 (-0.59z)| norm 0.2396 (-0.94z)| lr 3.10e-04 | 4186.48 ms | 32.3% bf16 MFU | 125168 tok/s step 11621/19560 | loss 3.363119 (+0.02z)| norm 0.2247 (-1.88z)| lr 3.10e-04 | 4182.99 ms | 32.3% bf16 MFU | 125177 tok/s step 11622/19560 | loss 3.309883 (-1.59z)| norm 0.2344 (-1.22z)| lr 3.10e-04 | 4164.25 ms | 32.4% bf16 MFU | 125213 tok/s step 11623/19560 | loss 3.407916 (+1.40z)| norm 0.2619 (+0.56z)| lr 3.10e-04 | 4173.59 ms | 32.4% bf16 MFU | 125233 tok/s step 11624/19560 | loss 3.316008 (-1.37z)| norm 0.2250 (-1.81z)| lr 3.10e-04 | 4167.42 ms | 32.4% bf16 MFU | 125262 tok/s step 11625/19560 | loss 3.337379 (-0.72z)| norm 0.2355 (-1.12z)| lr 3.10e-04 | 4181.21 ms | 32.3% bf16 MFU | 125268 tok/s step 11626/19560 | loss 3.409064 (+1.42z)| norm 0.2536 (+0.03z)| lr 3.10e-04 | 4172.37 ms | 32.4% bf16 MFU | 125288 tok/s step 11627/19560 | loss 3.324535 (-1.11z)| norm 0.2603 (+0.45z)| lr 3.10e-04 | 4180.22 ms | 32.3% bf16 MFU | 125295 tok/s step 11628/19560 | loss 3.388184 (+0.83z)| norm 0.2449 (-0.53z)| lr 3.10e-04 | 4190.26 ms | 32.2% bf16 MFU | 125286 tok/s step 11629/19560 | loss 3.423104 (+1.86z)| norm 0.2480 (-0.34z)| lr 3.10e-04 | 4183.10 ms | 32.3% bf16 MFU | 125288 tok/s step 11630/19560 | loss 3.368952 (+0.21z)| norm 0.2784 (+1.59z)| lr 3.09e-04 | 4193.33 ms | 32.2% bf16 MFU | 125275 tok/s step 11631/19560 | loss 3.412127 (+1.50z)| norm 0.2554 (+0.11z)| lr 3.09e-04 | 4176.90 ms | 32.3% bf16 MFU | 125288 tok/s step 11632/19560 | loss 3.422540 (+1.78z)| norm 0.2824 (+1.81z)| lr 3.09e-04 | 4181.87 ms | 32.3% bf16 MFU | 125292 tok/s step 11633/19560 | loss 3.353465 (-0.27z)| norm 0.2529 (-0.08z)| lr 3.09e-04 | 4170.71 ms | 32.4% bf16 MFU | 
125313 tok/s step 11634/19560 | loss 3.343276 (-0.58z)| norm 0.2566 (+0.15z)| lr 3.09e-04 | 4180.98 ms | 32.3% bf16 MFU | 125317 tok/s step 11635/19560 | loss 3.345140 (-0.52z)| norm 0.2689 (+0.92z)| lr 3.09e-04 | 4172.28 ms | 32.4% bf16 MFU | 125334 tok/s step 11636/19560 | loss 3.346386 (-0.49z)| norm 0.2548 (+0.01z)| lr 3.09e-04 | 4169.13 ms | 32.4% bf16 MFU | 125355 tok/s step 11637/19560 | loss 3.330488 (-0.95z)| norm 0.2739 (+1.24z)| lr 3.09e-04 | 4285.96 ms | 31.5% bf16 MFU | 125204 tok/s step 11638/19560 | loss 3.353146 (-0.28z)| norm 0.2476 (-0.44z)| lr 3.09e-04 | 4171.33 ms | 32.4% bf16 MFU | 125228 tok/s step 11639/19560 | loss 3.309070 (-1.58z)| norm 0.2748 (+1.29z)| lr 3.09e-04 | 4191.01 ms | 32.2% bf16 MFU | 125221 tok/s step 11640/19560 | loss 3.306791 (-1.63z)| norm 0.2554 (+0.03z)| lr 3.09e-04 | 4179.11 ms | 32.3% bf16 MFU | 125233 tok/s step 11641/19560 | loss 3.391550 (+0.89z)| norm 0.2404 (-0.92z)| lr 3.09e-04 | 4299.53 ms | 31.4% bf16 MFU | 125068 tok/s step 11642/19560 | loss 3.331906 (-0.87z)| norm 0.2604 (+0.36z)| lr 3.09e-04 | 4198.86 ms | 32.2% bf16 MFU | 125058 tok/s step 11643/19560 | loss 3.324359 (-1.07z)| norm 0.2468 (-0.53z)| lr 3.09e-04 | 4183.02 ms | 32.3% bf16 MFU | 125072 tok/s step 11644/19560 | loss 3.396876 (+1.07z)| norm 0.2253 (-1.87z)| lr 3.09e-04 | 4231.33 ms | 31.9% bf16 MFU | 125014 tok/s step 11645/19560 | loss 3.293839 (-1.93z)| norm 0.2286 (-1.63z)| lr 3.08e-04 | 4197.09 ms | 32.2% bf16 MFU | 125009 tok/s step 11646/19560 | loss 3.315542 (-1.28z)| norm 0.2393 (-0.95z)| lr 3.08e-04 | 4218.44 ms | 32.0% bf16 MFU | 124973 tok/s step 11647/19560 | loss 3.368073 (+0.26z)| norm 0.2396 (-0.92z)| lr 3.08e-04 | 4219.09 ms | 32.0% bf16 MFU | 124937 tok/s step 11648/19560 | loss 3.304208 (-1.59z)| norm 0.2379 (-1.03z)| lr 3.08e-04 | 4182.41 ms | 32.3% bf16 MFU | 124958 tok/s step 11649/19560 | loss 3.350656 (-0.23z)| norm 0.2257 (-1.76z)| lr 3.08e-04 | 4187.74 ms | 32.2% bf16 MFU | 124970 tok/s step 11650/19560 | loss 3.344402 
(-0.42z)| norm 0.2307 (-1.43z)| lr 3.08e-04 | 4171.66 ms | 32.4% bf16 MFU | 125006 tok/s step 11651/19560 | loss 3.312869 (-1.32z)| norm 0.2273 (-1.61z)| lr 3.08e-04 | 4185.41 ms | 32.3% bf16 MFU | 125019 tok/s step 11652/19560 | loss 3.402856 (+1.27z)| norm 0.2338 (-1.20z)| lr 3.08e-04 | 4183.75 ms | 32.3% bf16 MFU | 125033 tok/s step 11653/19560 | loss 3.372946 (+0.40z)| norm 0.2465 (-0.42z)| lr 3.08e-04 | 4159.71 ms | 32.5% bf16 MFU | 125084 tok/s step 11654/19560 | loss 3.335611 (-0.68z)| norm 0.2300 (-1.42z)| lr 3.08e-04 | 4179.99 ms | 32.3% bf16 MFU | 125101 tok/s step 11655/19560 | loss 3.337473 (-0.62z)| norm 0.2657 (+0.75z)| lr 3.08e-04 | 4164.47 ms | 32.4% bf16 MFU | 125141 tok/s step 11656/19560 | loss 3.387786 (+0.85z)| norm 0.2527 (-0.03z)| lr 3.08e-04 | 4271.79 ms | 31.6% bf16 MFU | 125020 tok/s step 11657/19560 | loss 3.398947 (+1.17z)| norm 0.2380 (-0.91z)| lr 3.08e-04 | 4225.38 ms | 32.0% bf16 MFU | 124973 tok/s step 11658/19560 | loss 3.295700 (-1.79z)| norm 0.2486 (-0.28z)| lr 3.08e-04 | 4169.30 ms | 32.4% bf16 MFU | 125012 tok/s step 11659/19560 | loss 3.383804 (+0.73z)| norm 0.2257 (-1.63z)| lr 3.08e-04 | 4172.10 ms | 32.4% bf16 MFU | 125045 tok/s step 11660/19560 | loss 3.358124 (-0.00z)| norm 0.2508 (-0.12z)| lr 3.07e-04 | 4173.11 ms | 32.4% bf16 MFU | 125074 tok/s step 11661/19560 | loss 3.351998 (-0.18z)| norm 0.2600 (+0.43z)| lr 3.07e-04 | 4312.72 ms | 31.3% bf16 MFU | 124899 tok/s step 11662/19560 | loss 3.320526 (-1.07z)| norm 0.2785 (+1.51z)| lr 3.07e-04 | 4207.78 ms | 32.1% bf16 MFU | 124884 tok/s step 11663/19560 | loss 3.339153 (-0.53z)| norm 0.2501 (-0.19z)| lr 3.07e-04 | 4187.48 ms | 32.2% bf16 MFU | 124900 tok/s step 11664/19560 | loss 3.341882 (-0.44z)| norm 0.2551 (+0.11z)| lr 3.07e-04 | 4165.71 ms | 32.4% bf16 MFU | 124948 tok/s step 11665/19560 | loss 3.371192 (+0.40z)| norm 0.2596 (+0.37z)| lr 3.07e-04 | 4169.58 ms | 32.4% bf16 MFU | 124988 tok/s step 11666/19560 | loss 3.336008 (-0.61z)| norm 0.2454 (-0.48z)| lr 3.07e-04 | 
4187.50 ms | 32.2% bf16 MFU | 124998 tok/s step 11667/19560 | loss 3.323863 (-0.96z)| norm 0.2508 (-0.16z)| lr 3.07e-04 | 4173.05 ms | 32.4% bf16 MFU | 125030 tok/s step 11668/19560 | loss 3.416005 (+1.70z)| norm 0.2536 (+0.00z)| lr 3.07e-04 | 4177.64 ms | 32.3% bf16 MFU | 125054 tok/s step 11669/19560 | loss 3.373461 (+0.47z)| norm 0.2298 (-1.42z)| lr 3.07e-04 | 4175.08 ms | 32.3% bf16 MFU | 125080 tok/s step 11670/19560 | loss 3.330906 (-0.76z)| norm 0.2868 (+1.97z)| lr 3.07e-04 | 4187.24 ms | 32.2% bf16 MFU | 125086 tok/s step 11671/19560 | loss 3.384511 (+0.80z)| norm 0.2460 (-0.45z)| lr 3.07e-04 | 4185.54 ms | 32.3% bf16 MFU | 125095 tok/s step 11672/19560 | loss 3.333351 (-0.68z)| norm 0.2580 (+0.27z)| lr 3.07e-04 | 4172.75 ms | 32.4% bf16 MFU | 125123 tok/s step 11673/19560 | loss 3.345939 (-0.30z)| norm 0.2262 (-1.59z)| lr 3.07e-04 | 4247.99 ms | 31.8% bf16 MFU | 125038 tok/s step 11674/19560 | loss 3.360104 (+0.11z)| norm 0.2549 (+0.12z)| lr 3.07e-04 | 4183.65 ms | 32.3% bf16 MFU | 125052 tok/s step 11675/19560 | loss 3.372842 (+0.49z)| norm 0.2434 (-0.58z)| lr 3.06e-04 | 4210.34 ms | 32.1% bf16 MFU | 125025 tok/s step 11676/19560 | loss 3.343382 (-0.38z)| norm 0.2537 (+0.05z)| lr 3.06e-04 | 4169.89 ms | 32.4% bf16 MFU | 125060 tok/s step 11677/19560 | loss 3.424692 (+1.97z)| norm 0.2500 (-0.18z)| lr 3.06e-04 | 4162.48 ms | 32.4% bf16 MFU | 125105 tok/s step 11678/19560 | loss 3.324487 (-0.92z)| norm 0.2543 (+0.09z)| lr 3.06e-04 | 4281.69 ms | 31.5% bf16 MFU | 124972 tok/s step 11679/19560 | loss 3.335691 (-0.60z)| norm 0.2391 (-0.86z)| lr 3.06e-04 | 4203.00 ms | 32.1% bf16 MFU | 124961 tok/s step 11680/19560 | loss 3.371923 (+0.44z)| norm 0.2724 (+1.21z)| lr 3.06e-04 | 4178.76 ms | 32.3% bf16 MFU | 124986 tok/s step 11681/19560 | loss 3.353524 (-0.11z)| norm 0.2349 (-1.10z)| lr 3.06e-04 | 4203.53 ms | 32.1% bf16 MFU | 124973 tok/s step 11682/19560 | loss 3.327262 (-0.87z)| norm 0.2798 (+1.64z)| lr 3.06e-04 | 4178.16 ms | 32.3% bf16 MFU | 124999 tok/s step 
11683/19560 | loss 3.341590 (-0.44z)| norm 0.2497 (-0.19z)| lr 3.06e-04 | 4183.22 ms | 32.3% bf16 MFU | 125015 tok/s step 11684/19560 | loss 3.329181 (-0.79z)| norm 0.2607 (+0.47z)| lr 3.06e-04 | 4176.42 ms | 32.3% bf16 MFU | 125041 tok/s step 11685/19560 | loss 3.330694 (-0.74z)| norm 0.2500 (-0.18z)| lr 3.06e-04 | 4201.66 ms | 32.1% bf16 MFU | 125028 tok/s step 11686/19560 | loss 3.316673 (-1.13z)| norm 0.2569 (+0.24z)| lr 3.06e-04 | 4213.12 ms | 32.0% bf16 MFU | 124999 tok/s step 11687/19560 | loss 3.432767 (+2.20z)| norm 0.2482 (-0.30z)| lr 3.06e-04 | 4338.77 ms | 31.1% bf16 MFU | 124791 tok/s step 11688/19560 | loss 3.335702 (-0.60z)| norm 0.2561 (+0.19z)| lr 3.06e-04 | 4282.35 ms | 31.5% bf16 MFU | 124673 tok/s step 11689/19560 | loss 3.368405 (+0.34z)| norm 0.2552 (+0.14z)| lr 3.06e-04 | 4301.53 ms | 31.4% bf16 MFU | 124533 tok/s step 11690/19560 | loss 3.366557 (+0.29z)| norm 0.2672 (+0.88z)| lr 3.06e-04 | 4260.27 ms | 31.7% bf16 MFU | 124460 tok/s step 11691/19560 | loss 3.415814 (+1.68z)| norm 0.2516 (-0.11z)| lr 3.05e-04 | 4215.15 ms | 32.0% bf16 MFU | 124456 tok/s step 11692/19560 | loss 3.364579 (+0.22z)| norm 0.2638 (+0.66z)| lr 3.05e-04 | 4206.36 ms | 32.1% bf16 MFU | 124465 tok/s step 11693/19560 | loss 3.396498 (+1.12z)| norm 0.2394 (-0.87z)| lr 3.05e-04 | 4289.74 ms | 31.5% bf16 MFU | 124353 tok/s step 11694/19560 | loss 3.342714 (-0.42z)| norm 0.2700 (+1.03z)| lr 3.05e-04 | 4177.99 ms | 32.3% bf16 MFU | 124410 tok/s step 11695/19560 | loss 3.348254 (-0.26z)| norm 0.2453 (-0.51z)| lr 3.05e-04 | 4212.67 ms | 32.1% bf16 MFU | 124412 tok/s step 11696/19560 | loss 3.359140 (+0.05z)| norm 0.2619 (+0.51z)| lr 3.05e-04 | 4160.86 ms | 32.4% bf16 MFU | 124492 tok/s step 11697/19560 | loss 3.452180 (+2.64z)| norm 0.2442 (-0.59z)| lr 3.05e-04 | 4259.16 ms | 31.7% bf16 MFU | 124422 tok/s step 11698/19560 | loss 3.365497 (+0.20z)| norm 0.2871 (+2.04z)| lr 3.05e-04 | 4343.03 ms | 31.1% bf16 MFU | 124237 tok/s step 11699/19560 | loss 3.375378 (+0.48z)| norm 
0.2320 (-1.34z)| lr 3.05e-04 | 4237.08 ms | 31.9% bf16 MFU | 124212 tok/s step 11700/19560 | loss 3.367199 (+0.27z)| norm 0.2824 (+1.72z)| lr 3.05e-04 | 4176.64 ms | 32.3% bf16 MFU | 124278 tok/s step 11701/19560 | loss 3.454822 (+2.66z)| norm 0.2668 (+0.77z)| lr 3.05e-04 | 4207.87 ms | 32.1% bf16 MFU | 124294 tok/s step 11702/19560 | loss 3.390985 (+0.88z)| norm 0.2399 (-0.85z)| lr 3.05e-04 | 4195.48 ms | 32.2% bf16 MFU | 124327 tok/s step 11703/19560 | loss 3.337217 (-0.60z)| norm 0.2655 (+0.76z)| lr 3.05e-04 | 4177.71 ms | 32.3% bf16 MFU | 124386 tok/s step 11704/19560 | loss 3.341332 (-0.47z)| norm 0.2423 (-0.71z)| lr 3.05e-04 | 4159.24 ms | 32.5% bf16 MFU | 124469 tok/s step 11705/19560 | loss 3.296944 (-1.68z)| norm 0.2574 (+0.28z)| lr 3.05e-04 | 4177.89 ms | 32.3% bf16 MFU | 124520 tok/s step 11706/19560 | loss 3.386785 (+0.81z)| norm 0.2356 (-1.12z)| lr 3.04e-04 | 4180.99 ms | 32.3% bf16 MFU | 124564 tok/s step 11707/19560 | loss 3.372192 (+0.40z)| norm 0.2678 (+0.99z)| lr 3.04e-04 | 4176.35 ms | 32.3% bf16 MFU | 124613 tok/s step 11708/19560 | loss 3.317659 (-1.10z)| norm 0.2425 (-0.66z)| lr 3.04e-04 | 4204.79 ms | 32.1% bf16 MFU | 124616 tok/s step 11709/19560 | loss 3.334538 (-0.64z)| norm 0.2397 (-0.83z)| lr 3.04e-04 | 4168.58 ms | 32.4% bf16 MFU | 124674 tok/s step 11710/19560 | loss 3.416691 (+1.60z)| norm 0.2454 (-0.46z)| lr 3.04e-04 | 4161.30 ms | 32.4% bf16 MFU | 124740 tok/s step 11711/19560 | loss 3.375313 (+0.46z)| norm 0.2636 (+0.73z)| lr 3.04e-04 | 4170.02 ms | 32.4% bf16 MFU | 124789 tok/s step 11712/19560 | loss 3.351065 (-0.20z)| norm 0.2684 (+1.04z)| lr 3.04e-04 | 4178.41 ms | 32.3% bf16 MFU | 124824 tok/s step 11713/19560 | loss 3.328451 (-0.82z)| norm 0.2669 (+0.94z)| lr 3.04e-04 | 4170.58 ms | 32.4% bf16 MFU | 124868 tok/s step 11714/19560 | loss 3.349254 (-0.25z)| norm 0.2575 (+0.33z)| lr 3.04e-04 | 4177.93 ms | 32.3% bf16 MFU | 124899 tok/s step 11715/19560 | loss 3.303318 (-1.49z)| norm 0.2702 (+1.16z)| lr 3.04e-04 | 4288.51 ms | 
31.5% bf16 MFU | 124767 tok/s step 11716/19560 | loss 3.309242 (-1.31z)| norm 0.2527 (+0.02z)| lr 3.04e-04 | 4179.26 ms | 32.3% bf16 MFU | 124801 tok/s step 11717/19560 | loss 3.351890 (-0.15z)| norm 0.2485 (-0.24z)| lr 3.04e-04 | 4176.36 ms | 32.3% bf16 MFU | 124838 tok/s step 11718/19560 | loss 3.377972 (+0.62z)| norm 0.2438 (-0.55z)| lr 3.04e-04 | 4157.91 ms | 32.5% bf16 MFU | 124901 tok/s step 11719/19560 | loss 3.358501 (+0.05z)| norm 0.2690 (+1.19z)| lr 3.04e-04 | 4183.46 ms | 32.3% bf16 MFU | 124922 tok/s step 11720/19560 | loss 3.411708 (+1.57z)| norm 0.2618 (+0.71z)| lr 3.04e-04 | 4175.50 ms | 32.3% bf16 MFU | 124954 tok/s step 11721/19560 | loss 3.455645 (+2.73z)| norm 0.2659 (+0.98z)| lr 3.03e-04 | 4174.87 ms | 32.3% bf16 MFU | 124985 tok/s step 11722/19560 | loss 3.352479 (-0.16z)| norm 0.2446 (-0.49z)| lr 3.03e-04 | 4195.77 ms | 32.2% bf16 MFU | 124984 tok/s step 11723/19560 | loss 3.376590 (+0.51z)| norm 0.2531 (+0.11z)| lr 3.03e-04 | 4180.04 ms | 32.3% bf16 MFU | 125006 tok/s step 11724/19560 | loss 3.341476 (-0.48z)| norm 0.2667 (+1.07z)| lr 3.03e-04 | 4179.72 ms | 32.3% bf16 MFU | 125028 tok/s step 11725/19560 | loss 3.347803 (-0.29z)| norm 0.2532 (+0.10z)| lr 3.03e-04 | 4199.30 ms | 32.2% bf16 MFU | 125019 tok/s step 11726/19560 | loss 3.343793 (-0.40z)| norm 0.3921 (+7.43z)| lr 3.03e-04 | 4183.56 ms | 32.3% bf16 MFU | 125034 tok/s step 11727/19560 | loss 3.368153 (+0.27z)| norm 0.2721 (+1.02z)| lr 3.03e-04 | 4270.08 ms | 31.6% bf16 MFU | 124921 tok/s step 11728/19560 | loss 3.341107 (-0.48z)| norm 0.3046 (+2.65z)| lr 3.03e-04 | 4244.56 ms | 31.8% bf16 MFU | 124851 tok/s step 11729/19560 | loss 3.330491 (-0.80z)| norm 0.2685 (+0.78z)| lr 3.03e-04 | 4166.95 ms | 32.4% bf16 MFU | 124900 tok/s step 11730/19560 | loss 3.369343 (+0.32z)| norm 0.2811 (+1.41z)| lr 3.03e-04 | 4190.75 ms | 32.2% bf16 MFU | 124910 tok/s step 11731/19560 | loss 3.369502 (+0.31z)| norm 0.2686 (+0.76z)| lr 3.03e-04 | 4243.21 ms | 31.8% bf16 MFU | 124842 tok/s step 11732/19560 
| loss 3.340902 (-0.51z)| norm 0.2630 (+0.47z)| lr 3.03e-04 | 4213.17 ms | 32.0% bf16 MFU | 124822 tok/s step 11733/19560 | loss 3.323908 (-0.99z)| norm 0.2596 (+0.28z)| lr 3.03e-04 | 4178.08 ms | 32.3% bf16 MFU | 124855 tok/s step 11734/19560 | loss 3.318559 (-1.14z)| norm 0.2516 (-0.12z)| lr 3.03e-04 | 4174.68 ms | 32.3% bf16 MFU | 124892 tok/s step 11735/19560 | loss 3.359260 (+0.04z)| norm 0.2571 (+0.17z)| lr 3.03e-04 | 4168.68 ms | 32.4% bf16 MFU | 124936 tok/s step 11736/19560 | loss 3.405909 (+1.36z)| norm 0.2640 (+0.51z)| lr 3.02e-04 | 4170.08 ms | 32.4% bf16 MFU | 124975 tok/s step 11737/19560 | loss 3.339553 (-0.53z)| norm 0.2553 (+0.06z)| lr 3.02e-04 | 4220.88 ms | 32.0% bf16 MFU | 124937 tok/s step 11738/19560 | loss 3.393095 (+0.99z)| norm 0.2756 (+1.09z)| lr 3.02e-04 | 4620.60 ms | 29.2% bf16 MFU | 124364 tok/s step 11739/19560 | loss 3.402253 (+1.24z)| norm 0.2401 (-0.73z)| lr 3.02e-04 | 4876.85 ms | 27.7% bf16 MFU | 123521 tok/s step 11740/19560 | loss 3.335136 (-0.65z)| norm 0.2628 (+0.43z)| lr 3.02e-04 | 4528.83 ms | 29.8% bf16 MFU | 123133 tok/s step 11741/19560 | loss 3.316025 (-1.17z)| norm 0.2559 (+0.08z)| lr 3.02e-04 | 4449.56 ms | 30.3% bf16 MFU | 122868 tok/s step 11742/19560 | loss 3.314531 (-1.22z)| norm 0.2370 (-0.89z)| lr 3.02e-04 | 4451.63 ms | 30.3% bf16 MFU | 122613 tok/s step 11743/19560 | loss 3.351714 (-0.17z)| norm 0.2683 (+0.71z)| lr 3.02e-04 | 4556.79 ms | 29.6% bf16 MFU | 122235 tok/s step 11744/19560 | loss 3.367061 (+0.26z)| norm 0.2491 (-0.28z)| lr 3.02e-04 | 4303.65 ms | 31.4% bf16 MFU | 122215 tok/s step 11745/19560 | loss 3.377052 (+0.54z)| norm 0.2552 (+0.04z)| lr 3.02e-04 | 4327.57 ms | 31.2% bf16 MFU | 122162 tok/s step 11746/19560 | loss 3.424545 (+1.89z)| norm 0.2405 (-0.72z)| lr 3.02e-04 | 4351.26 ms | 31.0% bf16 MFU | 122078 tok/s step 11747/19560 | loss 3.386284 (+0.80z)| norm 0.2745 (+1.01z)| lr 3.02e-04 | 4335.08 ms | 31.1% bf16 MFU | 122021 tok/s step 11748/19560 | loss 3.420202 (+1.72z)| norm 0.2381 (-0.85z)| 
lr 3.02e-04 | 4164.26 ms | 32.4% bf16 MFU | 122215 tok/s step 11749/19560 | loss 3.425858 (+1.84z)| norm 0.2612 (+0.32z)| lr 3.02e-04 | 4376.68 ms | 30.8% bf16 MFU | 122094 tok/s step 11750/19560 | loss 3.284888 (-2.03z)| norm 0.2417 (-0.70z)| lr 3.02e-04 | 4237.88 ms | 31.9% bf16 MFU | 122175 tok/s val loss 3.343313 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating 
HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 
evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2954/10042 = 0.294165 step 11751/19560 | loss 3.346685 (-0.33z)| norm 0.2571 (+0.11z)| lr 3.01e-04 | 4316.24 ms | 31.3% bf16 MFU | 122140 tok/s step 11752/19560 | loss 3.364570 (+0.16z)| norm 0.2622 (+0.35z)| lr 3.01e-04 | 4292.15 ms | 31.5% bf16 MFU | 122140 tok/s step 11753/19560 | loss 3.364224 (+0.14z)| norm 0.2494 (-0.32z)| lr 3.01e-04 | 4360.23 ms | 31.0% bf16 MFU | 122045 tok/s step 11754/19560 | loss 3.409299 (+1.39z)| norm 0.2860 (+1.57z)| lr 3.01e-04 | 4200.45 ms | 32.1% bf16 MFU | 122184 tok/s step 11755/19560 | loss 3.358468 (-0.02z)| norm 0.2466 (-0.47z)| lr 3.01e-04 | 4254.75 ms | 31.7% bf16 MFU | 122236 tok/s step 11756/19560 | loss 3.375247 (+0.45z)| norm 0.2941 (+1.95z)| lr 3.01e-04 | 4310.77 ms | 31.3% bf16 MFU | 122205 tok/s step 11757/19560 | loss 3.373129 (+0.40z)| norm 0.2623 (+0.32z)| lr 3.01e-04 | 4315.36 ms | 31.3% bf16 MFU | 122170 tok/s step 11758/19560 | loss 3.351111 (-0.21z)| norm 0.2556 (-0.02z)| lr 3.01e-04 | 7094.90 ms | 19.0% bf16 MFU | 119756 tok/s step 11759/19560 | loss 3.327604 (-0.87z)| norm 0.2499 (-0.31z)| lr 3.01e-04 | 4158.93 ms | 32.5% bf16 MFU | 120071 tok/s step 11760/19560 | loss 3.409216 (+1.47z)| norm 0.2624 (+0.34z)| lr 3.01e-04 | 4208.71 ms | 32.1% bf16 MFU | 120296 tok/s step 11761/19560 | loss 3.437967 (+2.22z)| norm 0.2515 (-0.22z)| lr 3.01e-04 | 4171.39 ms | 32.4% bf16 MFU | 120566 tok/s step 11762/19560 | loss 3.374143 (+0.43z)| norm 0.2635 (+0.40z)| lr 3.01e-04 | 4162.52 ms | 32.4% bf16 MFU | 120835 tok/s step 11763/19560 | loss 3.356073 (-0.08z)| norm 0.2674 (+0.60z)| lr 3.01e-04 | 4170.22 ms | 32.4% bf16 MFU | 121080 tok/s step 11764/19560 | loss 3.325882 (-0.92z)| norm 0.2473 (-0.43z)| lr 3.01e-04 | 4199.09 ms | 32.2% bf16 MFU | 121269 tok/s 
step 11765/19560 | loss 3.377373 (+0.51z)| norm 0.2459 (-0.50z)| lr 3.01e-04 | 4214.04 ms | 32.0% bf16 MFU | 121426 tok/s step 11766/19560 | loss 3.400795 (+1.15z)| norm 0.2585 (+0.15z)| lr 3.01e-04 | 4180.82 ms | 32.3% bf16 MFU | 121625 tok/s step 11767/19560 | loss 3.369407 (+0.26z)| norm 0.2602 (+0.25z)| lr 3.00e-04 | 4187.73 ms | 32.2% bf16 MFU | 121803 tok/s step 11768/19560 | loss 3.324456 (-1.00z)| norm 0.2455 (-0.51z)| lr 3.00e-04 | 4175.68 ms | 32.3% bf16 MFU | 121991 tok/s step 11769/19560 | loss 3.377577 (+0.50z)| norm 0.2635 (+0.41z)| lr 3.00e-04 | 4186.23 ms | 32.3% bf16 MFU | 122154 tok/s step 11770/19560 | loss 3.369264 (+0.25z)| norm 0.2673 (+0.61z)| lr 3.00e-04 | 4173.91 ms | 32.3% bf16 MFU | 122326 tok/s step 11771/19560 | loss 3.324816 (-1.00z)| norm 0.2131 (-2.16z)| lr 3.00e-04 | 4175.79 ms | 32.3% bf16 MFU | 122488 tok/s step 11772/19560 | loss 3.326222 (-0.95z)| norm 0.2376 (-0.92z)| lr 3.00e-04 | 4174.62 ms | 32.3% bf16 MFU | 122643 tok/s step 11773/19560 | loss 3.337942 (-0.64z)| norm 0.2296 (-1.33z)| lr 3.00e-04 | 4185.67 ms | 32.3% bf16 MFU | 122774 tok/s step 11774/19560 | loss 3.362752 (+0.07z)| norm 0.2305 (-1.28z)| lr 3.00e-04 | 4170.65 ms | 32.4% bf16 MFU | 122920 tok/s step 11775/19560 | loss 3.371289 (+0.31z)| norm 0.2482 (-0.37z)| lr 3.00e-04 | 4188.35 ms | 32.2% bf16 MFU | 123033 tok/s step 11776/19560 | loss 3.425853 (+1.85z)| norm 0.3852 (+5.71z)| lr 3.00e-04 | 4174.21 ms | 32.3% bf16 MFU | 123162 tok/s step 11777/19560 | loss 3.322907 (-1.10z)| norm 0.2468 (-0.45z)| lr 3.00e-04 | 4194.53 ms | 32.2% bf16 MFU | 123253 tok/s step 11778/19560 | loss 3.364207 (+0.08z)| norm 0.2540 (-0.13z)| lr 3.00e-04 | 4183.78 ms | 32.3% bf16 MFU | 123356 tok/s step 11779/19560 | loss 3.376316 (+0.42z)| norm 0.2426 (-0.66z)| lr 3.00e-04 | 4176.37 ms | 32.3% bf16 MFU | 123465 tok/s step 11780/19560 | loss 3.363832 (+0.07z)| norm 0.2232 (-1.52z)| lr 3.00e-04 | 4221.22 ms | 32.0% bf16 MFU | 123502 tok/s step 11781/19560 | loss 3.428796 (+1.91z)| norm 
0.2458 (-0.50z)| lr 3.00e-04 | 4182.37 ms | 32.3% bf16 MFU | 123595 tok/s step 11782/19560 | loss 3.368858 (+0.19z)| norm 0.2462 (-0.49z)| lr 2.99e-04 | 4182.73 ms | 32.3% bf16 MFU | 123682 tok/s step 11783/19560 | loss 3.365290 (+0.08z)| norm 0.2511 (-0.27z)| lr 2.99e-04 | 4174.01 ms | 32.3% bf16 MFU | 123779 tok/s step 11784/19560 | loss 3.346488 (-0.45z)| norm 0.2455 (-0.52z)| lr 2.99e-04 | 4179.55 ms | 32.3% bf16 MFU | 123862 tok/s step 11785/19560 | loss 3.352478 (-0.27z)| norm 0.2394 (-0.79z)| lr 2.99e-04 | 4192.63 ms | 32.2% bf16 MFU | 123921 tok/s step 11786/19560 | loss 3.461069 (+2.78z)| norm 0.2671 (+0.45z)| lr 2.99e-04 | 4188.48 ms | 32.2% bf16 MFU | 123984 tok/s step 11787/19560 | loss 3.435679 (+2.02z)| norm 0.2339 (-1.05z)| lr 2.99e-04 | 4189.69 ms | 32.2% bf16 MFU | 124042 tok/s step 11788/19560 | loss 3.380896 (+0.48z)| norm 0.2540 (-0.14z)| lr 2.99e-04 | 4193.29 ms | 32.2% bf16 MFU | 124091 tok/s step 11789/19560 | loss 3.324891 (-1.07z)| norm 0.2630 (+0.26z)| lr 2.99e-04 | 4237.98 ms | 31.9% bf16 MFU | 124072 tok/s step 11790/19560 | loss 3.305378 (-1.61z)| norm 0.2685 (+0.52z)| lr 2.99e-04 | 4177.53 ms | 32.3% bf16 MFU | 124144 tok/s step 11791/19560 | loss 3.361196 (-0.06z)| norm 0.2701 (+0.58z)| lr 2.99e-04 | 4185.21 ms | 32.3% bf16 MFU | 124200 tok/s step 11792/19560 | loss 3.431294 (+1.84z)| norm 0.2427 (-0.66z)| lr 2.99e-04 | 4198.11 ms | 32.2% bf16 MFU | 124234 tok/s step 11793/19560 | loss 3.343297 (-0.57z)| norm 0.2646 (+0.33z)| lr 2.99e-04 | 4200.37 ms | 32.1% bf16 MFU | 124264 tok/s step 11794/19560 | loss 3.378779 (+0.40z)| norm 0.2552 (-0.10z)| lr 2.99e-04 | 4190.94 ms | 32.2% bf16 MFU | 124305 tok/s step 11795/19560 | loss 3.341053 (-0.64z)| norm 0.2508 (-0.30z)| lr 2.99e-04 | 4170.14 ms | 32.4% bf16 MFU | 124376 tok/s step 11796/19560 | loss 3.400797 (+1.01z)| norm 0.2597 (+0.10z)| lr 2.99e-04 | 4191.83 ms | 32.2% bf16 MFU | 124411 tok/s step 11797/19560 | loss 3.367839 (+0.10z)| norm 0.2790 (+0.97z)| lr 2.98e-04 | 4175.34 ms | 
32.3% bf16 MFU | 124469 tok/s step 11798/19560 | loss 3.389222 (+0.68z)| norm 0.2434 (-0.64z)| lr 2.98e-04 | 4177.80 ms | 32.3% bf16 MFU | 124520 tok/s step 11799/19560 | loss 3.421748 (+1.56z)| norm 0.2668 (+0.42z)| lr 2.98e-04 | 4175.13 ms | 32.3% bf16 MFU | 124573 tok/s step 11800/19560 | loss 3.347683 (-0.48z)| norm 0.2580 (+0.02z)| lr 2.98e-04 | 4191.80 ms | 32.2% bf16 MFU | 124598 tok/s step 11801/19560 | loss 3.369174 (+0.11z)| norm 0.2353 (-1.03z)| lr 2.98e-04 | 4183.30 ms | 32.3% bf16 MFU | 124635 tok/s step 11802/19560 | loss 3.344535 (-0.57z)| norm 0.2652 (+0.35z)| lr 2.98e-04 | 4168.90 ms | 32.4% bf16 MFU | 124691 tok/s step 11803/19560 | loss 3.364904 (-0.01z)| norm 0.2368 (-0.96z)| lr 2.98e-04 | 4197.14 ms | 32.2% bf16 MFU | 124702 tok/s step 11804/19560 | loss 3.362434 (-0.08z)| norm 0.2988 (+1.84z)| lr 2.98e-04 | 4182.24 ms | 32.3% bf16 MFU | 124735 tok/s step 11805/19560 | loss 3.398772 (+0.94z)| norm 0.2714 (+0.60z)| lr 2.98e-04 | 4169.95 ms | 32.4% bf16 MFU | 124785 tok/s step 11806/19560 | loss 3.309565 (-1.53z)| norm 0.2570 (-0.05z)| lr 2.98e-04 | 4186.56 ms | 32.3% bf16 MFU | 124807 tok/s step 11807/19560 | loss 3.369518 (+0.12z)| norm 0.2714 (+0.58z)| lr 2.98e-04 | 4182.92 ms | 32.3% bf16 MFU | 124834 tok/s step 11808/19560 | loss 3.367488 (+0.06z)| norm 0.2433 (-0.68z)| lr 2.98e-04 | 4182.70 ms | 32.3% bf16 MFU | 124859 tok/s step 11809/19560 | loss 3.300877 (-1.75z)| norm 0.2572 (-0.05z)| lr 2.98e-04 | 4180.47 ms | 32.3% bf16 MFU | 124887 tok/s step 11810/19560 | loss 3.430606 (+1.77z)| norm 0.2503 (-0.36z)| lr 2.98e-04 | 4177.31 ms | 32.3% bf16 MFU | 124918 tok/s step 11811/19560 | loss 3.384485 (+0.50z)| norm 0.2641 (+0.27z)| lr 2.98e-04 | 4179.25 ms | 32.3% bf16 MFU | 124945 tok/s step 11812/19560 | loss 3.365173 (-0.03z)| norm 0.2988 (+1.81z)| lr 2.97e-04 | 4176.97 ms | 32.3% bf16 MFU | 124973 tok/s step 11813/19560 | loss 3.325502 (-1.11z)| norm 0.2434 (-0.68z)| lr 2.97e-04 | 4185.78 ms | 32.3% bf16 MFU | 124988 tok/s step 11814/19560 
| loss 3.418159 (+1.39z)| norm 0.2604 (+0.08z)| lr 2.97e-04 | 4177.05 ms | 32.3% bf16 MFU | 125014 tok/s step 11815/19560 | loss 3.312106 (-1.48z)| norm 0.2386 (-0.89z)| lr 2.97e-04 | 4180.23 ms | 32.3% bf16 MFU | 125034 tok/s step 11816/19560 | loss 3.401734 (+0.96z)| norm 0.2529 (-0.25z)| lr 2.97e-04 | 4171.92 ms | 32.4% bf16 MFU | 125066 tok/s step 11817/19560 | loss 3.360456 (-0.16z)| norm 0.2539 (-0.21z)| lr 2.97e-04 | 4175.78 ms | 32.3% bf16 MFU | 125091 tok/s step 11818/19560 | loss 3.347804 (-0.51z)| norm 0.2626 (+0.19z)| lr 2.97e-04 | 4204.87 ms | 32.1% bf16 MFU | 125070 tok/s step 11819/19560 | loss 3.377947 (+0.33z)| norm 0.2538 (-0.21z)| lr 2.97e-04 | 4173.97 ms | 32.3% bf16 MFU | 125097 tok/s step 11820/19560 | loss 3.352717 (-0.36z)| norm 0.2545 (-0.17z)| lr 2.97e-04 | 4180.14 ms | 32.3% bf16 MFU | 125114 tok/s step 11821/19560 | loss 3.325033 (-1.11z)| norm 0.2453 (-0.59z)| lr 2.97e-04 | 4181.59 ms | 32.3% bf16 MFU | 125127 tok/s step 11822/19560 | loss 3.299459 (-1.78z)| norm 0.2416 (-0.74z)| lr 2.97e-04 | 4179.69 ms | 32.3% bf16 MFU | 125142 tok/s step 11823/19560 | loss 3.377538 (+0.33z)| norm 0.2300 (-1.25z)| lr 2.97e-04 | 4166.57 ms | 32.4% bf16 MFU | 125177 tok/s step 11824/19560 | loss 3.347591 (-0.48z)| norm 0.2561 (-0.09z)| lr 2.97e-04 | 4181.26 ms | 32.3% bf16 MFU | 125188 tok/s step 11825/19560 | loss 3.366440 (+0.05z)| norm 0.2439 (-0.63z)| lr 2.97e-04 | 4164.57 ms | 32.4% bf16 MFU | 125223 tok/s step 11826/19560 | loss 3.416007 (+1.41z)| norm 0.2325 (-1.12z)| lr 2.97e-04 | 4192.15 ms | 32.2% bf16 MFU | 125215 tok/s step 11827/19560 | loss 3.424130 (+1.60z)| norm 0.2707 (+0.57z)| lr 2.97e-04 | 4168.26 ms | 32.4% bf16 MFU | 125243 tok/s step 11828/19560 | loss 3.318317 (-1.26z)| norm 0.2330 (-1.10z)| lr 2.96e-04 | 4209.71 ms | 32.1% bf16 MFU | 125208 tok/s step 11829/19560 | loss 3.470280 (+2.82z)| norm 0.2507 (-0.30z)| lr 2.96e-04 | 4189.78 ms | 32.2% bf16 MFU | 125204 tok/s step 11830/19560 | loss 3.407407 (+1.13z)| norm 0.2590 (+0.07z)| 
lr 2.96e-04 | 4169.04 ms | 32.4% bf16 MFU | 125232 tok/s step 11831/19560 | loss 3.302462 (-1.66z)| norm 0.2448 (-0.56z)| lr 2.96e-04 | 4185.59 ms | 32.3% bf16 MFU | 125234 tok/s step 11832/19560 | loss 3.306046 (-1.54z)| norm 0.2721 (+0.65z)| lr 2.96e-04 | 4173.42 ms | 32.4% bf16 MFU | 125253 tok/s step 11833/19560 | loss 3.328669 (-0.96z)| norm 0.2517 (-0.26z)| lr 2.96e-04 | 4183.52 ms | 32.3% bf16 MFU | 125257 tok/s step 11834/19560 | loss 3.320799 (-1.15z)| norm 0.2609 (+0.14z)| lr 2.96e-04 | 4166.72 ms | 32.4% bf16 MFU | 125285 tok/s step 11835/19560 | loss 3.420255 (+1.46z)| norm 0.2675 (+0.44z)| lr 2.96e-04 | 4188.35 ms | 32.2% bf16 MFU | 125280 tok/s step 11836/19560 | loss 3.284212 (-2.08z)| norm 0.2531 (-0.22z)| lr 2.96e-04 | 4162.40 ms | 32.4% bf16 MFU | 125314 tok/s step 11837/19560 | loss 3.338883 (-0.66z)| norm 0.2496 (-0.38z)| lr 2.96e-04 | 4184.28 ms | 32.3% bf16 MFU | 125313 tok/s step 11838/19560 | loss 3.342245 (-0.57z)| norm 0.2632 (+0.23z)| lr 2.96e-04 | 4180.82 ms | 32.3% bf16 MFU | 125317 tok/s step 11839/19560 | loss 3.348615 (-0.39z)| norm 0.2764 (+0.83z)| lr 2.96e-04 | 4167.76 ms | 32.4% bf16 MFU | 125341 tok/s step 11840/19560 | loss 3.346960 (-0.44z)| norm 0.2490 (-0.41z)| lr 2.96e-04 | 4180.44 ms | 32.3% bf16 MFU | 125345 tok/s step 11841/19560 | loss 3.459237 (+2.42z)| norm 0.2762 (+0.82z)| lr 2.96e-04 | 4187.71 ms | 32.2% bf16 MFU | 125338 tok/s step 11842/19560 | loss 3.323805 (-1.04z)| norm 0.2416 (-0.74z)| lr 2.96e-04 | 4169.93 ms | 32.4% bf16 MFU | 125357 tok/s step 11843/19560 | loss 3.393133 (+0.72z)| norm 0.2512 (-0.30z)| lr 2.95e-04 | 4177.90 ms | 32.3% bf16 MFU | 125364 tok/s step 11844/19560 | loss 3.392287 (+0.68z)| norm 0.2321 (-1.15z)| lr 2.95e-04 | 4200.23 ms | 32.1% bf16 MFU | 125337 tok/s step 11845/19560 | loss 3.348562 (-0.45z)| norm 0.2452 (-0.56z)| lr 2.95e-04 | 4183.31 ms | 32.3% bf16 MFU | 125337 tok/s step 11846/19560 | loss 3.370011 (+0.11z)| norm 0.2571 (-0.03z)| lr 2.95e-04 | 4168.29 ms | 32.4% bf16 MFU | 
125359 tok/s step 11847/19560 | loss 3.400849 (+0.90z)| norm 0.2590 (+0.06z)| lr 2.95e-04 | 4180.47 ms | 32.3% bf16 MFU | 125361 tok/s step 11848/19560 | loss 3.348008 (-0.45z)| norm 0.3014 (+1.93z)| lr 2.95e-04 | 4182.81 ms | 32.3% bf16 MFU | 125361 tok/s step 11849/19560 | loss 3.307128 (-1.50z)| norm 0.2336 (-1.06z)| lr 2.95e-04 | 4185.27 ms | 32.3% bf16 MFU | 125356 tok/s step 11850/19560 | loss 3.332100 (-0.84z)| norm 0.2701 (+0.54z)| lr 2.95e-04 | 4277.65 ms | 31.6% bf16 MFU | 125216 tok/s step 11851/19560 | loss 3.370061 (+0.15z)| norm 0.2450 (-0.56z)| lr 2.95e-04 | 4183.48 ms | 32.3% bf16 MFU | 125222 tok/s step 11852/19560 | loss 3.373148 (+0.23z)| norm 0.2555 (-0.10z)| lr 2.95e-04 | 4179.86 ms | 32.3% bf16 MFU | 125232 tok/s step 11853/19560 | loss 3.497581 (+3.31z)| norm 0.2721 (+0.63z)| lr 2.95e-04 | 4173.11 ms | 32.4% bf16 MFU | 125252 tok/s step 11854/19560 | loss 3.323564 (-1.05z)| norm 0.2613 (+0.23z)| lr 2.95e-04 | 4168.24 ms | 32.4% bf16 MFU | 125279 tok/s step 11855/19560 | loss 3.420276 (+1.35z)| norm 0.2471 (-0.50z)| lr 2.95e-04 | 4170.80 ms | 32.4% bf16 MFU | 125300 tok/s step 11856/19560 | loss 3.350536 (-0.38z)| norm 0.2588 (+0.13z)| lr 2.95e-04 | 4196.04 ms | 32.2% bf16 MFU | 125283 tok/s step 11857/19560 | loss 3.333310 (-0.81z)| norm 0.2528 (-0.18z)| lr 2.95e-04 | 4176.53 ms | 32.3% bf16 MFU | 125295 tok/s step 11858/19560 | loss 3.354810 (-0.27z)| norm 0.2480 (-0.42z)| lr 2.94e-04 | 4186.58 ms | 32.3% bf16 MFU | 125292 tok/s step 11859/19560 | loss 3.343406 (-0.55z)| norm 0.2329 (-1.21z)| lr 2.94e-04 | 4197.91 ms | 32.2% bf16 MFU | 125272 tok/s step 11860/19560 | loss 3.354135 (-0.29z)| norm 0.2366 (-1.01z)| lr 2.94e-04 | 4175.14 ms | 32.3% bf16 MFU | 125287 tok/s step 11861/19560 | loss 3.345000 (-0.52z)| norm 0.2350 (-1.08z)| lr 2.94e-04 | 4172.03 ms | 32.4% bf16 MFU | 125306 tok/s step 11862/19560 | loss 3.444254 (+1.91z)| norm 0.2517 (-0.19z)| lr 2.94e-04 | 4177.38 ms | 32.3% bf16 MFU | 125316 tok/s step 11863/19560 | loss 3.402485 
(+0.87z)| norm 0.2462 (-0.48z)| lr 2.94e-04 | 4176.03 ms | 32.3% bf16 MFU | 125328 tok/s step 11864/19560 | loss 3.320456 (-1.13z)| norm 0.2506 (-0.24z)| lr 2.94e-04 | 4172.57 ms | 32.4% bf16 MFU | 125344 tok/s step 11865/19560 | loss 3.323808 (-1.05z)| norm 0.2575 (+0.13z)| lr 2.94e-04 | 4177.84 ms | 32.3% bf16 MFU | 125351 tok/s step 11866/19560 | loss 3.359795 (-0.16z)| norm 0.2398 (-0.80z)| lr 2.94e-04 | 4175.38 ms | 32.3% bf16 MFU | 125362 tok/s step 11867/19560 | loss 3.393429 (+0.67z)| norm 0.2712 (+0.86z)| lr 2.94e-04 | 4184.41 ms | 32.3% bf16 MFU | 125359 tok/s step 11868/19560 | loss 3.383458 (+0.42z)| norm 0.2303 (-1.29z)| lr 2.94e-04 | 4168.87 ms | 32.4% bf16 MFU | 125379 tok/s step 11869/19560 | loss 3.305994 (-1.48z)| norm 0.2586 (+0.20z)| lr 2.94e-04 | 4170.42 ms | 32.4% bf16 MFU | 125396 tok/s step 11870/19560 | loss 3.350837 (-0.39z)| norm 0.2406 (-0.75z)| lr 2.94e-04 | 4172.32 ms | 32.4% bf16 MFU | 125409 tok/s step 11871/19560 | loss 3.395624 (+0.70z)| norm 0.2649 (+0.53z)| lr 2.94e-04 | 4181.86 ms | 32.3% bf16 MFU | 125407 tok/s step 11872/19560 | loss 3.348634 (-0.45z)| norm 0.2361 (-0.98z)| lr 2.94e-04 | 4181.43 ms | 32.3% bf16 MFU | 125406 tok/s step 11873/19560 | loss 3.329038 (-0.92z)| norm 0.2600 (+0.28z)| lr 2.94e-04 | 4185.53 ms | 32.3% bf16 MFU | 125399 tok/s step 11874/19560 | loss 3.331658 (-0.84z)| norm 0.2445 (-0.54z)| lr 2.93e-04 | 4173.29 ms | 32.4% bf16 MFU | 125410 tok/s step 11875/19560 | loss 3.320692 (-1.10z)| norm 0.2501 (-0.24z)| lr 2.93e-04 | 4180.52 ms | 32.3% bf16 MFU | 125410 tok/s step 11876/19560 | loss 3.378802 (+0.34z)| norm 0.2310 (-1.24z)| lr 2.93e-04 | 4173.04 ms | 32.4% bf16 MFU | 125422 tok/s step 11877/19560 | loss 3.292987 (-1.75z)| norm 0.2450 (-0.50z)| lr 2.93e-04 | 4188.02 ms | 32.2% bf16 MFU | 125410 tok/s step 11878/19560 | loss 3.373446 (+0.22z)| norm 0.2410 (-0.71z)| lr 2.93e-04 | 4172.55 ms | 32.4% bf16 MFU | 125422 tok/s step 11879/19560 | loss 3.377889 (+0.33z)| norm 0.2753 (+1.09z)| lr 2.93e-04 | 
4170.69 ms | 32.4% bf16 MFU | 125436 tok/s step 11880/19560 | loss 3.391479 (+0.66z)| norm 0.3521 (+4.64z)| lr 2.93e-04 | 4180.02 ms | 32.3% bf16 MFU | 125436 tok/s step 11881/19560 | loss 3.344062 (-0.52z)| norm 0.3011 (+2.14z)| lr 2.93e-04 | 4176.46 ms | 32.3% bf16 MFU | 125441 tok/s step 11882/19560 | loss 3.368296 (+0.09z)| norm 0.2503 (-0.24z)| lr 2.93e-04 | 4176.05 ms | 32.3% bf16 MFU | 125446 tok/s step 11883/19560 | loss 3.338429 (-0.65z)| norm 0.2640 (+0.40z)| lr 2.93e-04 | 4211.59 ms | 32.1% bf16 MFU | 125398 tok/s step 11884/19560 | loss 3.317906 (-1.15z)| norm 0.2639 (+0.41z)| lr 2.93e-04 | 4174.36 ms | 32.3% bf16 MFU | 125408 tok/s step 11885/19560 | loss 3.407472 (+1.07z)| norm 0.2692 (+0.66z)| lr 2.93e-04 | 4180.39 ms | 32.3% bf16 MFU | 125408 tok/s step 11886/19560 | loss 3.394919 (+0.75z)| norm 0.2638 (+0.40z)| lr 2.93e-04 | 4172.93 ms | 32.4% bf16 MFU | 125420 tok/s step 11887/19560 | loss 3.389676 (+0.61z)| norm 0.2668 (+0.54z)| lr 2.93e-04 | 4223.13 ms | 32.0% bf16 MFU | 125356 tok/s step 11888/19560 | loss 3.364524 (-0.01z)| norm 0.2514 (-0.19z)| lr 2.93e-04 | 4181.41 ms | 32.3% bf16 MFU | 125358 tok/s step 11889/19560 | loss 3.452284 (+2.17z)| norm 0.2646 (+0.43z)| lr 2.92e-04 | 4175.89 ms | 32.3% bf16 MFU | 125367 tok/s step 11890/19560 | loss 3.363787 (-0.02z)| norm 0.2621 (+0.31z)| lr 2.92e-04 | 4184.35 ms | 32.3% bf16 MFU | 125364 tok/s step 11891/19560 | loss 3.380245 (+0.38z)| norm 0.2870 (+1.49z)| lr 2.92e-04 | 4193.22 ms | 32.2% bf16 MFU | 125347 tok/s step 11892/19560 | loss 3.370712 (+0.13z)| norm 0.2566 (+0.04z)| lr 2.92e-04 | 4174.75 ms | 32.3% bf16 MFU | 125359 tok/s step 11893/19560 | loss 3.363056 (-0.05z)| norm 0.2840 (+1.32z)| lr 2.92e-04 | 4188.64 ms | 32.2% bf16 MFU | 125350 tok/s step 11894/19560 | loss 3.324828 (-0.99z)| norm 0.2519 (-0.19z)| lr 2.92e-04 | 4177.22 ms | 32.3% bf16 MFU | 125358 tok/s step 11895/19560 | loss 3.405539 (+1.01z)| norm 0.2739 (+0.84z)| lr 2.92e-04 | 4188.58 ms | 32.2% bf16 MFU | 125348 tok/s step 
11896/19560 | loss 3.353888 (-0.28z)| norm 0.2797 (+1.10z)| lr 2.92e-04 | 4170.45 ms | 32.4% bf16 MFU | 125367 tok/s step 11897/19560 | loss 3.244582 (-2.88z)| norm 0.2497 (-0.31z)| lr 2.92e-04 | 4177.74 ms | 32.3% bf16 MFU | 125373 tok/s step 11898/19560 | loss 3.331226 (-0.78z)| norm 0.2941 (+1.75z)| lr 2.92e-04 | 4184.32 ms | 32.3% bf16 MFU | 125369 tok/s step 11899/19560 | loss 3.334808 (-0.70z)| norm 0.2386 (-0.85z)| lr 2.92e-04 | 4177.28 ms | 32.3% bf16 MFU | 125376 tok/s step 11900/19560 | loss 3.292994 (-1.69z)| norm 0.2657 (+0.41z)| lr 2.92e-04 | 4166.95 ms | 32.4% bf16 MFU | 125399 tok/s step 11901/19560 | loss 3.351187 (-0.30z)| norm 0.2419 (-0.71z)| lr 2.92e-04 | 4170.09 ms | 32.4% bf16 MFU | 125415 tok/s step 11902/19560 | loss 3.372514 (+0.21z)| norm 0.2784 (+1.00z)| lr 2.92e-04 | 4175.53 ms | 32.3% bf16 MFU | 125422 tok/s step 11903/19560 | loss 3.363659 (-0.00z)| norm 0.2533 (-0.20z)| lr 2.92e-04 | 4181.71 ms | 32.3% bf16 MFU | 125420 tok/s step 11904/19560 | loss 3.373638 (+0.25z)| norm 0.2529 (-0.20z)| lr 2.91e-04 | 4177.11 ms | 32.3% bf16 MFU | 125425 tok/s step 11905/19560 | loss 3.316355 (-1.13z)| norm 0.2637 (+0.40z)| lr 2.91e-04 | 4169.90 ms | 32.4% bf16 MFU | 125440 tok/s step 11906/19560 | loss 3.314145 (-1.17z)| norm 0.2523 (-0.24z)| lr 2.91e-04 | 4175.03 ms | 32.3% bf16 MFU | 125447 tok/s step 11907/19560 | loss 3.339989 (-0.54z)| norm 0.2467 (-0.55z)| lr 2.91e-04 | 4183.91 ms | 32.3% bf16 MFU | 125440 tok/s step 11908/19560 | loss 3.358942 (-0.09z)| norm 0.2757 (+1.07z)| lr 2.91e-04 | 4179.26 ms | 32.3% bf16 MFU | 125441 tok/s step 11909/19560 | loss 3.328104 (-0.81z)| norm 0.2428 (-0.80z)| lr 2.91e-04 | 4177.46 ms | 32.3% bf16 MFU | 125444 tok/s step 11910/19560 | loss 3.320315 (-0.99z)| norm 0.2685 (+0.65z)| lr 2.91e-04 | 4171.88 ms | 32.4% bf16 MFU | 125455 tok/s step 11911/19560 | loss 3.322502 (-0.92z)| norm 0.2618 (+0.26z)| lr 2.91e-04 | 4169.81 ms | 32.4% bf16 MFU | 125469 tok/s step 11912/19560 | loss 3.400853 (+0.94z)| norm 
0.2680 (+0.61z)| lr 2.91e-04 | 4199.46 ms | 32.2% bf16 MFU | 125438 tok/s step 11913/19560 | loss 3.296703 (-1.52z)| norm 0.2654 (+0.45z)| lr 2.91e-04 | 4174.79 ms | 32.3% bf16 MFU | 125445 tok/s step 11914/19560 | loss 3.371049 (+0.26z)| norm 0.2543 (-0.18z)| lr 2.91e-04 | 4197.65 ms | 32.2% bf16 MFU | 125418 tok/s step 11915/19560 | loss 3.305323 (-1.31z)| norm 0.2613 (+0.21z)| lr 2.91e-04 | 4171.68 ms | 32.4% bf16 MFU | 125431 tok/s step 11916/19560 | loss 3.324251 (-0.84z)| norm 0.2339 (-1.35z)| lr 2.91e-04 | 4166.59 ms | 32.4% bf16 MFU | 125451 tok/s step 11917/19560 | loss 3.328763 (-0.73z)| norm 0.2615 (+0.23z)| lr 2.91e-04 | 4178.67 ms | 32.3% bf16 MFU | 125452 tok/s step 11918/19560 | loss 3.420370 (+1.47z)| norm 0.2363 (-1.19z)| lr 2.91e-04 | 4212.92 ms | 32.0% bf16 MFU | 125402 tok/s step 11919/19560 | loss 3.368429 (+0.21z)| norm 0.2565 (-0.03z)| lr 2.91e-04 | 4191.81 ms | 32.2% bf16 MFU | 125385 tok/s step 11920/19560 | loss 3.381686 (+0.54z)| norm 0.2481 (-0.52z)| lr 2.90e-04 | 4168.78 ms | 32.4% bf16 MFU | 125404 tok/s step 11921/19560 | loss 3.335600 (-0.58z)| norm 0.2494 (-0.44z)| lr 2.90e-04 | 4176.10 ms | 32.3% bf16 MFU | 125411 tok/s step 11922/19560 | loss 3.393397 (+0.83z)| norm 0.2483 (-0.50z)| lr 2.90e-04 | 4181.85 ms | 32.3% bf16 MFU | 125409 tok/s step 11923/19560 | loss 3.398261 (+0.93z)| norm 0.2429 (-0.80z)| lr 2.90e-04 | 4180.74 ms | 32.3% bf16 MFU | 125409 tok/s step 11924/19560 | loss 3.305528 (-1.31z)| norm 0.2478 (-0.52z)| lr 2.90e-04 | 4173.37 ms | 32.4% bf16 MFU | 125420 tok/s step 11925/19560 | loss 3.315799 (-1.04z)| norm 0.2383 (-1.04z)| lr 2.90e-04 | 4178.60 ms | 32.3% bf16 MFU | 125423 tok/s step 11926/19560 | loss 3.313002 (-1.09z)| norm 0.2551 (-0.09z)| lr 2.90e-04 | 4172.55 ms | 32.4% bf16 MFU | 125434 tok/s step 11927/19560 | loss 3.330842 (-0.65z)| norm 0.2433 (-0.75z)| lr 2.90e-04 | 4230.30 ms | 31.9% bf16 MFU | 125359 tok/s step 11928/19560 | loss 3.353642 (-0.10z)| norm 0.2494 (-0.40z)| lr 2.90e-04 | 4175.22 ms | 
32.3% bf16 MFU | 125370 tok/s step 11929/19560 | loss 3.339379 (-0.44z)| norm 0.2465 (-0.57z)| lr 2.90e-04 | 4905.67 ms | 27.5% bf16 MFU | 124445 tok/s step 11930/19560 | loss 3.313066 (-1.07z)| norm 0.2556 (-0.04z)| lr 2.90e-04 | 4723.18 ms | 28.6% bf16 MFU | 123773 tok/s step 11931/19560 | loss 3.325840 (-0.75z)| norm 0.2544 (-0.12z)| lr 2.90e-04 | 4371.72 ms | 30.9% bf16 MFU | 123581 tok/s step 11932/19560 | loss 3.353947 (-0.07z)| norm 0.2530 (-0.19z)| lr 2.90e-04 | 4537.98 ms | 29.8% bf16 MFU | 123178 tok/s step 11933/19560 | loss 3.362658 (+0.15z)| norm 0.2587 (+0.16z)| lr 2.90e-04 | 4391.87 ms | 30.7% bf16 MFU | 122988 tok/s step 11934/19560 | loss 3.346415 (-0.25z)| norm 0.2521 (-0.23z)| lr 2.90e-04 | 4357.44 ms | 31.0% bf16 MFU | 122855 tok/s step 11935/19560 | loss 3.368903 (+0.30z)| norm 0.2731 (+1.01z)| lr 2.89e-04 | 4410.96 ms | 30.6% bf16 MFU | 122655 tok/s step 11936/19560 | loss 3.349926 (-0.16z)| norm 0.2808 (+1.44z)| lr 2.89e-04 | 4412.88 ms | 30.6% bf16 MFU | 122463 tok/s step 11937/19560 | loss 3.339444 (-0.43z)| norm 0.2504 (-0.35z)| lr 2.89e-04 | 4282.25 ms | 31.5% bf16 MFU | 122461 tok/s step 11938/19560 | loss 3.353628 (-0.07z)| norm 0.2782 (+1.27z)| lr 2.89e-04 | 4256.86 ms | 31.7% bf16 MFU | 122496 tok/s step 11939/19560 | loss 3.362981 (+0.17z)| norm 0.2434 (-0.76z)| lr 2.89e-04 | 4372.80 ms | 30.9% bf16 MFU | 122366 tok/s step 11940/19560 | loss 3.405210 (+1.21z)| norm 0.2713 (+0.91z)| lr 2.89e-04 | 4222.92 ms | 32.0% bf16 MFU | 122456 tok/s step 11941/19560 | loss 3.349376 (-0.18z)| norm 0.2574 (+0.07z)| lr 2.89e-04 | 4252.79 ms | 31.7% bf16 MFU | 122497 tok/s step 11942/19560 | loss 3.393623 (+0.93z)| norm 0.2771 (+1.24z)| lr 2.89e-04 | 4254.82 ms | 31.7% bf16 MFU | 122533 tok/s step 11943/19560 | loss 3.335073 (-0.54z)| norm 0.2674 (+0.65z)| lr 2.89e-04 | 4175.57 ms | 32.3% bf16 MFU | 122685 tok/s step 11944/19560 | loss 3.391327 (+0.88z)| norm 0.2609 (+0.25z)| lr 2.89e-04 | 4175.32 ms | 32.3% bf16 MFU | 122829 tok/s step 11945/19560 
| loss 3.339683 (-0.42z)| norm 0.2465 (-0.60z)| lr 2.89e-04 | 4298.80 ms | 31.4% bf16 MFU | 122785 tok/s step 11946/19560 | loss 3.314608 (-1.04z)| norm 0.2750 (+1.09z)| lr 2.89e-04 | 4288.92 ms | 31.5% bf16 MFU | 122758 tok/s step 11947/19560 | loss 3.305182 (-1.25z)| norm 0.2518 (-0.29z)| lr 2.89e-04 | 4265.87 ms | 31.7% bf16 MFU | 122766 tok/s step 11948/19560 | loss 3.393350 (+0.93z)| norm 0.3027 (+2.64z)| lr 2.89e-04 | 4215.84 ms | 32.0% bf16 MFU | 122845 tok/s step 11949/19560 | loss 3.385725 (+0.73z)| norm 0.2336 (-1.34z)| lr 2.89e-04 | 4221.34 ms | 32.0% bf16 MFU | 122913 tok/s step 11950/19560 | loss 3.406712 (+1.23z)| norm 0.2707 (+0.77z)| lr 2.88e-04 | 4216.18 ms | 32.0% bf16 MFU | 122985 tok/s step 11951/19560 | loss 3.365870 (+0.22z)| norm 0.2492 (-0.47z)| lr 2.88e-04 | 4176.47 ms | 32.3% bf16 MFU | 123112 tok/s step 11952/19560 | loss 3.333399 (-0.59z)| norm 0.2557 (-0.10z)| lr 2.88e-04 | 4216.60 ms | 32.0% bf16 MFU | 123174 tok/s step 11953/19560 | loss 3.351835 (-0.12z)| norm 0.2712 (+0.79z)| lr 2.88e-04 | 4228.48 ms | 31.9% bf16 MFU | 123214 tok/s step 11954/19560 | loss 3.357187 (+0.02z)| norm 0.2737 (+0.93z)| lr 2.88e-04 | 4232.14 ms | 31.9% bf16 MFU | 123248 tok/s step 11955/19560 | loss 3.298010 (-1.44z)| norm 0.2591 (+0.08z)| lr 2.88e-04 | 4187.77 ms | 32.2% bf16 MFU | 123345 tok/s step 11956/19560 | loss 3.386876 (+0.78z)| norm 0.2554 (-0.15z)| lr 2.88e-04 | 4170.67 ms | 32.4% bf16 MFU | 123463 tok/s step 11957/19560 | loss 3.347590 (-0.19z)| norm 0.2514 (-0.38z)| lr 2.88e-04 | 4200.71 ms | 32.1% bf16 MFU | 123531 tok/s step 11958/19560 | loss 3.367370 (+0.33z)| norm 0.2483 (-0.57z)| lr 2.88e-04 | 4176.50 ms | 32.3% bf16 MFU | 123631 tok/s step 11959/19560 | loss 3.322877 (-0.84z)| norm 0.2464 (-0.67z)| lr 2.88e-04 | 4179.92 ms | 32.3% bf16 MFU | 123721 tok/s step 11960/19560 | loss 3.388275 (+0.87z)| norm 0.2588 (+0.06z)| lr 2.88e-04 | 4305.75 ms | 31.4% bf16 MFU | 123623 tok/s step 11961/19560 | loss 3.354416 (-0.03z)| norm 0.2690 (+0.65z)| 
lr 2.88e-04 | 4182.10 ms | 32.3% bf16 MFU | 123710 tok/s step 11962/19560 | loss 3.359454 (+0.09z)| norm 0.2487 (-0.54z)| lr 2.88e-04 | 4173.96 ms | 32.3% bf16 MFU | 123805 tok/s step 11963/19560 | loss 3.335961 (-0.52z)| norm 0.2479 (-0.58z)| lr 2.88e-04 | 4199.17 ms | 32.2% bf16 MFU | 123858 tok/s step 11964/19560 | loss 3.327599 (-0.76z)| norm 0.2666 (+0.52z)| lr 2.88e-04 | 4179.97 ms | 32.3% bf16 MFU | 123936 tok/s step 11965/19560 | loss 3.376853 (+0.57z)| norm 0.2674 (+0.56z)| lr 2.88e-04 | 4216.72 ms | 32.0% bf16 MFU | 123956 tok/s step 11966/19560 | loss 3.349658 (-0.17z)| norm 0.2846 (+1.54z)| lr 2.87e-04 | 4170.12 ms | 32.4% bf16 MFU | 124044 tok/s step 11967/19560 | loss 3.302791 (-1.43z)| norm 0.2706 (+0.73z)| lr 2.87e-04 | 4172.36 ms | 32.4% bf16 MFU | 124125 tok/s step 11968/19560 | loss 3.383435 (+0.74z)| norm 0.2842 (+1.50z)| lr 2.87e-04 | 4185.14 ms | 32.3% bf16 MFU | 124183 tok/s step 11969/19560 | loss 3.380096 (+0.69z)| norm 0.2663 (+0.47z)| lr 2.87e-04 | 4186.00 ms | 32.3% bf16 MFU | 124236 tok/s step 11970/19560 | loss 3.338369 (-0.47z)| norm 0.2861 (+1.59z)| lr 2.87e-04 | 4173.52 ms | 32.4% bf16 MFU | 124305 tok/s step 11971/19560 | loss 3.296269 (-1.61z)| norm 0.2512 (-0.43z)| lr 2.87e-04 | 4178.92 ms | 32.3% bf16 MFU | 124363 tok/s step 11972/19560 | loss 3.362996 (+0.24z)| norm 0.2955 (+2.09z)| lr 2.87e-04 | 4174.12 ms | 32.3% bf16 MFU | 124425 tok/s step 11973/19560 | loss 3.342123 (-0.34z)| norm 0.2686 (+0.54z)| lr 2.87e-04 | 4180.83 ms | 32.3% bf16 MFU | 124474 tok/s step 11974/19560 | loss 3.578295 (+5.41z)| norm 0.2692 (+0.57z)| lr 2.87e-04 | 4180.86 ms | 32.3% bf16 MFU | 124520 tok/s step 11975/19560 | loss 3.344588 (-0.27z)| norm 0.2654 (+0.34z)| lr 2.87e-04 | 4173.35 ms | 32.4% bf16 MFU | 124576 tok/s step 11976/19560 | loss 3.362488 (+0.17z)| norm 0.2373 (-1.26z)| lr 2.87e-04 | 4166.31 ms | 32.4% bf16 MFU | 124639 tok/s step 11977/19560 | loss 3.379698 (+0.58z)| norm 0.2470 (-0.71z)| lr 2.87e-04 | 4169.05 ms | 32.4% bf16 MFU | 
124695 tok/s step 11978/19560 | loss 3.330824 (-0.62z)| norm 0.2553 (-0.21z)| lr 2.87e-04 | 4222.08 ms | 32.0% bf16 MFU | 124669 tok/s step 11979/19560 | loss 3.360179 (+0.10z)| norm 0.2325 (-1.53z)| lr 2.87e-04 | 4171.24 ms | 32.4% bf16 MFU | 124720 tok/s step 11980/19560 | loss 3.387656 (+0.77z)| norm 0.2651 (+0.36z)| lr 2.87e-04 | 4185.84 ms | 32.3% bf16 MFU | 124747 tok/s step 11981/19560 | loss 3.292690 (-1.58z)| norm 0.3205 (+3.41z)| lr 2.86e-04 | 4176.76 ms | 32.3% bf16 MFU | 124786 tok/s step 11982/19560 | loss 3.385826 (+0.78z)| norm 0.2483 (-0.60z)| lr 2.86e-04 | 4178.23 ms | 32.3% bf16 MFU | 124820 tok/s step 11983/19560 | loss 3.357841 (+0.08z)| norm 0.2781 (+1.04z)| lr 2.86e-04 | 4218.67 ms | 32.0% bf16 MFU | 124793 tok/s step 11984/19560 | loss 3.333843 (-0.53z)| norm 0.2369 (-1.23z)| lr 2.86e-04 | 4169.04 ms | 32.4% bf16 MFU | 124841 tok/s step 11985/19560 | loss 3.444630 (+2.26z)| norm 0.2787 (+1.06z)| lr 2.86e-04 | 4181.34 ms | 32.3% bf16 MFU | 124869 tok/s step 11986/19560 | loss 3.352190 (-0.08z)| norm 0.2472 (-0.67z)| lr 2.86e-04 | 4192.03 ms | 32.2% bf16 MFU | 124879 tok/s step 11987/19560 | loss 3.348706 (-0.17z)| norm 0.2749 (+0.84z)| lr 2.86e-04 | 4177.96 ms | 32.3% bf16 MFU | 124909 tok/s step 11988/19560 | loss 3.376560 (+0.53z)| norm 0.2592 (-0.04z)| lr 2.86e-04 | 4168.78 ms | 32.4% bf16 MFU | 124952 tok/s step 11989/19560 | loss 3.409727 (+1.35z)| norm 0.2657 (+0.31z)| lr 2.86e-04 | 4175.29 ms | 32.3% bf16 MFU | 124983 tok/s step 11990/19560 | loss 3.317888 (-0.95z)| norm 0.2588 (-0.08z)| lr 2.86e-04 | 4436.11 ms | 30.4% bf16 MFU | 124643 tok/s step 11991/19560 | loss 3.397457 (+1.08z)| norm 0.2877 (+1.51z)| lr 2.86e-04 | 4182.82 ms | 32.3% bf16 MFU | 124678 tok/s step 11992/19560 | loss 3.366021 (+0.27z)| norm 0.2413 (-1.07z)| lr 2.86e-04 | 4177.38 ms | 32.3% bf16 MFU | 124720 tok/s step 11993/19560 | loss 3.331330 (-0.62z)| norm 0.2695 (+0.49z)| lr 2.86e-04 | 4174.27 ms | 32.3% bf16 MFU | 124764 tok/s step 11994/19560 | loss 3.298459 
(-1.44z)| norm 0.2432 (-0.96z)| lr 2.86e-04 | 4188.45 ms | 32.2% bf16 MFU | 124784 tok/s step 11995/19560 | loss 3.305673 (-1.23z)| norm 0.2620 (+0.08z)| lr 2.86e-04 | 4174.06 ms | 32.3% bf16 MFU | 124825 tok/s step 11996/19560 | loss 3.311308 (-1.07z)| norm 0.2411 (-1.09z)| lr 2.86e-04 | 4178.68 ms | 32.3% bf16 MFU | 124857 tok/s step 11997/19560 | loss 3.383815 (+0.75z)| norm 0.2457 (-0.83z)| lr 2.85e-04 | 4194.69 ms | 32.2% bf16 MFU | 124864 tok/s step 11998/19560 | loss 3.343296 (-0.28z)| norm 0.2303 (-1.67z)| lr 2.85e-04 | 4185.17 ms | 32.3% bf16 MFU | 124884 tok/s step 11999/19560 | loss 3.341052 (-0.33z)| norm 0.2505 (-0.54z)| lr 2.85e-04 | 4214.09 ms | 32.0% bf16 MFU | 124861 tok/s step 12000/19560 | loss 3.299704 (-1.36z)| norm 0.2435 (-0.94z)| lr 2.85e-04 | 4174.82 ms | 32.3% bf16 MFU | 124897 tok/s val loss 3.337701 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 
370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 
evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2973/10042 = 0.296057 step 12001/19560 | loss 3.352640 (-0.03z)| norm 0.2622 (+0.10z)| lr 2.85e-04 | 4189.79 ms | 32.2% bf16 MFU | 124909 tok/s step 12002/19560 | loss 3.392937 (+0.98z)| norm 0.2257 (-1.90z)| lr 2.85e-04 | 4183.01 ms | 32.3% bf16 MFU | 124930 tok/s step 12003/19560 | loss 3.336697 (-0.45z)| norm 0.2590 (-0.07z)| lr 2.85e-04 | 4210.36 ms | 32.1% bf16 MFU | 124910 tok/s step 12004/19560 | loss 3.368017 (+0.35z)| norm 0.2291 (-1.72z)| lr 2.85e-04 | 4178.07 ms | 32.3% bf16 MFU | 124939 tok/s step 12005/19560 | loss 3.415915 (+1.54z)| norm 0.2471 (-0.73z)| lr 2.85e-04 | 4174.62 ms | 32.3% bf16 MFU | 124971 tok/s step 12006/19560 | loss 3.314875 (-1.01z)| norm 0.2385 (-1.20z)| lr 2.85e-04 | 4206.18 ms | 32.1% bf16 MFU | 124955 tok/s step 12007/19560 | loss 3.391166 (+0.92z)| norm 0.2524 (-0.43z)| lr 2.85e-04 | 4170.32 ms | 32.4% bf16 MFU | 124993 tok/s step 12008/19560 | loss 3.332147 (-0.56z)| norm 0.2394 (-1.21z)| lr 2.85e-04 | 4206.76 ms | 32.1% bf16 MFU | 124975 tok/s step 12009/19560 | loss 3.434313 (+1.97z)| norm 0.2857 (+1.66z)| lr 2.85e-04 | 4189.90 ms | 32.2% bf16 MFU | 124983 tok/s step 12010/19560 | loss 3.305790 (-1.21z)| norm 0.2687 (+0.59z)| lr 2.85e-04 | 4176.31 ms | 32.3% 
bf16 MFU | 125011 tok/s step 12011/19560 | loss 3.354676 (-0.00z)| norm 0.2709 (+0.72z)| lr 2.85e-04 | 4170.45 ms | 32.4% bf16 MFU | 125046 tok/s step 12012/19560 | loss 3.361182 (+0.15z)| norm 0.2509 (-0.52z)| lr 2.84e-04 | 4170.27 ms | 32.4% bf16 MFU | 125080 tok/s step 12013/19560 | loss 3.321430 (-0.82z)| norm 0.2487 (-0.64z)| lr 2.84e-04 | 4175.03 ms | 32.3% bf16 MFU | 125104 tok/s step 12014/19560 | loss 3.335703 (-0.46z)| norm 0.2437 (-0.94z)| lr 2.84e-04 | 4244.24 ms | 31.8% bf16 MFU | 125026 tok/s step 12015/19560 | loss 3.310502 (-1.07z)| norm 0.2508 (-0.50z)| lr 2.84e-04 | 4171.72 ms | 32.4% bf16 MFU | 125058 tok/s step 12016/19560 | loss 3.334650 (-0.46z)| norm 0.2459 (-0.79z)| lr 2.84e-04 | 4182.55 ms | 32.3% bf16 MFU | 125073 tok/s step 12017/19560 | loss 3.357915 (+0.14z)| norm 0.2334 (-1.54z)| lr 2.84e-04 | 4161.10 ms | 32.4% bf16 MFU | 125119 tok/s step 12018/19560 | loss 3.339521 (-0.32z)| norm 0.2619 (+0.21z)| lr 2.84e-04 | 4182.57 ms | 32.3% bf16 MFU | 125131 tok/s step 12019/19560 | loss 3.358104 (+0.16z)| norm 0.2499 (-0.51z)| lr 2.84e-04 | 4197.06 ms | 32.2% bf16 MFU | 125120 tok/s step 12020/19560 | loss 3.395941 (+1.12z)| norm 0.2339 (-1.48z)| lr 2.84e-04 | 4165.74 ms | 32.4% bf16 MFU | 125157 tok/s step 12021/19560 | loss 3.374180 (+0.56z)| norm 0.2330 (-1.52z)| lr 2.84e-04 | 4174.64 ms | 32.3% bf16 MFU | 125178 tok/s step 12022/19560 | loss 3.366223 (+0.35z)| norm 0.2465 (-0.68z)| lr 2.84e-04 | 4185.91 ms | 32.3% bf16 MFU | 125182 tok/s step 12023/19560 | loss 3.329227 (-0.59z)| norm 0.2415 (-0.97z)| lr 2.84e-04 | 4168.60 ms | 32.4% bf16 MFU | 125212 tok/s step 12024/19560 | loss 3.446823 (+2.37z)| norm 0.2537 (-0.21z)| lr 2.84e-04 | 4177.07 ms | 32.3% bf16 MFU | 125227 tok/s step 12025/19560 | loss 3.367654 (+0.36z)| norm 0.2520 (-0.32z)| lr 2.84e-04 | 4162.46 ms | 32.4% bf16 MFU | 125263 tok/s step 12026/19560 | loss 3.393343 (+1.01z)| norm 0.2773 (+1.27z)| lr 2.84e-04 | 4201.08 ms | 32.1% bf16 MFU | 125240 tok/s step 12027/19560 | loss 
3.290424 (-1.62z)| norm 0.2402 (-1.05z)| lr 2.83e-04 | 4169.74 ms | 32.4% bf16 MFU | 125265 tok/s step 12028/19560 | loss 3.394025 (+1.01z)| norm 0.2739 (+1.06z)| lr 2.83e-04 | 4178.62 ms | 32.3% bf16 MFU | 125275 tok/s step 12029/19560 | loss 3.320358 (-0.87z)| norm 0.2452 (-0.75z)| lr 2.83e-04 | 4179.14 ms | 32.3% bf16 MFU | 125284 tok/s step 12030/19560 | loss 3.325521 (-0.73z)| norm 0.2394 (-1.10z)| lr 2.83e-04 | 4224.31 ms | 32.0% bf16 MFU | 125225 tok/s step 12031/19560 | loss 3.415107 (+1.54z)| norm 0.2840 (+1.68z)| lr 2.83e-04 | 4183.46 ms | 32.3% bf16 MFU | 125230 tok/s step 12032/19560 | loss 3.342446 (-0.30z)| norm 0.2505 (-0.41z)| lr 2.83e-04 | 4179.94 ms | 32.3% bf16 MFU | 125240 tok/s step 12033/19560 | loss 3.279361 (-1.87z)| norm 0.2789 (+1.35z)| lr 2.83e-04 | 4175.15 ms | 32.3% bf16 MFU | 125257 tok/s step 12034/19560 | loss 3.461905 (+2.62z)| norm 0.2671 (+0.61z)| lr 2.83e-04 | 4174.88 ms | 32.3% bf16 MFU | 125273 tok/s step 12035/19560 | loss 3.410328 (+1.34z)| norm 0.2371 (-1.24z)| lr 2.83e-04 | 4182.41 ms | 32.3% bf16 MFU | 125277 tok/s step 12036/19560 | loss 3.368309 (+0.31z)| norm 0.2578 (+0.05z)| lr 2.83e-04 | 4201.26 ms | 32.1% bf16 MFU | 125253 tok/s step 12037/19560 | loss 3.438726 (+1.98z)| norm 0.2680 (+0.67z)| lr 2.83e-04 | 4175.38 ms | 32.3% bf16 MFU | 125269 tok/s step 12038/19560 | loss 3.308115 (-1.16z)| norm 0.2869 (+1.81z)| lr 2.83e-04 | 4167.02 ms | 32.4% bf16 MFU | 125296 tok/s step 12039/19560 | loss 3.316281 (-0.96z)| norm 0.2639 (+0.40z)| lr 2.83e-04 | 4173.45 ms | 32.4% bf16 MFU | 125313 tok/s step 12040/19560 | loss 3.376112 (+0.48z)| norm 0.2867 (+1.77z)| lr 2.83e-04 | 4179.18 ms | 32.3% bf16 MFU | 125320 tok/s step 12041/19560 | loss 3.424220 (+1.61z)| norm 0.2678 (+0.62z)| lr 2.83e-04 | 4190.27 ms | 32.2% bf16 MFU | 125310 tok/s step 12042/19560 | loss 3.396739 (+0.94z)| norm 0.2689 (+0.68z)| lr 2.83e-04 | 4178.79 ms | 32.3% bf16 MFU | 125317 tok/s step 12043/19560 | loss 3.325864 (-0.76z)| norm 0.2515 (-0.37z)| lr 
2.82e-04 | 4173.40 ms | 32.4% bf16 MFU | 125333 tok/s step 12044/19560 | loss 3.348907 (-0.21z)| norm 0.2622 (+0.26z)| lr 2.82e-04 | 4182.02 ms | 32.3% bf16 MFU | 125334 tok/s step 12045/19560 | loss 3.302930 (-1.31z)| norm 0.2766 (+1.13z)| lr 2.82e-04 | 4175.39 ms | 32.3% bf16 MFU | 125346 tok/s step 12046/19560 | loss 3.339381 (-0.42z)| norm 0.2852 (+1.62z)| lr 2.82e-04 | 4178.03 ms | 32.3% bf16 MFU | 125353 tok/s step 12047/19560 | loss 3.339545 (-0.41z)| norm 0.2495 (-0.53z)| lr 2.82e-04 | 4183.01 ms | 32.3% bf16 MFU | 125352 tok/s step 12048/19560 | loss 3.345451 (-0.26z)| norm 0.2728 (+0.86z)| lr 2.82e-04 | 4182.77 ms | 32.3% bf16 MFU | 125352 tok/s step 12049/19560 | loss 3.365594 (+0.22z)| norm 0.2580 (-0.03z)| lr 2.82e-04 | 4181.11 ms | 32.3% bf16 MFU | 125354 tok/s step 12050/19560 | loss 3.332488 (-0.57z)| norm 0.2616 (+0.18z)| lr 2.82e-04 | 4171.78 ms | 32.4% bf16 MFU | 125370 tok/s step 12051/19560 | loss 3.348397 (-0.18z)| norm 0.2556 (-0.19z)| lr 2.82e-04 | 4173.11 ms | 32.4% bf16 MFU | 125383 tok/s step 12052/19560 | loss 3.352515 (-0.09z)| norm 0.2462 (-0.76z)| lr 2.82e-04 | 4179.74 ms | 32.3% bf16 MFU | 125386 tok/s step 12053/19560 | loss 3.360547 (+0.10z)| norm 0.2736 (+0.88z)| lr 2.82e-04 | 4176.00 ms | 32.3% bf16 MFU | 125394 tok/s step 12054/19560 | loss 3.358249 (+0.04z)| norm 0.2632 (+0.25z)| lr 2.82e-04 | 4176.65 ms | 32.3% bf16 MFU | 125401 tok/s step 12055/19560 | loss 3.335503 (-0.53z)| norm 0.2708 (+0.70z)| lr 2.82e-04 | 4178.62 ms | 32.3% bf16 MFU | 125404 tok/s step 12056/19560 | loss 3.374421 (+0.43z)| norm 0.2693 (+0.60z)| lr 2.82e-04 | 4173.59 ms | 32.4% bf16 MFU | 125415 tok/s step 12057/19560 | loss 3.322682 (-0.84z)| norm 0.2409 (-1.12z)| lr 2.82e-04 | 4169.86 ms | 32.4% bf16 MFU | 125431 tok/s step 12058/19560 | loss 3.377398 (+0.50z)| norm 0.2595 (+0.00z)| lr 2.81e-04 | 4175.87 ms | 32.3% bf16 MFU | 125437 tok/s step 12059/19560 | loss 3.440092 (+2.00z)| norm 0.2672 (+0.47z)| lr 2.81e-04 | 4173.82 ms | 32.3% bf16 MFU | 125446 
tok/s step 12060/19560 | loss 3.467997 (+2.58z)| norm 0.2518 (-0.47z)| lr 2.81e-04 | 4170.73 ms | 32.4% bf16 MFU | 125459 tok/s step 12061/19560 | loss 3.383375 (+0.57z)| norm 0.2676 (+0.49z)| lr 2.81e-04 | 4172.78 ms | 32.4% bf16 MFU | 125468 tok/s step 12062/19560 | loss 3.357296 (-0.05z)| norm 0.2503 (-0.56z)| lr 2.81e-04 | 4180.16 ms | 32.3% bf16 MFU | 125466 tok/s step 12063/19560 | loss 3.354939 (-0.10z)| norm 0.2724 (+0.78z)| lr 2.81e-04 | 4186.80 ms | 32.2% bf16 MFU | 125454 tok/s step 12064/19560 | loss 3.350280 (-0.21z)| norm 0.2460 (-0.81z)| lr 2.81e-04 | 4171.36 ms | 32.4% bf16 MFU | 125465 tok/s step 12065/19560 | loss 3.332528 (-0.63z)| norm 0.2646 (+0.32z)| lr 2.81e-04 | 4166.35 ms | 32.4% bf16 MFU | 125484 tok/s step 12066/19560 | loss 3.329650 (-0.70z)| norm 0.2464 (-0.77z)| lr 2.81e-04 | 4174.47 ms | 32.3% bf16 MFU | 125490 tok/s step 12067/19560 | loss 3.324918 (-0.80z)| norm 0.2532 (-0.37z)| lr 2.81e-04 | 4304.17 ms | 31.4% bf16 MFU | 125306 tok/s step 12068/19560 | loss 3.366411 (+0.19z)| norm 0.2468 (-0.75z)| lr 2.81e-04 | 4179.64 ms | 32.3% bf16 MFU | 125312 tok/s step 12069/19560 | loss 3.281692 (-1.79z)| norm 0.2637 (+0.28z)| lr 2.81e-04 | 4181.99 ms | 32.3% bf16 MFU | 125315 tok/s step 12070/19560 | loss 3.338644 (-0.44z)| norm 0.2721 (+0.80z)| lr 2.81e-04 | 4175.52 ms | 32.3% bf16 MFU | 125327 tok/s step 12071/19560 | loss 3.389880 (+0.75z)| norm 0.2413 (-1.07z)| lr 2.81e-04 | 4182.37 ms | 32.3% bf16 MFU | 125329 tok/s step 12072/19560 | loss 3.286952 (-1.63z)| norm 0.2863 (+1.65z)| lr 2.81e-04 | 4194.02 ms | 32.2% bf16 MFU | 125313 tok/s step 12073/19560 | loss 3.400930 (+1.00z)| norm 0.2607 (+0.09z)| lr 2.81e-04 | 4180.25 ms | 32.3% bf16 MFU | 125318 tok/s step 12074/19560 | loss 3.362585 (+0.11z)| norm 0.2423 (-1.00z)| lr 2.80e-04 | 4170.86 ms | 32.4% bf16 MFU | 125337 tok/s step 12075/19560 | loss 3.328775 (-0.69z)| norm 0.2536 (-0.32z)| lr 2.80e-04 | 4177.83 ms | 32.3% bf16 MFU | 125345 tok/s step 12076/19560 | loss 3.323509 
(-0.80z)| norm 0.2442 (-0.89z)| lr 2.80e-04 | 4170.42 ms | 32.4% bf16 MFU | 125364 tok/s step 12077/19560 | loss 3.350480 (-0.16z)| norm 0.2466 (-0.75z)| lr 2.80e-04 | 4192.82 ms | 32.2% bf16 MFU | 125348 tok/s step 12078/19560 | loss 3.339515 (-0.41z)| norm 0.2643 (+0.36z)| lr 2.80e-04 | 4184.02 ms | 32.3% bf16 MFU | 125346 tok/s step 12079/19560 | loss 3.370537 (+0.32z)| norm 0.2680 (+0.59z)| lr 2.80e-04 | 4177.46 ms | 32.3% bf16 MFU | 125354 tok/s step 12080/19560 | loss 3.370589 (+0.32z)| norm 0.2388 (-1.23z)| lr 2.80e-04 | 4171.32 ms | 32.4% bf16 MFU | 125370 tok/s step 12081/19560 | loss 3.376755 (+0.46z)| norm 0.2662 (+0.48z)| lr 2.80e-04 | 4175.32 ms | 32.3% bf16 MFU | 125380 tok/s step 12082/19560 | loss 3.338163 (-0.45z)| norm 0.2528 (-0.34z)| lr 2.80e-04 | 4176.61 ms | 32.3% bf16 MFU | 125388 tok/s step 12083/19560 | loss 3.348758 (-0.21z)| norm 0.2622 (+0.24z)| lr 2.80e-04 | 4191.46 ms | 32.2% bf16 MFU | 125373 tok/s step 12084/19560 | loss 3.386233 (+0.68z)| norm 0.2520 (-0.40z)| lr 2.80e-04 | 4247.15 ms | 31.8% bf16 MFU | 125276 tok/s step 12085/19560 | loss 3.375981 (+0.43z)| norm 0.2436 (-0.91z)| lr 2.80e-04 | 4163.12 ms | 32.4% bf16 MFU | 125309 tok/s step 12086/19560 | loss 3.304488 (-1.24z)| norm 0.2498 (-0.52z)| lr 2.80e-04 | 4167.33 ms | 32.4% bf16 MFU | 125334 tok/s step 12087/19560 | loss 3.354406 (-0.07z)| norm 0.2931 (+2.12z)| lr 2.80e-04 | 4170.42 ms | 32.4% bf16 MFU | 125353 tok/s step 12088/19560 | loss 3.342947 (-0.34z)| norm 0.2784 (+1.20z)| lr 2.80e-04 | 4173.39 ms | 32.4% bf16 MFU | 125367 tok/s step 12089/19560 | loss 3.360847 (+0.09z)| norm 0.2512 (-0.45z)| lr 2.79e-04 | 4184.60 ms | 32.3% bf16 MFU | 125363 tok/s step 12090/19560 | loss 3.360066 (+0.07z)| norm 0.2592 (+0.03z)| lr 2.79e-04 | 4178.85 ms | 32.3% bf16 MFU | 125368 tok/s step 12091/19560 | loss 3.380457 (+0.54z)| norm 0.2528 (-0.36z)| lr 2.79e-04 | 4236.32 ms | 31.9% bf16 MFU | 125288 tok/s step 12092/19560 | loss 3.374669 (+0.39z)| norm 0.2732 (+0.88z)| lr 2.79e-04 | 
4192.07 ms | 32.2% bf16 MFU | 125277 tok/s step 12093/19560 | loss 3.351663 (-0.14z)| norm 0.2478 (-0.66z)| lr 2.79e-04 | 4172.25 ms | 32.4% bf16 MFU | 125296 tok/s step 12094/19560 | loss 3.362564 (+0.11z)| norm 0.2489 (-0.58z)| lr 2.79e-04 | 4177.56 ms | 32.3% bf16 MFU | 125306 tok/s step 12095/19560 | loss 3.298722 (-1.40z)| norm 0.2548 (-0.21z)| lr 2.79e-04 | 4167.55 ms | 32.4% bf16 MFU | 125331 tok/s step 12096/19560 | loss 3.322428 (-0.82z)| norm 0.2428 (-0.94z)| lr 2.79e-04 | 4173.21 ms | 32.4% bf16 MFU | 125346 tok/s step 12097/19560 | loss 3.335865 (-0.50z)| norm 0.2360 (-1.34z)| lr 2.79e-04 | 4171.59 ms | 32.4% bf16 MFU | 125363 tok/s step 12098/19560 | loss 3.332658 (-0.57z)| norm 0.2354 (-1.36z)| lr 2.79e-04 | 4175.19 ms | 32.3% bf16 MFU | 125373 tok/s step 12099/19560 | loss 3.339905 (-0.41z)| norm 0.2642 (+0.42z)| lr 2.79e-04 | 4188.75 ms | 32.2% bf16 MFU | 125363 tok/s step 12100/19560 | loss 3.360756 (+0.08z)| norm 0.2413 (-0.99z)| lr 2.79e-04 | 4179.91 ms | 32.3% bf16 MFU | 125366 tok/s step 12101/19560 | loss 3.336079 (-0.50z)| norm 0.2460 (-0.68z)| lr 2.79e-04 | 4173.11 ms | 32.4% bf16 MFU | 125380 tok/s step 12102/19560 | loss 3.396489 (+1.09z)| norm 0.2468 (-0.62z)| lr 2.79e-04 | 4180.17 ms | 32.3% bf16 MFU | 125382 tok/s step 12103/19560 | loss 3.393661 (+1.00z)| norm 0.8394 (+10.77z)| lr 2.79e-04 | 4175.29 ms | 32.3% bf16 MFU | 125391 tok/s step 12104/19560 | loss 3.319267 (-0.97z)| norm 0.3718 (+2.01z)| lr 2.79e-04 | 4176.75 ms | 32.3% bf16 MFU | 125398 tok/s step 12105/19560 | loss 3.314780 (-1.07z)| norm 0.3041 (+0.76z)| lr 2.78e-04 | 4160.76 ms | 32.5% bf16 MFU | 125428 tok/s step 12106/19560 | loss 3.346998 (-0.22z)| norm 0.3477 (+1.53z)| lr 2.78e-04 | 4183.81 ms | 32.3% bf16 MFU | 125423 tok/s step 12107/19560 | loss 3.424787 (+1.80z)| norm 0.2808 (+0.31z)| lr 2.78e-04 | 4180.48 ms | 32.3% bf16 MFU | 125422 tok/s step 12108/19560 | loss 3.343839 (-0.31z)| norm 0.2818 (+0.33z)| lr 2.78e-04 | 4164.42 ms | 32.4% bf16 MFU | 125446 tok/s 
step 12109/19560 | loss 3.388715 (+0.85z)| norm 0.3117 (+0.87z)| lr 2.78e-04 | 4164.50 ms | 32.4% bf16 MFU | 125468 tok/s step 12110/19560 | loss 3.393173 (+0.97z)| norm 0.2684 (+0.08z)| lr 2.78e-04 | 4167.45 ms | 32.4% bf16 MFU | 125485 tok/s step 12111/19560 | loss 3.369205 (+0.33z)| norm 0.2859 (+0.40z)| lr 2.78e-04 | 4170.72 ms | 32.4% bf16 MFU | 125496 tok/s step 12112/19560 | loss 3.433370 (+1.98z)| norm 0.2738 (+0.17z)| lr 2.78e-04 | 4275.40 ms | 31.6% bf16 MFU | 125353 tok/s step 12113/19560 | loss 3.342984 (-0.36z)| norm 0.2732 (+0.16z)| lr 2.78e-04 | 4174.30 ms | 32.3% bf16 MFU | 125365 tok/s step 12114/19560 | loss 3.354417 (-0.06z)| norm 0.2811 (+0.30z)| lr 2.78e-04 | 4182.43 ms | 32.3% bf16 MFU | 125365 tok/s step 12115/19560 | loss 3.358789 (+0.06z)| norm 0.2746 (+0.18z)| lr 2.78e-04 | 4173.97 ms | 32.3% bf16 MFU | 125377 tok/s step 12116/19560 | loss 3.357602 (+0.03z)| norm 0.2503 (-0.26z)| lr 2.78e-04 | 4174.99 ms | 32.3% bf16 MFU | 125387 tok/s step 12117/19560 | loss 3.410470 (+1.43z)| norm 0.2614 (-0.05z)| lr 2.78e-04 | 4177.35 ms | 32.3% bf16 MFU | 125393 tok/s step 12118/19560 | loss 3.341681 (-0.40z)| norm 0.2550 (-0.17z)| lr 2.78e-04 | 4178.12 ms | 32.3% bf16 MFU | 125398 tok/s step 12119/19560 | loss 3.337900 (-0.49z)| norm 0.2449 (-0.35z)| lr 2.78e-04 | 4523.91 ms | 29.8% bf16 MFU | 124922 tok/s step 12120/19560 | loss 3.314271 (-1.11z)| norm 0.2469 (-0.31z)| lr 2.77e-04 | 4785.16 ms | 28.2% bf16 MFU | 124154 tok/s step 12121/19560 | loss 3.382763 (+0.71z)| norm 0.2421 (-0.40z)| lr 2.77e-04 | 4679.83 ms | 28.9% bf16 MFU | 123548 tok/s step 12122/19560 | loss 3.389069 (+0.86z)| norm 0.2869 (+0.41z)| lr 2.77e-04 | 4426.73 ms | 30.5% bf16 MFU | 123293 tok/s step 12123/19560 | loss 3.370897 (+0.36z)| norm 0.2620 (-0.04z)| lr 2.77e-04 | 4391.70 ms | 30.7% bf16 MFU | 123097 tok/s step 12124/19560 | loss 3.326747 (-0.83z)| norm 0.2405 (-0.43z)| lr 2.77e-04 | 4472.28 ms | 30.2% bf16 MFU | 122804 tok/s step 12125/19560 | loss 3.319560 (-1.01z)| norm 
0.2504 (-0.25z)| lr 2.77e-04 | 4238.63 ms | 31.9% bf16 MFU | 122848 tok/s step 12126/19560 | loss 3.360318 (+0.09z)| norm 0.2435 (-0.38z)| lr 2.77e-04 | 4600.21 ms | 29.4% bf16 MFU | 122404 tok/s step 12127/19560 | loss 3.316674 (-1.09z)| norm 0.2459 (-0.33z)| lr 2.77e-04 | 4261.36 ms | 31.7% bf16 MFU | 122436 tok/s step 12128/19560 | loss 3.419497 (+1.66z)| norm 0.2605 (-0.07z)| lr 2.77e-04 | 4361.90 ms | 31.0% bf16 MFU | 122324 tok/s step 12129/19560 | loss 3.361673 (+0.10z)| norm 0.2357 (-0.52z)| lr 2.77e-04 | 4324.42 ms | 31.2% bf16 MFU | 122270 tok/s step 12130/19560 | loss 3.307148 (-1.34z)| norm 0.2724 (+0.14z)| lr 2.77e-04 | 4172.26 ms | 32.4% bf16 MFU | 122439 tok/s step 12131/19560 | loss 3.330499 (-0.72z)| norm 0.2425 (-0.40z)| lr 2.77e-04 | 4219.72 ms | 32.0% bf16 MFU | 122530 tok/s step 12132/19560 | loss 3.336617 (-0.55z)| norm 0.2770 (+0.22z)| lr 2.77e-04 | 4174.29 ms | 32.3% bf16 MFU | 122683 tok/s step 12133/19560 | loss 3.338117 (-0.49z)| norm 0.2530 (-0.22z)| lr 2.77e-04 | 4220.05 ms | 32.0% bf16 MFU | 122761 tok/s step 12134/19560 | loss 3.273488 (-2.20z)| norm 0.2495 (-0.28z)| lr 2.77e-04 | 4174.91 ms | 32.3% bf16 MFU | 122902 tok/s step 12135/19560 | loss 3.347203 (-0.23z)| norm 0.2530 (-0.22z)| lr 2.77e-04 | 4166.04 ms | 32.4% bf16 MFU | 123049 tok/s step 12136/19560 | loss 3.362547 (+0.18z)| norm 0.2381 (-0.49z)| lr 2.76e-04 | 4208.68 ms | 32.1% bf16 MFU | 123125 tok/s step 12137/19560 | loss 3.293715 (-1.65z)| norm 0.2378 (-0.49z)| lr 2.76e-04 | 4243.31 ms | 31.8% bf16 MFU | 123147 tok/s step 12138/19560 | loss 3.361508 (+0.17z)| norm 0.2371 (-0.50z)| lr 2.76e-04 | 4166.41 ms | 32.4% bf16 MFU | 123281 tok/s step 12139/19560 | loss 3.300769 (-1.45z)| norm 0.2270 (-0.67z)| lr 2.76e-04 | 4178.39 ms | 32.3% bf16 MFU | 123391 tok/s step 12140/19560 | loss 3.351173 (-0.10z)| norm 0.2517 (-0.22z)| lr 2.76e-04 | 4179.26 ms | 32.3% bf16 MFU | 123494 tok/s step 12141/19560 | loss 3.327756 (-0.73z)| norm 0.2380 (-0.47z)| lr 2.76e-04 | 4229.22 ms | 
31.9% bf16 MFU | 123518 tok/s step 12142/19560 | loss 3.369923 (+0.40z)| norm 0.2353 (-0.52z)| lr 2.76e-04 | 4328.07 ms | 31.2% bf16 MFU | 123399 tok/s step 12143/19560 | loss 3.269403 (-2.27z)| norm 0.2253 (-0.70z)| lr 2.76e-04 | 4183.79 ms | 32.3% bf16 MFU | 123494 tok/s step 12144/19560 | loss 3.311305 (-1.14z)| norm 0.2480 (-0.28z)| lr 2.76e-04 | 4213.33 ms | 32.0% bf16 MFU | 123541 tok/s step 12145/19560 | loss 3.332437 (-0.58z)| norm 0.2451 (-0.34z)| lr 2.76e-04 | 4223.60 ms | 32.0% bf16 MFU | 123571 tok/s step 12146/19560 | loss 3.350096 (-0.12z)| norm 0.2527 (-0.20z)| lr 2.76e-04 | 4172.83 ms | 32.4% bf16 MFU | 123675 tok/s step 12147/19560 | loss 3.357777 (+0.09z)| norm 0.2441 (-0.35z)| lr 2.76e-04 | 4202.54 ms | 32.1% bf16 MFU | 123729 tok/s step 12148/19560 | loss 3.342507 (-0.31z)| norm 0.2500 (-0.25z)| lr 2.76e-04 | 4176.93 ms | 32.3% bf16 MFU | 123818 tok/s step 12149/19560 | loss 3.318711 (-0.92z)| norm 0.2349 (-0.53z)| lr 2.76e-04 | 4281.80 ms | 31.5% bf16 MFU | 123750 tok/s step 12150/19560 | loss 3.403320 (+1.30z)| norm 0.2449 (-0.34z)| lr 2.76e-04 | 4223.10 ms | 32.0% bf16 MFU | 123769 tok/s step 12151/19560 | loss 3.274568 (-2.04z)| norm 0.2547 (-0.17z)| lr 2.75e-04 | 4172.96 ms | 32.4% bf16 MFU | 123863 tok/s step 12152/19560 | loss 3.353768 (+0.03z)| norm 0.2601 (-0.07z)| lr 2.75e-04 | 4199.30 ms | 32.2% bf16 MFU | 123912 tok/s step 12153/19560 | loss 3.341133 (-0.30z)| norm 0.2617 (-0.04z)| lr 2.75e-04 | 4187.38 ms | 32.2% bf16 MFU | 123977 tok/s step 12154/19560 | loss 3.331995 (-0.54z)| norm 0.2459 (-0.32z)| lr 2.75e-04 | 4402.13 ms | 30.7% bf16 MFU | 123733 tok/s step 12155/19560 | loss 3.341287 (-0.30z)| norm 0.2546 (-0.17z)| lr 2.75e-04 | 4185.82 ms | 32.3% bf16 MFU | 123809 tok/s step 12156/19560 | loss 3.268858 (-2.19z)| norm 0.2378 (-0.47z)| lr 2.75e-04 | 5003.17 ms | 27.0% bf16 MFU | 122858 tok/s step 12157/19560 | loss 3.333352 (-0.49z)| norm 0.2609 (-0.05z)| lr 2.75e-04 | 4189.75 ms | 32.2% bf16 MFU | 122972 tok/s step 12158/19560 
| loss 3.357653 (+0.15z)| norm 0.2589 (-0.09z)| lr 2.75e-04 | 4170.87 ms | 32.4% bf16 MFU | 123109 tok/s step 12159/19560 | loss 3.282260 (-1.82z)| norm 0.2556 (-0.15z)| lr 2.75e-04 | 4188.72 ms | 32.2% bf16 MFU | 123212 tok/s step 12160/19560 | loss 3.388812 (+0.99z)| norm 0.2397 (-0.43z)| lr 2.75e-04 | 4177.67 ms | 32.3% bf16 MFU | 123326 tok/s step 12161/19560 | loss 3.347440 (-0.12z)| norm 0.2380 (-0.46z)| lr 2.75e-04 | 4182.15 ms | 32.3% bf16 MFU | 123428 tok/s step 12162/19560 | loss 3.310802 (-1.10z)| norm 0.2445 (-0.34z)| lr 2.75e-04 | 4182.31 ms | 32.3% bf16 MFU | 123524 tok/s step 12163/19560 | loss 3.338591 (-0.32z)| norm 0.2438 (-0.35z)| lr 2.75e-04 | 4188.42 ms | 32.2% bf16 MFU | 123607 tok/s step 12164/19560 | loss 3.385202 (+0.98z)| norm 0.2505 (-0.23z)| lr 2.75e-04 | 4180.92 ms | 32.3% bf16 MFU | 123697 tok/s step 12165/19560 | loss 3.273859 (-2.11z)| norm 0.2538 (-0.17z)| lr 2.75e-04 | 4259.95 ms | 31.7% bf16 MFU | 123665 tok/s step 12166/19560 | loss 3.355990 (+0.19z)| norm 0.2410 (-0.39z)| lr 2.75e-04 | 4213.13 ms | 32.0% bf16 MFU | 123704 tok/s step 12167/19560 | loss 3.335341 (-0.40z)| norm 0.2587 (-0.07z)| lr 2.74e-04 | 4184.29 ms | 32.3% bf16 MFU | 123784 tok/s step 12168/19560 | loss 3.321563 (-0.77z)| norm 0.2505 (-0.21z)| lr 2.74e-04 | 4257.25 ms | 31.7% bf16 MFU | 123752 tok/s step 12169/19560 | loss 3.379769 (+0.90z)| norm 0.2566 (-0.10z)| lr 2.74e-04 | 4176.49 ms | 32.3% bf16 MFU | 123841 tok/s step 12170/19560 | loss 3.413965 (+1.86z)| norm 0.2499 (-0.22z)| lr 2.74e-04 | 4169.95 ms | 32.4% bf16 MFU | 123936 tok/s step 12171/19560 | loss 3.254746 (-2.60z)| norm 0.2357 (-0.48z)| lr 2.74e-04 | 4184.85 ms | 32.3% bf16 MFU | 124003 tok/s step 12172/19560 | loss 3.389860 (+1.14z)| norm 0.2589 (-0.05z)| lr 2.74e-04 | 4298.57 ms | 31.4% bf16 MFU | 123901 tok/s step 12173/19560 | loss 3.346639 (-0.06z)| norm 0.2502 (-0.21z)| lr 2.74e-04 | 4210.11 ms | 32.1% bf16 MFU | 123933 tok/s step 12174/19560 | loss 3.397667 (+1.34z)| norm 0.2627 (+0.02z)| 
lr 2.74e-04 | 4253.19 ms | 31.7% bf16 MFU | 123900 tok/s step 12175/19560 | loss 3.360944 (+0.32z)| norm 0.2358 (-0.46z)| lr 2.74e-04 | 4188.83 ms | 32.2% bf16 MFU | 123963 tok/s step 12176/19560 | loss 3.355342 (+0.16z)| norm 0.2399 (-0.39z)| lr 2.74e-04 | 4175.48 ms | 32.3% bf16 MFU | 124043 tok/s step 12177/19560 | loss 3.318538 (-0.85z)| norm 0.2533 (-0.14z)| lr 2.74e-04 | 4184.02 ms | 32.3% bf16 MFU | 124106 tok/s step 12178/19560 | loss 3.352510 (+0.09z)| norm 0.2469 (-0.26z)| lr 2.74e-04 | 4175.74 ms | 32.3% bf16 MFU | 124179 tok/s step 12179/19560 | loss 3.340181 (-0.25z)| norm 0.2565 (-0.08z)| lr 2.74e-04 | 4183.87 ms | 32.3% bf16 MFU | 124235 tok/s step 12180/19560 | loss 3.338454 (-0.30z)| norm 0.2502 (-0.20z)| lr 2.74e-04 | 4182.80 ms | 32.3% bf16 MFU | 124291 tok/s step 12181/19560 | loss 3.352931 (+0.11z)| norm 0.2672 (+0.11z)| lr 2.74e-04 | 4187.05 ms | 32.2% bf16 MFU | 124337 tok/s step 12182/19560 | loss 3.305394 (-1.19z)| norm 0.2571 (-0.07z)| lr 2.73e-04 | 4193.85 ms | 32.2% bf16 MFU | 124371 tok/s step 12183/19560 | loss 3.373226 (+0.67z)| norm 0.2518 (-0.16z)| lr 2.73e-04 | 4188.56 ms | 32.2% bf16 MFU | 124411 tok/s step 12184/19560 | loss 3.354667 (+0.16z)| norm 0.2526 (-0.15z)| lr 2.73e-04 | 4234.28 ms | 31.9% bf16 MFU | 124381 tok/s step 12185/19560 | loss 3.338283 (-0.29z)| norm 0.2536 (-0.13z)| lr 2.73e-04 | 4171.89 ms | 32.4% bf16 MFU | 124446 tok/s step 12186/19560 | loss 3.316137 (-0.89z)| norm 0.2404 (-0.37z)| lr 2.73e-04 | 4195.58 ms | 32.2% bf16 MFU | 124472 tok/s step 12187/19560 | loss 3.383180 (+0.99z)| norm 0.2581 (-0.04z)| lr 2.73e-04 | 4182.87 ms | 32.3% bf16 MFU | 124515 tok/s step 12188/19560 | loss 3.320834 (-0.77z)| norm 0.2448 (-0.29z)| lr 2.73e-04 | 4187.10 ms | 32.2% bf16 MFU | 124550 tok/s step 12189/19560 | loss 3.410093 (+1.84z)| norm 0.2485 (-0.21z)| lr 2.73e-04 | 4184.03 ms | 32.3% bf16 MFU | 124588 tok/s step 12190/19560 | loss 3.289330 (-1.65z)| norm 0.2357 (-0.44z)| lr 2.73e-04 | 4186.67 ms | 32.2% bf16 MFU | 
124620 tok/s step 12191/19560 | loss 3.414395 (+1.92z)| norm 0.2676 (+0.13z)| lr 2.73e-04 | 4176.11 ms | 32.3% bf16 MFU | 124666 tok/s step 12192/19560 | loss 3.355529 (+0.24z)| norm 0.2410 (-0.35z)| lr 2.73e-04 | 4173.08 ms | 32.4% bf16 MFU | 124715 tok/s step 12193/19560 | loss 3.323318 (-0.67z)| norm 0.2568 (-0.06z)| lr 2.73e-04 | 4182.08 ms | 32.3% bf16 MFU | 124747 tok/s step 12194/19560 | loss 3.477970 (+3.51z)| norm 0.2855 (+0.46z)| lr 2.73e-04 | 4190.42 ms | 32.2% bf16 MFU | 124766 tok/s step 12195/19560 | loss 3.410038 (+1.64z)| norm 0.2586 (-0.03z)| lr 2.73e-04 | 4188.82 ms | 32.2% bf16 MFU | 124785 tok/s step 12196/19560 | loss 3.356332 (+0.20z)| norm 0.2639 (+0.06z)| lr 2.73e-04 | 4188.64 ms | 32.2% bf16 MFU | 124805 tok/s step 12197/19560 | loss 3.327288 (-0.59z)| norm 0.2764 (+0.29z)| lr 2.73e-04 | 4187.63 ms | 32.2% bf16 MFU | 124824 tok/s step 12198/19560 | loss 3.330749 (-0.49z)| norm 0.2567 (-0.07z)| lr 2.72e-04 | 4175.56 ms | 32.3% bf16 MFU | 124861 tok/s step 12199/19560 | loss 3.324620 (-0.65z)| norm 0.2716 (+0.20z)| lr 2.72e-04 | 4181.25 ms | 32.3% bf16 MFU | 124888 tok/s step 12200/19560 | loss 3.357979 (+0.24z)| norm 0.2488 (-0.21z)| lr 2.72e-04 | 4180.21 ms | 32.3% bf16 MFU | 124914 tok/s step 12201/19560 | loss 3.289013 (-1.62z)| norm 0.2670 (+0.12z)| lr 2.72e-04 | 4182.87 ms | 32.3% bf16 MFU | 124936 tok/s step 12202/19560 | loss 3.301064 (-1.27z)| norm 0.2601 (-0.01z)| lr 2.72e-04 | 4179.58 ms | 32.3% bf16 MFU | 124961 tok/s step 12203/19560 | loss 3.309832 (-1.02z)| norm 0.2467 (-0.25z)| lr 2.72e-04 | 4181.99 ms | 32.3% bf16 MFU | 124981 tok/s step 12204/19560 | loss 3.350080 (+0.06z)| norm 0.2515 (-0.17z)| lr 2.72e-04 | 4186.41 ms | 32.3% bf16 MFU | 124994 tok/s step 12205/19560 | loss 3.320456 (-0.73z)| norm 0.2457 (-0.27z)| lr 2.72e-04 | 4180.84 ms | 32.3% bf16 MFU | 125014 tok/s step 12206/19560 | loss 3.352827 (+0.14z)| norm 0.2646 (+0.07z)| lr 2.72e-04 | 4189.68 ms | 32.2% bf16 MFU | 125021 tok/s step 12207/19560 | loss 3.276029 
(-1.90z)| norm 0.2354 (-0.45z)| lr 2.72e-04 | 4198.41 ms | 32.2% bf16 MFU | 125013 tok/s step 12208/19560 | loss 3.333436 (-0.35z)| norm 0.2668 (+0.11z)| lr 2.72e-04 | 4189.63 ms | 32.2% bf16 MFU | 125020 tok/s step 12209/19560 | loss 3.317857 (-0.76z)| norm 0.2749 (+0.26z)| lr 2.72e-04 | 4192.35 ms | 32.2% bf16 MFU | 125022 tok/s step 12210/19560 | loss 3.413707 (+1.77z)| norm 0.2681 (+0.13z)| lr 2.72e-04 | 4185.84 ms | 32.3% bf16 MFU | 125033 tok/s step 12211/19560 | loss 3.399514 (+1.38z)| norm 0.2665 (+0.10z)| lr 2.72e-04 | 4180.98 ms | 32.3% bf16 MFU | 125052 tok/s step 12212/19560 | loss 3.270416 (-1.97z)| norm 0.2627 (+0.03z)| lr 2.72e-04 | 4175.87 ms | 32.3% bf16 MFU | 125077 tok/s step 12213/19560 | loss 3.373265 (+0.70z)| norm 0.2414 (-0.35z)| lr 2.72e-04 | 4174.93 ms | 32.3% bf16 MFU | 125102 tok/s step 12214/19560 | loss 3.364831 (+0.47z)| norm 0.2492 (-0.21z)| lr 2.71e-04 | 4173.48 ms | 32.4% bf16 MFU | 125128 tok/s step 12215/19560 | loss 3.307946 (-1.00z)| norm 0.2428 (-0.32z)| lr 2.71e-04 | 4173.07 ms | 32.4% bf16 MFU | 125153 tok/s step 12216/19560 | loss 3.328475 (-0.46z)| norm 0.2435 (-0.30z)| lr 2.71e-04 | 4192.67 ms | 32.2% bf16 MFU | 125148 tok/s step 12217/19560 | loss 3.409169 (+1.61z)| norm 0.2365 (-0.43z)| lr 2.71e-04 | 4191.52 ms | 32.2% bf16 MFU | 125145 tok/s step 12218/19560 | loss 3.351820 (+0.14z)| norm 0.2589 (-0.02z)| lr 2.71e-04 | 4176.17 ms | 32.3% bf16 MFU | 125165 tok/s step 12219/19560 | loss 3.329701 (-0.42z)| norm 0.2456 (-0.26z)| lr 2.71e-04 | 4201.54 ms | 32.1% bf16 MFU | 125146 tok/s step 12220/19560 | loss 3.384773 (+0.99z)| norm 0.2592 (-0.01z)| lr 2.71e-04 | 4193.60 ms | 32.2% bf16 MFU | 125139 tok/s step 12221/19560 | loss 3.296345 (-1.27z)| norm 0.2381 (-0.40z)| lr 2.71e-04 | 4178.18 ms | 32.3% bf16 MFU | 125157 tok/s step 12222/19560 | loss 3.339613 (-0.15z)| norm 0.2486 (-0.20z)| lr 2.71e-04 | 4171.64 ms | 32.4% bf16 MFU | 125183 tok/s step 12223/19560 | loss 3.364647 (+0.48z)| norm 0.2562 (-0.07z)| lr 2.71e-04 | 
4183.10 ms | 32.3% bf16 MFU | 125190 tok/s step 12224/19560 | loss 3.353635 (+0.19z)| norm 0.2580 (-0.04z)| lr 2.71e-04 | 4265.46 ms | 31.7% bf16 MFU | 125076 tok/s step 12225/19560 | loss 3.299355 (-1.20z)| norm 0.2454 (-0.27z)| lr 2.71e-04 | 4178.74 ms | 32.3% bf16 MFU | 125096 tok/s step 12226/19560 | loss 3.358871 (+0.32z)| norm 0.2902 (+0.54z)| lr 2.71e-04 | 4482.93 ms | 30.1% bf16 MFU | 124689 tok/s step 12227/19560 | loss 3.373208 (+0.68z)| norm 0.2445 (-0.29z)| lr 2.71e-04 | 4174.72 ms | 32.3% bf16 MFU | 124734 tok/s step 12228/19560 | loss 3.299054 (-1.20z)| norm 0.2582 (-0.04z)| lr 2.71e-04 | 4168.04 ms | 32.4% bf16 MFU | 124786 tok/s step 12229/19560 | loss 3.342508 (-0.09z)| norm 0.2313 (-0.53z)| lr 2.70e-04 | 4207.40 ms | 32.1% bf16 MFU | 124778 tok/s step 12230/19560 | loss 3.358222 (+0.32z)| norm 0.2351 (-0.46z)| lr 2.70e-04 | 4183.79 ms | 32.3% bf16 MFU | 124804 tok/s step 12231/19560 | loss 3.288205 (-1.45z)| norm 0.2586 (+0.15z)| lr 2.70e-04 | 4191.49 ms | 32.2% bf16 MFU | 124818 tok/s step 12232/19560 | loss 3.324935 (-0.51z)| norm 0.2577 (+0.17z)| lr 2.70e-04 | 4184.13 ms | 32.3% bf16 MFU | 124843 tok/s step 12233/19560 | loss 3.301718 (-1.10z)| norm 0.2422 (-0.74z)| lr 2.70e-04 | 4191.10 ms | 32.2% bf16 MFU | 124855 tok/s step 12234/19560 | loss 3.315753 (-0.74z)| norm 0.2531 (-0.03z)| lr 2.70e-04 | 4173.91 ms | 32.3% bf16 MFU | 124893 tok/s step 12235/19560 | loss 3.345853 (+0.05z)| norm 0.2379 (-1.09z)| lr 2.70e-04 | 4184.78 ms | 32.3% bf16 MFU | 124913 tok/s step 12236/19560 | loss 3.327731 (-0.42z)| norm 0.2662 (+0.94z)| lr 2.70e-04 | 4183.01 ms | 32.3% bf16 MFU | 124934 tok/s step 12237/19560 | loss 3.355839 (+0.32z)| norm 0.2640 (+0.86z)| lr 2.70e-04 | 4179.25 ms | 32.3% bf16 MFU | 124960 tok/s step 12238/19560 | loss 3.279243 (-1.65z)| norm 0.2625 (+0.76z)| lr 2.70e-04 | 4793.49 ms | 28.2% bf16 MFU | 124180 tok/s step 12239/19560 | loss 3.382501 (+1.02z)| norm 0.2511 (-0.11z)| lr 2.70e-04 | 4163.90 ms | 32.4% bf16 MFU | 124267 tok/s step 
12240/19560 | loss 3.373619 (+0.82z)| norm 0.2515 (-0.06z)| lr 2.70e-04 | 4183.15 ms | 32.3% bf16 MFU | 124320 tok/s step 12241/19560 | loss 3.294299 (-1.25z)| norm 0.2386 (-1.08z)| lr 2.70e-04 | 4199.24 ms | 32.2% bf16 MFU | 124347 tok/s step 12242/19560 | loss 3.307745 (-0.89z)| norm 0.2533 (+0.13z)| lr 2.70e-04 | 4172.60 ms | 32.4% bf16 MFU | 124412 tok/s step 12243/19560 | loss 3.314806 (-0.69z)| norm 0.2575 (+0.49z)| lr 2.70e-04 | 5234.13 ms | 25.8% bf16 MFU | 123200 tok/s step 12244/19560 | loss 3.326324 (-0.39z)| norm 0.2720 (+1.67z)| lr 2.70e-04 | 4172.00 ms | 32.4% bf16 MFU | 123323 tok/s step 12245/19560 | loss 3.290834 (-1.29z)| norm 0.2514 (-0.03z)| lr 2.69e-04 | 4175.51 ms | 32.3% bf16 MFU | 123435 tok/s step 12246/19560 | loss 3.277428 (-1.62z)| norm 0.2751 (+1.90z)| lr 2.69e-04 | 4182.90 ms | 32.3% bf16 MFU | 123531 tok/s step 12247/19560 | loss 3.332300 (-0.19z)| norm 0.2632 (+0.91z)| lr 2.69e-04 | 4179.96 ms | 32.3% bf16 MFU | 123625 tok/s step 12248/19560 | loss 3.334154 (-0.15z)| norm 0.2838 (+2.51z)| lr 2.69e-04 | 4178.45 ms | 32.3% bf16 MFU | 123718 tok/s step 12249/19560 | loss 3.307430 (-0.83z)| norm 0.2890 (+2.81z)| lr 2.69e-04 | 4203.11 ms | 32.1% bf16 MFU | 123769 tok/s step 12250/19560 | loss 3.390668 (+1.34z)| norm 0.2660 (+1.07z)| lr 2.69e-04 | 4171.51 ms | 32.4% bf16 MFU | 123865 tok/s val loss 3.333118
HellaSwag: 2966/10042 = 0.295359 step 12251/19560 | loss 3.319265 (-0.51z)| norm 0.2690 (+1.29z)| lr 2.69e-04 | 4178.84 ms | 32.3% bf16 MFU | 123945 tok/s step 12252/19560 | loss 3.313324 (-0.66z)| norm 0.2659 (+1.04z)| lr 2.69e-04 | 4177.59 ms | 32.3% bf16 MFU | 124022 tok/s step 12253/19560 | loss 3.336196 (-0.07z)| norm 0.2449 (-0.62z)| lr 2.69e-04 | 4204.37 ms | 32.1% bf16 MFU | 124056 tok/s step 12254/19560 | loss 3.325837 (-0.33z)| norm 0.2790 (+2.02z)| lr 2.69e-04 | 4175.20 ms | 32.3% bf16 MFU | 124132 tok/s step 12255/19560 | loss 3.328896 (-0.25z)| norm 0.2385 (-1.12z)| lr 2.69e-04 | 5085.38 ms | 26.6% bf16 MFU | 123080 tok/s step 12256/19560 | loss 3.373448 (+0.93z)| norm 0.2626 (+0.74z)| lr 
2.69e-04 | 4161.78 ms | 32.4% bf16 MFU | 123225 tok/s step 12257/19560 | loss 3.371562 (+0.88z)| norm 0.2532 (+0.01z)| lr 2.69e-04 | 4171.81 ms | 32.4% bf16 MFU | 123348 tok/s step 12258/19560 | loss 3.322060 (-0.44z)| norm 0.2546 (+0.12z)| lr 2.69e-04 | 4247.81 ms | 31.8% bf16 MFU | 123351 tok/s step 12259/19560 | loss 3.360296 (+0.57z)| norm 0.2401 (-1.01z)| lr 2.69e-04 | 4326.73 ms | 31.2% bf16 MFU | 123243 tok/s step 12260/19560 | loss 3.365466 (+0.70z)| norm 0.2532 (+0.04z)| lr 2.69e-04 | 4300.24 ms | 31.4% bf16 MFU | 123176 tok/s step 12261/19560 | loss 3.382131 (+1.13z)| norm 0.2394 (-1.06z)| lr 2.68e-04 | 4185.47 ms | 32.3% bf16 MFU | 123281 tok/s step 12262/19560 | loss 3.339292 (-0.01z)| norm 0.2597 (+0.55z)| lr 2.68e-04 | 4206.34 ms | 32.1% bf16 MFU | 123349 tok/s step 12263/19560 | loss 3.318646 (-0.56z)| norm 0.2378 (-1.17z)| lr 2.68e-04 | 4185.13 ms | 32.3% bf16 MFU | 123445 tok/s step 12264/19560 | loss 3.312066 (-0.72z)| norm 0.2333 (-1.52z)| lr 2.68e-04 | 4194.43 ms | 32.2% bf16 MFU | 123523 tok/s step 12265/19560 | loss 3.280280 (-1.56z)| norm 0.2516 (-0.09z)| lr 2.68e-04 | 4222.51 ms | 32.0% bf16 MFU | 123555 tok/s step 12266/19560 | loss 3.345406 (+0.17z)| norm 0.2284 (-1.90z)| lr 2.68e-04 | 4173.52 ms | 32.4% bf16 MFU | 123658 tok/s step 12267/19560 | loss 3.403615 (+1.68z)| norm 0.2405 (-0.97z)| lr 2.68e-04 | 4317.01 ms | 31.3% bf16 MFU | 123548 tok/s step 12268/19560 | loss 3.337100 (-0.07z)| norm 0.2435 (-0.73z)| lr 2.68e-04 | 4189.64 ms | 32.2% bf16 MFU | 123627 tok/s step 12269/19560 | loss 3.333552 (-0.16z)| norm 0.2325 (-1.59z)| lr 2.68e-04 | 4172.52 ms | 32.4% bf16 MFU | 123728 tok/s step 12270/19560 | loss 3.345746 (+0.17z)| norm 0.2346 (-1.42z)| lr 2.68e-04 | 4178.72 ms | 32.3% bf16 MFU | 123815 tok/s step 12271/19560 | loss 3.339781 (-0.01z)| norm 0.2458 (-0.56z)| lr 2.68e-04 | 4168.48 ms | 32.4% bf16 MFU | 123913 tok/s step 12272/19560 | loss 3.290148 (-1.33z)| norm 0.2601 (+0.57z)| lr 2.68e-04 | 4238.99 ms | 31.9% bf16 MFU | 123902 
tok/s step 12273/19560 | loss 3.378577 (+1.02z)| norm 0.2707 (+1.40z)| lr 2.68e-04 | 4371.34 ms | 30.9% bf16 MFU | 123704 tok/s step 12274/19560 | loss 3.283800 (-1.47z)| norm 0.2540 (+0.07z)| lr 2.68e-04 | 4223.09 ms | 32.0% bf16 MFU | 123726 tok/s step 12275/19560 | loss 3.304444 (-0.92z)| norm 0.2555 (+0.18z)| lr 2.68e-04 | 4175.70 ms | 32.3% bf16 MFU | 123817 tok/s step 12276/19560 | loss 3.292946 (-1.20z)| norm 0.2847 (+2.44z)| lr 2.67e-04 | 4303.70 ms | 31.4% bf16 MFU | 123718 tok/s step 12277/19560 | loss 3.357822 (+0.49z)| norm 0.2448 (-0.68z)| lr 2.67e-04 | 4337.95 ms | 31.1% bf16 MFU | 123575 tok/s step 12278/19560 | loss 3.347585 (+0.23z)| norm 0.2746 (+1.62z)| lr 2.67e-04 | 4169.16 ms | 32.4% bf16 MFU | 123684 tok/s step 12279/19560 | loss 3.351639 (+0.33z)| norm 0.2715 (+1.35z)| lr 2.67e-04 | 4172.24 ms | 32.4% bf16 MFU | 123783 tok/s step 12280/19560 | loss 3.318606 (-0.55z)| norm 0.2813 (+2.06z)| lr 2.67e-04 | 4167.68 ms | 32.4% bf16 MFU | 123883 tok/s step 12281/19560 | loss 3.330670 (-0.22z)| norm 0.2550 (+0.07z)| lr 2.67e-04 | 4212.06 ms | 32.1% bf16 MFU | 123913 tok/s step 12282/19560 | loss 3.317812 (-0.56z)| norm 0.2816 (+2.04z)| lr 2.67e-04 | 4645.47 ms | 29.1% bf16 MFU | 123360 tok/s step 12283/19560 | loss 3.347451 (+0.23z)| norm 0.2533 (-0.08z)| lr 2.67e-04 | 4225.13 ms | 32.0% bf16 MFU | 123397 tok/s step 12284/19560 | loss 3.360928 (+0.57z)| norm 0.2580 (+0.27z)| lr 2.67e-04 | 4168.78 ms | 32.4% bf16 MFU | 123515 tok/s step 12285/19560 | loss 3.323560 (-0.43z)| norm 0.2544 (-0.00z)| lr 2.67e-04 | 4178.93 ms | 32.3% bf16 MFU | 123612 tok/s step 12286/19560 | loss 3.278561 (-1.61z)| norm 0.2609 (+0.49z)| lr 2.67e-04 | 4175.54 ms | 32.3% bf16 MFU | 123710 tok/s step 12287/19560 | loss 3.351220 (+0.32z)| norm 0.2827 (+2.08z)| lr 2.67e-04 | 4174.21 ms | 32.3% bf16 MFU | 123804 tok/s step 12288/19560 | loss 3.364910 (+0.69z)| norm 0.2547 (-0.00z)| lr 2.67e-04 | 4163.23 ms | 32.4% bf16 MFU | 123911 tok/s step 12289/19560 | loss 3.299894 
(-1.05z)| norm 0.2375 (-1.28z)| lr 2.67e-04 | 4166.13 ms | 32.4% bf16 MFU | 124008 tok/s step 12290/19560 | loss 3.305549 (-0.90z)| norm 0.2695 (+1.08z)| lr 2.67e-04 | 4166.25 ms | 32.4% bf16 MFU | 124099 tok/s step 12291/19560 | loss 3.328791 (-0.27z)| norm 0.2550 (-0.00z)| lr 2.67e-04 | 4170.61 ms | 32.4% bf16 MFU | 124180 tok/s step 12292/19560 | loss 3.335802 (-0.07z)| norm 0.2508 (-0.31z)| lr 2.66e-04 | 4164.55 ms | 32.4% bf16 MFU | 124265 tok/s step 12293/19560 | loss 3.363410 (+0.66z)| norm 0.2691 (+1.03z)| lr 2.66e-04 | 4169.69 ms | 32.4% bf16 MFU | 124339 tok/s step 12294/19560 | loss 3.374932 (+0.97z)| norm 0.2617 (+0.48z)| lr 2.66e-04 | 4161.37 ms | 32.4% bf16 MFU | 124422 tok/s step 12295/19560 | loss 3.363719 (+0.66z)| norm 0.2746 (+1.42z)| lr 2.66e-04 | 4182.51 ms | 32.3% bf16 MFU | 124468 tok/s step 12296/19560 | loss 3.293041 (-1.26z)| norm 0.2536 (-0.14z)| lr 2.66e-04 | 4179.03 ms | 32.3% bf16 MFU | 124518 tok/s step 12297/19560 | loss 3.392699 (+1.44z)| norm 0.2712 (+1.15z)| lr 2.66e-04 | 4171.55 ms | 32.4% bf16 MFU | 124576 tok/s step 12298/19560 | loss 3.308108 (-0.84z)| norm 0.2378 (-1.29z)| lr 2.66e-04 | 4172.94 ms | 32.4% bf16 MFU | 124629 tok/s step 12299/19560 | loss 3.352012 (+0.35z)| norm 0.2627 (+0.52z)| lr 2.66e-04 | 4170.91 ms | 32.4% bf16 MFU | 124683 tok/s step 12300/19560 | loss 3.388986 (+1.39z)| norm 0.2558 (+0.01z)| lr 2.66e-04 | 4167.96 ms | 32.4% bf16 MFU | 124738 tok/s step 12301/19560 | loss 3.352015 (+0.35z)| norm 0.2550 (-0.05z)| lr 2.66e-04 | 4175.95 ms | 32.3% bf16 MFU | 124778 tok/s step 12302/19560 | loss 3.296580 (-1.18z)| norm 0.2525 (-0.23z)| lr 2.66e-04 | 4178.27 ms | 32.3% bf16 MFU | 124814 tok/s step 12303/19560 | loss 3.278032 (-1.67z)| norm 0.2354 (-1.49z)| lr 2.66e-04 | 4171.34 ms | 32.4% bf16 MFU | 124857 tok/s step 12304/19560 | loss 3.325404 (-0.34z)| norm 0.2557 (+0.00z)| lr 2.66e-04 | 4173.82 ms | 32.3% bf16 MFU | 124895 tok/s step 12305/19560 | loss 3.330117 (-0.21z)| norm 0.2601 (+0.32z)| lr 2.66e-04 | 
4170.60 ms | 32.4% bf16 MFU | 124936 tok/s step 12306/19560 | loss 3.260408 (-2.10z)| norm 0.2509 (-0.36z)| lr 2.66e-04 | 4177.02 ms | 32.3% bf16 MFU | 124965 tok/s step 12307/19560 | loss 3.362518 (+0.69z)| norm 0.2564 (+0.04z)| lr 2.65e-04 | 4187.19 ms | 32.2% bf16 MFU | 124977 tok/s step 12308/19560 | loss 3.344483 (+0.20z)| norm 0.2767 (+1.52z)| lr 2.65e-04 | 4265.70 ms | 31.7% bf16 MFU | 124874 tok/s step 12309/19560 | loss 3.303605 (-0.91z)| norm 0.2541 (-0.13z)| lr 2.65e-04 | 4184.39 ms | 32.3% bf16 MFU | 124895 tok/s step 12310/19560 | loss 3.288305 (-1.32z)| norm 0.2517 (-0.31z)| lr 2.65e-04 | 4718.48 ms | 28.6% bf16 MFU | 124206 tok/s step 12311/19560 | loss 3.313613 (-0.62z)| norm 0.2496 (-0.46z)| lr 2.65e-04 | 4605.17 ms | 29.3% bf16 MFU | 123688 tok/s step 12312/19560 | loss 3.388905 (+1.42z)| norm 0.2516 (-0.31z)| lr 2.65e-04 | 4656.15 ms | 29.0% bf16 MFU | 123134 tok/s step 12313/19560 | loss 3.337959 (+0.04z)| norm 0.2390 (-1.22z)| lr 2.65e-04 | 4499.39 ms | 30.0% bf16 MFU | 122803 tok/s step 12314/19560 | loss 3.376894 (+1.07z)| norm 0.2487 (-0.53z)| lr 2.65e-04 | 4396.08 ms | 30.7% bf16 MFU | 122626 tok/s step 12315/19560 | loss 3.352024 (+0.41z)| norm 0.2471 (-0.63z)| lr 2.65e-04 | 4426.37 ms | 30.5% bf16 MFU | 122417 tok/s step 12316/19560 | loss 3.317997 (-0.51z)| norm 0.2467 (-0.67z)| lr 2.65e-04 | 4244.58 ms | 31.8% bf16 MFU | 122472 tok/s step 12317/19560 | loss 3.361293 (+0.68z)| norm 0.2817 (+1.86z)| lr 2.65e-04 | 4248.04 ms | 31.8% bf16 MFU | 122520 tok/s step 12318/19560 | loss 3.335336 (-0.04z)| norm 0.2566 (+0.03z)| lr 2.65e-04 | 4355.14 ms | 31.0% bf16 MFU | 122413 tok/s step 12319/19560 | loss 3.333237 (-0.08z)| norm 0.2631 (+0.51z)| lr 2.65e-04 | 4306.78 ms | 31.3% bf16 MFU | 122379 tok/s step 12320/19560 | loss 3.398772 (+1.74z)| norm 0.2595 (+0.24z)| lr 2.65e-04 | 4213.64 ms | 32.0% bf16 MFU | 122481 tok/s step 12321/19560 | loss 3.388252 (+1.42z)| norm 0.2490 (-0.53z)| lr 2.65e-04 | 4218.16 ms | 32.0% bf16 MFU | 122572 tok/s step 
12322/19560 | loss 3.325267 (-0.31z)| norm 0.2693 (+0.99z)| lr 2.65e-04 | 4209.96 ms | 32.1% bf16 MFU | 122670 tok/s step 12323/19560 | loss 3.359447 (+0.72z)| norm 0.2425 (-1.00z)| lr 2.64e-04 | 4160.62 ms | 32.5% bf16 MFU | 122837 tok/s step 12324/19560 | loss 3.328752 (-0.19z)| norm 0.2589 (+0.22z)| lr 2.64e-04 | 4171.10 ms | 32.4% bf16 MFU | 122980 tok/s step 12325/19560 | loss 3.367122 (+0.95z)| norm 0.2526 (-0.23z)| lr 2.64e-04 | 4172.33 ms | 32.4% bf16 MFU | 123114 tok/s step 12326/19560 | loss 3.408134 (+2.12z)| norm 0.2580 (+0.17z)| lr 2.64e-04 | 4171.46 ms | 32.4% bf16 MFU | 123242 tok/s step 12327/19560 | loss 3.295492 (-1.18z)| norm 0.2496 (-0.45z)| lr 2.64e-04 | 4163.36 ms | 32.4% bf16 MFU | 123377 tok/s step 12328/19560 | loss 3.355788 (+0.58z)| norm 0.2580 (+0.18z)| lr 2.64e-04 | 4167.04 ms | 32.4% bf16 MFU | 123499 tok/s step 12329/19560 | loss 3.392097 (+1.62z)| norm 0.2769 (+1.58z)| lr 2.64e-04 | 4238.61 ms | 31.9% bf16 MFU | 123509 tok/s step 12330/19560 | loss 3.405028 (+1.95z)| norm 0.2525 (-0.24z)| lr 2.64e-04 | 4309.71 ms | 31.3% bf16 MFU | 123416 tok/s step 12331/19560 | loss 3.303545 (-0.98z)| norm 0.2871 (+2.28z)| lr 2.64e-04 | 4173.08 ms | 32.4% bf16 MFU | 123527 tok/s step 12332/19560 | loss 3.440002 (+2.84z)| norm 0.2572 (+0.09z)| lr 2.64e-04 | 4209.99 ms | 32.1% bf16 MFU | 123577 tok/s step 12333/19560 | loss 3.347789 (+0.26z)| norm 0.2707 (+1.06z)| lr 2.64e-04 | 4184.52 ms | 32.3% bf16 MFU | 123663 tok/s step 12334/19560 | loss 3.353974 (+0.44z)| norm 0.2536 (-0.19z)| lr 2.64e-04 | 4165.71 ms | 32.4% bf16 MFU | 123773 tok/s step 12335/19560 | loss 3.342757 (+0.11z)| norm 0.2765 (+1.47z)| lr 2.64e-04 | 4204.01 ms | 32.1% bf16 MFU | 123820 tok/s step 12336/19560 | loss 3.362565 (+0.66z)| norm 0.2497 (-0.49z)| lr 2.64e-04 | 4168.88 ms | 32.4% bf16 MFU | 123917 tok/s step 12337/19560 | loss 3.364476 (+0.71z)| norm 0.2432 (-0.95z)| lr 2.64e-04 | 4167.60 ms | 32.4% bf16 MFU | 124011 tok/s step 12338/19560 | loss 3.340204 (+0.04z)| norm 
0.2561 (+0.01z)| lr 2.64e-04 | 4284.77 ms | 31.5% bf16 MFU | 123928 tok/s step 12339/19560 | loss 3.379438 (+1.18z)| norm 0.2540 (-0.14z)| lr 2.63e-04 | 4174.94 ms | 32.3% bf16 MFU | 124011 tok/s step 12340/19560 | loss 3.344958 (+0.17z)| norm 0.2487 (-0.52z)| lr 2.63e-04 | 4189.18 ms | 32.2% bf16 MFU | 124068 tok/s step 12341/19560 | loss 3.313361 (-0.75z)| norm 0.2541 (-0.13z)| lr 2.63e-04 | 4173.60 ms | 32.4% bf16 MFU | 124146 tok/s step 12342/19560 | loss 3.323187 (-0.45z)| norm 0.2559 (-0.00z)| lr 2.63e-04 | 4176.34 ms | 32.3% bf16 MFU | 124215 tok/s step 12343/19560 | loss 3.471810 (+3.68z)| norm 0.2584 (+0.17z)| lr 2.63e-04 | 4262.85 ms | 31.7% bf16 MFU | 124154 tok/s step 12344/19560 | loss 3.332772 (-0.20z)| norm 0.2633 (+0.53z)| lr 2.63e-04 | 4167.79 ms | 32.4% bf16 MFU | 124236 tok/s step 12345/19560 | loss 3.344778 (+0.15z)| norm 0.2490 (-0.55z)| lr 2.63e-04 | 4159.42 ms | 32.5% bf16 MFU | 124327 tok/s step 12346/19560 | loss 3.320061 (-0.54z)| norm 0.2619 (+0.43z)| lr 2.63e-04 | 4172.38 ms | 32.4% bf16 MFU | 124393 tok/s step 12347/19560 | loss 3.314174 (-0.70z)| norm 0.2601 (+0.28z)| lr 2.63e-04 | 4308.27 ms | 31.3% bf16 MFU | 124258 tok/s step 12348/19560 | loss 3.495014 (+4.11z)| norm 0.2842 (+2.05z)| lr 2.63e-04 | 4155.39 ms | 32.5% bf16 MFU | 124354 tok/s step 12349/19560 | loss 3.307693 (-0.86z)| norm 0.2579 (+0.09z)| lr 2.63e-04 | 4198.96 ms | 32.2% bf16 MFU | 124379 tok/s step 12350/19560 | loss 3.363629 (+0.62z)| norm 0.2537 (-0.23z)| lr 2.63e-04 | 4301.74 ms | 31.4% bf16 MFU | 124254 tok/s step 12351/19560 | loss 3.406300 (+1.73z)| norm 0.2511 (-0.43z)| lr 2.63e-04 | 4255.73 ms | 31.7% bf16 MFU | 124201 tok/s step 12352/19560 | loss 3.411022 (+1.82z)| norm 0.2526 (-0.31z)| lr 2.63e-04 | 4177.60 ms | 32.3% bf16 MFU | 124266 tok/s step 12353/19560 | loss 3.331074 (-0.26z)| norm 0.2747 (+1.33z)| lr 2.63e-04 | 4167.59 ms | 32.4% bf16 MFU | 124343 tok/s step 12354/19560 | loss 3.465937 (+3.10z)| norm 0.2564 (-0.02z)| lr 2.63e-04 | 4317.80 ms | 
31.3% bf16 MFU | 124197 tok/s step 12355/19560 | loss 3.352208 (+0.26z)| norm 0.2645 (+0.59z)| lr 2.62e-04 | 4173.17 ms | 32.4% bf16 MFU | 124269 tok/s step 12356/19560 | loss 3.334722 (-0.19z)| norm 0.2591 (+0.18z)| lr 2.62e-04 | 4306.72 ms | 31.4% bf16 MFU | 124142 tok/s step 12357/19560 | loss 3.429623 (+2.15z)| norm 0.2624 (+0.41z)| lr 2.62e-04 | 4321.55 ms | 31.2% bf16 MFU | 124001 tok/s step 12358/19560 | loss 3.372725 (+0.74z)| norm 0.2618 (+0.35z)| lr 2.62e-04 | 4248.83 ms | 31.8% bf16 MFU | 123971 tok/s step 12359/19560 | loss 3.347793 (+0.11z)| norm 0.2505 (-0.53z)| lr 2.62e-04 | 4168.90 ms | 32.4% bf16 MFU | 124060 tok/s step 12360/19560 | loss 3.524934 (+4.17z)| norm 0.9665 (+11.04z)| lr 2.62e-04 | 4245.47 ms | 31.8% bf16 MFU | 124032 tok/s step 12361/19560 | loss 3.378502 (+0.77z)| norm 0.3665 (+1.60z)| lr 2.62e-04 | 4183.47 ms | 32.3% bf16 MFU | 124097 tok/s step 12362/19560 | loss 3.285706 (-1.37z)| norm 0.3394 (+1.16z)| lr 2.62e-04 | 4179.61 ms | 32.3% bf16 MFU | 124164 tok/s step 12363/19560 | loss 3.338928 (-0.14z)| norm 0.3084 (+0.67z)| lr 2.62e-04 | 4172.47 ms | 32.4% bf16 MFU | 124238 tok/s step 12364/19560 | loss 3.405319 (+1.36z)| norm 0.3206 (+0.85z)| lr 2.62e-04 | 4172.65 ms | 32.4% bf16 MFU | 124309 tok/s step 12365/19560 | loss 3.336387 (-0.21z)| norm 0.2596 (-0.09z)| lr 2.62e-04 | 4178.11 ms | 32.3% bf16 MFU | 124368 tok/s step 12366/19560 | loss 3.339085 (-0.16z)| norm 0.2831 (+0.27z)| lr 2.62e-04 | 4184.98 ms | 32.3% bf16 MFU | 124413 tok/s step 12367/19560 | loss 3.334769 (-0.25z)| norm 0.2842 (+0.28z)| lr 2.62e-04 | 4168.53 ms | 32.4% bf16 MFU | 124481 tok/s step 12368/19560 | loss 3.347754 (+0.05z)| norm 0.2590 (-0.11z)| lr 2.62e-04 | 4179.93 ms | 32.3% bf16 MFU | 124529 tok/s step 12369/19560 | loss 3.395593 (+1.14z)| norm 0.2543 (-0.18z)| lr 2.62e-04 | 4161.11 ms | 32.4% bf16 MFU | 124602 tok/s step 12370/19560 | loss 3.336776 (-0.23z)| norm 0.2492 (-0.26z)| lr 2.61e-04 | 4168.87 ms | 32.4% bf16 MFU | 124660 tok/s step 12371/19560 
| loss 3.390041 (+1.00z)| norm 0.2709 (+0.08z)| lr 2.61e-04 | 4270.32 ms | 31.6% bf16 MFU | 124566 tok/s step 12372/19560 | loss 3.343143 (-0.10z)| norm 0.2532 (-0.20z)| lr 2.61e-04 | 4176.33 ms | 32.3% bf16 MFU | 124614 tok/s step 12373/19560 | loss 3.381786 (+0.79z)| norm 0.2475 (-0.28z)| lr 2.61e-04 | 4173.55 ms | 32.4% bf16 MFU | 124665 tok/s step 12374/19560 | loss 3.349183 (+0.01z)| norm 0.2463 (-0.30z)| lr 2.61e-04 | 4199.65 ms | 32.1% bf16 MFU | 124674 tok/s step 12375/19560 | loss 3.340761 (-0.19z)| norm 0.2456 (-0.31z)| lr 2.61e-04 | 4227.08 ms | 31.9% bf16 MFU | 124641 tok/s step 12376/19560 | loss 3.330651 (-0.42z)| norm 0.2494 (-0.24z)| lr 2.61e-04 | 4176.08 ms | 32.3% bf16 MFU | 124687 tok/s step 12377/19560 | loss 3.365340 (+0.39z)| norm 0.2496 (-0.24z)| lr 2.61e-04 | 4349.50 ms | 31.0% bf16 MFU | 124479 tok/s step 12378/19560 | loss 3.331184 (-0.41z)| norm 0.2399 (-0.38z)| lr 2.61e-04 | 4165.51 ms | 32.4% bf16 MFU | 124548 tok/s step 12379/19560 | loss 3.350502 (+0.04z)| norm 0.2430 (-0.33z)| lr 2.61e-04 | 4167.14 ms | 32.4% bf16 MFU | 124612 tok/s step 12380/19560 | loss 3.409266 (+1.41z)| norm 0.2369 (-0.42z)| lr 2.61e-04 | 4183.62 ms | 32.3% bf16 MFU | 124647 tok/s step 12381/19560 | loss 3.333313 (-0.38z)| norm 0.2624 (-0.03z)| lr 2.61e-04 | 4208.98 ms | 32.1% bf16 MFU | 124643 tok/s step 12382/19560 | loss 3.293503 (-1.31z)| norm 0.2373 (-0.41z)| lr 2.61e-04 | 4175.65 ms | 32.3% bf16 MFU | 124689 tok/s step 12383/19560 | loss 3.339645 (-0.23z)| norm 0.2456 (-0.29z)| lr 2.61e-04 | 4218.30 ms | 32.0% bf16 MFU | 124669 tok/s step 12384/19560 | loss 3.429443 (+1.85z)| norm 0.2414 (-0.35z)| lr 2.61e-04 | 5198.27 ms | 26.0% bf16 MFU | 123478 tok/s step 12385/19560 | loss 3.375840 (+0.60z)| norm 0.2826 (+0.28z)| lr 2.61e-04 | 4269.65 ms | 31.6% bf16 MFU | 123444 tok/s step 12386/19560 | loss 3.346383 (-0.09z)| norm 0.2657 (+0.02z)| lr 2.60e-04 | 4283.54 ms | 31.5% bf16 MFU | 123392 tok/s step 12387/19560 | loss 3.405015 (+1.27z)| norm 0.2630 (-0.02z)| 
lr 2.60e-04 | 4167.29 ms | 32.4% bf16 MFU | 123513 tok/s step 12388/19560 | loss 3.351725 (+0.03z)| norm 0.2655 (+0.01z)| lr 2.60e-04 | 4180.80 ms | 32.3% bf16 MFU | 123607 tok/s step 12389/19560 | loss 3.408672 (+1.34z)| norm 0.2479 (-0.26z)| lr 2.60e-04 | 4177.80 ms | 32.3% bf16 MFU | 123701 tok/s step 12390/19560 | loss 3.378188 (+0.63z)| norm 0.2662 (+0.02z)| lr 2.60e-04 | 4177.83 ms | 32.3% bf16 MFU | 123791 tok/s step 12391/19560 | loss 3.365680 (+0.33z)| norm 0.2404 (-0.37z)| lr 2.60e-04 | 4183.05 ms | 32.3% bf16 MFU | 123868 tok/s step 12392/19560 | loss 3.360307 (+0.20z)| norm 0.2489 (-0.25z)| lr 2.60e-04 | 4287.00 ms | 31.5% bf16 MFU | 123790 tok/s step 12393/19560 | loss 3.371247 (+0.44z)| norm 0.2454 (-0.30z)| lr 2.60e-04 | 4172.63 ms | 32.4% bf16 MFU | 123883 tok/s step 12394/19560 | loss 3.352220 (-0.00z)| norm 0.2319 (-0.51z)| lr 2.60e-04 | 4183.55 ms | 32.3% bf16 MFU | 123955 tok/s step 12395/19560 | loss 3.408122 (+1.30z)| norm 0.2374 (-0.42z)| lr 2.60e-04 | 4180.46 ms | 32.3% bf16 MFU | 124028 tok/s step 12396/19560 | loss 3.407409 (+1.27z)| norm 0.2379 (-0.41z)| lr 2.60e-04 | 4259.06 ms | 31.7% bf16 MFU | 123981 tok/s step 12397/19560 | loss 3.383923 (+0.71z)| norm 0.2361 (-0.44z)| lr 2.60e-04 | 4188.40 ms | 32.2% bf16 MFU | 124041 tok/s step 12398/19560 | loss 3.370363 (+0.39z)| norm 0.2290 (-0.55z)| lr 2.60e-04 | 4189.03 ms | 32.2% bf16 MFU | 124097 tok/s step 12399/19560 | loss 3.344764 (-0.20z)| norm 0.2387 (-0.40z)| lr 2.60e-04 | 4269.25 ms | 31.6% bf16 MFU | 124032 tok/s step 12400/19560 | loss 3.317802 (-0.84z)| norm 0.2338 (-0.47z)| lr 2.60e-04 | 4184.57 ms | 32.3% bf16 MFU | 124095 tok/s step 12401/19560 | loss 3.365765 (+0.28z)| norm 0.2519 (-0.19z)| lr 2.60e-04 | 4204.25 ms | 32.1% bf16 MFU | 124126 tok/s step 12402/19560 | loss 3.333461 (-0.48z)| norm 0.2451 (-0.30z)| lr 2.59e-04 | 4191.75 ms | 32.2% bf16 MFU | 124173 tok/s step 12403/19560 | loss 3.406907 (+1.23z)| norm 0.2415 (-0.35z)| lr 2.59e-04 | 4304.70 ms | 31.4% bf16 MFU | 
124054 tok/s step 12404/19560 | loss 3.368999 (+0.32z)| norm 0.2437 (-0.31z)| lr 2.59e-04 | 4197.75 ms | 32.2% bf16 MFU | 124096 tok/s step 12405/19560 | loss 3.376887 (+0.51z)| norm 0.2691 (+0.08z)| lr 2.59e-04 | 4164.41 ms | 32.4% bf16 MFU | 124186 tok/s step 12406/19560 | loss 3.369465 (+0.33z)| norm 0.2563 (-0.12z)| lr 2.59e-04 | 4175.25 ms | 32.3% bf16 MFU | 124256 tok/s step 12407/19560 | loss 3.355382 (-0.01z)| norm 0.2760 (+0.19z)| lr 2.59e-04 | 4264.24 ms | 31.7% bf16 MFU | 124190 tok/s step 12408/19560 | loss 3.365958 (+0.23z)| norm 0.2443 (-0.30z)| lr 2.59e-04 | 4181.34 ms | 32.3% bf16 MFU | 124250 tok/s step 12409/19560 | loss 3.353355 (-0.07z)| norm 0.2552 (-0.13z)| lr 2.59e-04 | 4211.33 ms | 32.1% bf16 MFU | 124262 tok/s step 12410/19560 | loss 3.367972 (+0.27z)| norm 0.2581 (-0.08z)| lr 2.59e-04 | 4179.61 ms | 32.3% bf16 MFU | 124321 tok/s step 12411/19560 | loss 3.448518 (+2.14z)| norm 0.2678 (+0.06z)| lr 2.59e-04 | 4176.64 ms | 32.3% bf16 MFU | 124382 tok/s step 12412/19560 | loss 3.299419 (-1.34z)| norm 0.2351 (-0.43z)| lr 2.59e-04 | 4170.55 ms | 32.4% bf16 MFU | 124448 tok/s step 12413/19560 | loss 3.332266 (-0.58z)| norm 0.2807 (+0.26z)| lr 2.59e-04 | 4198.31 ms | 32.2% bf16 MFU | 124470 tok/s step 12414/19560 | loss 3.353942 (-0.09z)| norm 0.2418 (-0.33z)| lr 2.59e-04 | 4172.47 ms | 32.4% bf16 MFU | 124529 tok/s step 12415/19560 | loss 3.333239 (-0.57z)| norm 0.2697 (+0.10z)| lr 2.59e-04 | 4309.18 ms | 31.3% bf16 MFU | 124386 tok/s step 12416/19560 | loss 3.307612 (-1.16z)| norm 0.2278 (-0.54z)| lr 2.59e-04 | 4946.01 ms | 27.3% bf16 MFU | 123467 tok/s step 12417/19560 | loss 3.342657 (-0.35z)| norm 0.2654 (+0.03z)| lr 2.59e-04 | 4167.27 ms | 32.4% bf16 MFU | 123584 tok/s step 12418/19560 | loss 3.342659 (-0.36z)| norm 0.2424 (-0.32z)| lr 2.58e-04 | 4167.22 ms | 32.4% bf16 MFU | 123695 tok/s step 12419/19560 | loss 3.340083 (-0.42z)| norm 0.2460 (-0.26z)| lr 2.58e-04 | 4181.46 ms | 32.3% bf16 MFU | 123780 tok/s step 12420/19560 | loss 3.422756 
(+1.52z)| norm 0.2383 (-0.38z)| lr 2.58e-04 | 4232.66 ms | 31.9% bf16 MFU | 123784 tok/s step 12421/19560 | loss 3.345748 (-0.30z)| norm 0.2412 (-0.33z)| lr 2.58e-04 | 4172.02 ms | 32.4% bf16 MFU | 123878 tok/s step 12422/19560 | loss 3.358110 (-0.00z)| norm 0.2490 (-0.21z)| lr 2.58e-04 | 4185.27 ms | 32.3% bf16 MFU | 123948 tok/s step 12423/19560 | loss 3.309293 (-1.14z)| norm 0.2366 (-0.39z)| lr 2.58e-04 | 4171.21 ms | 32.4% bf16 MFU | 124035 tok/s step 12424/19560 | loss 3.376237 (+0.42z)| norm 0.2453 (-0.26z)| lr 2.58e-04 | 4232.30 ms | 31.9% bf16 MFU | 124027 tok/s step 12425/19560 | loss 3.398768 (+0.95z)| norm 0.2249 (-0.57z)| lr 2.58e-04 | 4170.22 ms | 32.4% bf16 MFU | 124112 tok/s step 12426/19560 | loss 3.380890 (+0.52z)| norm 0.2404 (-0.33z)| lr 2.58e-04 | 4316.56 ms | 31.3% bf16 MFU | 123979 tok/s step 12427/19560 | loss 3.357219 (-0.04z)| norm 0.2493 (-0.19z)| lr 2.58e-04 | 4335.02 ms | 31.1% bf16 MFU | 123828 tok/s step 12428/19560 | loss 3.339751 (-0.45z)| norm 0.2419 (-0.30z)| lr 2.58e-04 | 4173.38 ms | 32.4% bf16 MFU | 123917 tok/s step 12429/19560 | loss 3.334638 (-0.57z)| norm 0.2307 (-0.47z)| lr 2.58e-04 | 4258.81 ms | 31.7% bf16 MFU | 123877 tok/s step 12430/19560 | loss 3.327541 (-0.75z)| norm 0.2398 (-0.33z)| lr 2.58e-04 | 4173.61 ms | 32.4% bf16 MFU | 123964 tok/s step 12431/19560 | loss 3.332599 (-0.65z)| norm 0.2406 (-0.32z)| lr 2.58e-04 | 4261.92 ms | 31.7% bf16 MFU | 123917 tok/s step 12432/19560 | loss 3.400576 (+0.99z)| norm 0.2369 (-0.37z)| lr 2.58e-04 | 4384.64 ms | 30.8% bf16 MFU | 123700 tok/s step 12433/19560 | loss 3.316200 (-1.05z)| norm 0.2331 (-0.43z)| lr 2.57e-04 | 4169.42 ms | 32.4% bf16 MFU | 123802 tok/s step 12434/19560 | loss 3.361028 (+0.01z)| norm 0.2312 (-0.45z)| lr 2.57e-04 | 4179.56 ms | 32.3% bf16 MFU | 123884 tok/s step 12435/19560 | loss 3.424925 (+1.57z)| norm 0.2590 (-0.03z)| lr 2.57e-04 | 4171.23 ms | 32.4% bf16 MFU | 123974 tok/s step 12436/19560 | loss 3.368131 (+0.17z)| norm 0.2355 (-0.38z)| lr 2.57e-04 | 
4263.07 ms | 31.7% bf16 MFU | 123925 tok/s step 12437/19560 | loss 3.337749 (-0.58z)| norm 0.2323 (-0.43z)| lr 2.57e-04 | 4255.69 ms | 31.7% bf16 MFU | 123888 tok/s step 12438/19560 | loss 3.399938 (+0.94z)| norm 0.2455 (-0.23z)| lr 2.57e-04 | 4361.16 ms | 31.0% bf16 MFU | 123705 tok/s step 12439/19560 | loss 3.336236 (-0.66z)| norm 0.2277 (-0.50z)| lr 2.57e-04 | 4187.53 ms | 32.2% bf16 MFU | 123780 tok/s step 12440/19560 | loss 3.377391 (+0.38z)| norm 0.2531 (-0.11z)| lr 2.57e-04 | 4192.01 ms | 32.2% bf16 MFU | 123844 tok/s step 12441/19560 | loss 3.338282 (-0.60z)| norm 0.2325 (-0.42z)| lr 2.57e-04 | 4171.07 ms | 32.4% bf16 MFU | 123937 tok/s step 12442/19560 | loss 3.386940 (+0.61z)| norm 0.2797 (+0.29z)| lr 2.57e-04 | 4230.73 ms | 31.9% bf16 MFU | 123936 tok/s step 12443/19560 | loss 3.409073 (+1.15z)| norm 0.2521 (-0.13z)| lr 2.57e-04 | 4171.57 ms | 32.4% bf16 MFU | 124023 tok/s step 12444/19560 | loss 3.348782 (-0.36z)| norm 0.2639 (+0.05z)| lr 2.57e-04 | 4178.75 ms | 32.3% bf16 MFU | 124095 tok/s step 12445/19560 | loss 3.294253 (-1.69z)| norm 0.2299 (-0.46z)| lr 2.57e-04 | 4178.59 ms | 32.3% bf16 MFU | 124164 tok/s step 12446/19560 | loss 3.371339 (+0.21z)| norm 0.2652 (+0.07z)| lr 2.57e-04 | 4185.50 ms | 32.3% bf16 MFU | 124219 tok/s step 12447/19560 | loss 3.342711 (-0.50z)| norm 0.2284 (-0.48z)| lr 2.57e-04 | 4212.30 ms | 32.1% bf16 MFU | 124231 tok/s step 12448/19560 | loss 3.445603 (+2.01z)| norm 0.2634 (+0.05z)| lr 2.57e-04 | 4176.41 ms | 32.3% bf16 MFU | 124297 tok/s step 12449/19560 | loss 3.344263 (-0.46z)| norm 0.2662 (+0.09z)| lr 2.56e-04 | 4180.94 ms | 32.3% bf16 MFU | 124352 tok/s step 12450/19560 | loss 3.346161 (-0.42z)| norm 0.2708 (+0.16z)| lr 2.56e-04 | 4177.06 ms | 32.3% bf16 MFU | 124410 tok/s step 12451/19560 | loss 3.378922 (+0.38z)| norm 0.2553 (-0.08z)| lr 2.56e-04 | 4186.21 ms | 32.3% bf16 MFU | 124452 tok/s step 12452/19560 | loss 3.354440 (-0.22z)| norm 0.2723 (+0.18z)| lr 2.56e-04 | 4206.90 ms | 32.1% bf16 MFU | 124460 tok/s step 
12453/19560 | loss 3.362892 (-0.01z)| norm 0.2572 (-0.05z)| lr 2.56e-04 | 4166.27 ms | 32.4% bf16 MFU | 124529 tok/s step 12454/19560 | loss 3.304398 (-1.43z)| norm 0.2644 (+0.06z)| lr 2.56e-04 | 4187.17 ms | 32.2% bf16 MFU | 124564 tok/s step 12455/19560 | loss 3.481068 (+2.81z)| norm 0.2438 (-0.25z)| lr 2.56e-04 | 4181.20 ms | 32.3% bf16 MFU | 124605 tok/s step 12456/19560 | loss 3.351672 (-0.30z)| norm 0.2751 (+0.22z)| lr 2.56e-04 | 4179.59 ms | 32.3% bf16 MFU | 124647 tok/s step 12457/19560 | loss 3.330636 (-0.79z)| norm 0.2422 (-0.27z)| lr 2.56e-04 | 4223.33 ms | 32.0% bf16 MFU | 124621 tok/s step 12458/19560 | loss 3.336748 (-0.63z)| norm 0.2997 (+0.59z)| lr 2.56e-04 | 4189.62 ms | 32.2% bf16 MFU | 124647 tok/s step 12459/19560 | loss 3.340117 (-0.56z)| norm 0.2460 (-0.22z)| lr 2.56e-04 | 4179.99 ms | 32.3% bf16 MFU | 124686 tok/s step 12460/19560 | loss 3.387582 (+0.60z)| norm 0.2736 (+0.20z)| lr 2.56e-04 | 4167.83 ms | 32.4% bf16 MFU | 124742 tok/s step 12461/19560 | loss 3.322457 (-0.98z)| norm 0.2600 (-0.01z)| lr 2.56e-04 | 4185.02 ms | 32.3% bf16 MFU | 124768 tok/s step 12462/19560 | loss 3.385410 (+0.55z)| norm 0.2508 (-0.15z)| lr 2.56e-04 | 4177.01 ms | 32.3% bf16 MFU | 124806 tok/s step 12463/19560 | loss 3.327800 (-0.85z)| norm 0.2541 (-0.09z)| lr 2.56e-04 | 4186.01 ms | 32.3% bf16 MFU | 124828 tok/s step 12464/19560 | loss 3.350912 (-0.29z)| norm 0.2647 (+0.07z)| lr 2.56e-04 | 4173.15 ms | 32.4% bf16 MFU | 124868 tok/s step 12465/19560 | loss 3.376259 (+0.33z)| norm 0.2665 (+0.09z)| lr 2.55e-04 | 4165.22 ms | 32.4% bf16 MFU | 124919 tok/s step 12466/19560 | loss 3.339801 (-0.56z)| norm 0.2695 (+0.13z)| lr 2.55e-04 | 4197.73 ms | 32.2% bf16 MFU | 124918 tok/s step 12467/19560 | loss 3.331613 (-0.75z)| norm 0.2636 (+0.04z)| lr 2.55e-04 | 4201.31 ms | 32.1% bf16 MFU | 124911 tok/s step 12468/19560 | loss 3.369994 (+0.18z)| norm 0.2577 (-0.05z)| lr 2.55e-04 | 4180.85 ms | 32.3% bf16 MFU | 124936 tok/s step 12469/19560 | loss 3.387283 (+0.59z)| norm 
0.2527 (-0.12z)| lr 2.55e-04 | 4186.81 ms | 32.2% bf16 MFU | 124950 tok/s step 12470/19560 | loss 3.353619 (-0.24z)| norm 0.2788 (+0.27z)| lr 2.55e-04 | 4244.60 ms | 31.8% bf16 MFU | 124879 tok/s step 12471/19560 | loss 3.320350 (-1.05z)| norm 0.2425 (-0.28z)| lr 2.55e-04 | 4167.00 ms | 32.4% bf16 MFU | 124926 tok/s step 12472/19560 | loss 3.323825 (-0.96z)| norm 0.2909 (+0.45z)| lr 2.55e-04 | 4170.79 ms | 32.4% bf16 MFU | 124965 tok/s step 12473/19560 | loss 3.343174 (-0.48z)| norm 0.2606 (-0.01z)| lr 2.55e-04 | 4173.96 ms | 32.3% bf16 MFU | 124997 tok/s step 12474/19560 | loss 3.278518 (-2.06z)| norm 0.2760 (+0.22z)| lr 2.55e-04 | 4167.12 ms | 32.4% bf16 MFU | 125038 tok/s step 12475/19560 | loss 3.366485 (+0.10z)| norm 0.2499 (-0.17z)| lr 2.55e-04 | 4177.51 ms | 32.3% bf16 MFU | 125061 tok/s step 12476/19560 | loss 3.336270 (-0.65z)| norm 0.2507 (-0.15z)| lr 2.55e-04 | 4167.32 ms | 32.4% bf16 MFU | 125098 tok/s step 12477/19560 | loss 3.329388 (-0.83z)| norm 0.2770 (+0.24z)| lr 2.55e-04 | 4171.65 ms | 32.4% bf16 MFU | 125127 tok/s step 12478/19560 | loss 3.359312 (-0.05z)| norm 0.2468 (-0.21z)| lr 2.55e-04 | 4178.07 ms | 32.3% bf16 MFU | 125145 tok/s step 12479/19560 | loss 3.339477 (-0.56z)| norm 0.2610 (-0.00z)| lr 2.55e-04 | 4166.86 ms | 32.4% bf16 MFU | 125179 tok/s step 12480/19560 | loss 3.350101 (-0.27z)| norm 0.2627 (+0.02z)| lr 2.55e-04 | 4188.50 ms | 32.2% bf16 MFU | 125179 tok/s step 12481/19560 | loss 3.306437 (-1.41z)| norm 0.2398 (-0.32z)| lr 2.54e-04 | 4193.80 ms | 32.2% bf16 MFU | 125171 tok/s step 12482/19560 | loss 3.347179 (-0.32z)| norm 0.2606 (-0.01z)| lr 2.54e-04 | 4176.81 ms | 32.3% bf16 MFU | 125188 tok/s step 12483/19560 | loss 3.339959 (-0.52z)| norm 0.2547 (-0.09z)| lr 2.54e-04 | 4173.22 ms | 32.4% bf16 MFU | 125210 tok/s step 12484/19560 | loss 3.354269 (-0.13z)| norm 0.2526 (-0.12z)| lr 2.54e-04 | 4187.01 ms | 32.2% bf16 MFU | 125211 tok/s step 12485/19560 | loss 3.324121 (-0.94z)| norm 0.2391 (-0.33z)| lr 2.54e-04 | 4164.90 ms | 
32.4% bf16 MFU | 125244 tok/s step 12486/19560 | loss 3.376422 (+0.49z)| norm 0.2430 (-0.26z)| lr 2.54e-04 | 4183.79 ms | 32.3% bf16 MFU | 125248 tok/s step 12487/19560 | loss 3.341733 (-0.45z)| norm 0.2420 (-0.28z)| lr 2.54e-04 | 4184.67 ms | 32.3% bf16 MFU | 125250 tok/s step 12488/19560 | loss 3.431598 (+2.17z)| norm 0.2631 (+0.39z)| lr 2.54e-04 | 4194.53 ms | 32.2% bf16 MFU | 125237 tok/s step 12489/19560 | loss 3.385316 (+0.81z)| norm 0.2425 (-0.62z)| lr 2.54e-04 | 4175.41 ms | 32.3% bf16 MFU | 125253 tok/s step 12490/19560 | loss 3.382537 (+0.72z)| norm 0.2362 (-1.00z)| lr 2.54e-04 | 4176.59 ms | 32.3% bf16 MFU | 125267 tok/s step 12491/19560 | loss 3.384705 (+0.77z)| norm 0.2430 (-0.59z)| lr 2.54e-04 | 5347.95 ms | 25.2% bf16 MFU | 123906 tok/s step 12492/19560 | loss 3.305486 (-1.56z)| norm 0.2494 (-0.18z)| lr 2.54e-04 | 4171.99 ms | 32.4% bf16 MFU | 123994 tok/s step 12493/19560 | loss 3.321111 (-1.09z)| norm 0.2549 (+0.19z)| lr 2.54e-04 | 4171.79 ms | 32.4% bf16 MFU | 124078 tok/s step 12494/19560 | loss 3.348490 (-0.28z)| norm 0.2345 (-1.15z)| lr 2.54e-04 | 4163.36 ms | 32.4% bf16 MFU | 124170 tok/s step 12495/19560 | loss 3.356117 (-0.06z)| norm 0.2375 (-0.94z)| lr 2.54e-04 | 4187.98 ms | 32.2% bf16 MFU | 124221 tok/s step 12496/19560 | loss 3.388299 (+0.88z)| norm 0.2469 (-0.29z)| lr 2.54e-04 | 4184.97 ms | 32.3% bf16 MFU | 124274 tok/s step 12497/19560 | loss 3.342284 (-0.47z)| norm 0.2633 (+0.82z)| lr 2.53e-04 | 4181.04 ms | 32.3% bf16 MFU | 124330 tok/s step 12498/19560 | loss 3.344709 (-0.40z)| norm 0.2369 (-0.97z)| lr 2.53e-04 | 4162.66 ms | 32.4% bf16 MFU | 124411 tok/s step 12499/19560 | loss 3.316421 (-1.22z)| norm 0.2369 (-0.95z)| lr 2.53e-04 | 4160.85 ms | 32.4% bf16 MFU | 124491 tok/s step 12500/19560 | loss 3.520241 (+4.41z)| norm 0.2499 (-0.06z)| lr 2.53e-04 | 4168.93 ms | 32.4% bf16 MFU | 124555 tok/s val loss 3.327497 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 
evaluating HellaSwag:
680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2981/10042 = 0.296853 step 12501/19560 | loss 3.291255 (-1.81z)| norm 0.2449 (-0.41z)| lr 2.53e-04 | 4861.49 ms | 27.8% bf16 MFU | 123719 tok/s step 12502/19560 | loss 3.341813 (-0.44z)| 
norm 0.2259 (-1.67z)| lr 2.53e-04 | 4744.40 ms | 28.5% bf16 MFU | 123058 tok/s step 12503/19560 | loss 3.378121 (+0.53z)| norm 0.2434 (-0.49z)| lr 2.53e-04 | 4398.22 ms | 30.7% bf16 MFU | 122866 tok/s step 12504/19560 | loss 3.309602 (-1.31z)| norm 0.2609 (+0.68z)| lr 2.53e-04 | 4242.04 ms | 31.8% bf16 MFU | 122902 tok/s step 12505/19560 | loss 3.322761 (-0.94z)| norm 0.2561 (+0.36z)| lr 2.53e-04 | 4403.25 ms | 30.7% bf16 MFU | 122710 tok/s step 12506/19560 | loss 3.300746 (-1.51z)| norm 0.2750 (+1.59z)| lr 2.53e-04 | 4259.08 ms | 31.7% bf16 MFU | 122730 tok/s step 12507/19560 | loss 3.354557 (-0.08z)| norm 0.2619 (+0.71z)| lr 2.53e-04 | 4193.73 ms | 32.2% bf16 MFU | 122844 tok/s step 12508/19560 | loss 3.437428 (+2.09z)| norm 0.2478 (-0.23z)| lr 2.53e-04 | 4167.12 ms | 32.4% bf16 MFU | 122993 tok/s step 12509/19560 | loss 3.348250 (-0.26z)| norm 0.2649 (+0.90z)| lr 2.53e-04 | 4246.01 ms | 31.8% bf16 MFU | 123017 tok/s step 12510/19560 | loss 3.357433 (-0.03z)| norm 0.2588 (+0.49z)| lr 2.53e-04 | 4166.78 ms | 32.4% bf16 MFU | 123157 tok/s step 12511/19560 | loss 3.364487 (+0.15z)| norm 0.2616 (+0.66z)| lr 2.53e-04 | 4407.03 ms | 30.6% bf16 MFU | 122948 tok/s step 12512/19560 | loss 3.355575 (-0.07z)| norm 0.2536 (+0.13z)| lr 2.53e-04 | 4259.24 ms | 31.7% bf16 MFU | 122955 tok/s step 12513/19560 | loss 3.351772 (-0.17z)| norm 0.2721 (+1.38z)| lr 2.52e-04 | 4169.30 ms | 32.4% bf16 MFU | 123095 tok/s step 12514/19560 | loss 3.322358 (-0.96z)| norm 0.2473 (-0.28z)| lr 2.52e-04 | 4167.21 ms | 32.4% bf16 MFU | 123231 tok/s step 12515/19560 | loss 3.457328 (+2.62z)| norm 0.2502 (-0.08z)| lr 2.52e-04 | 4306.79 ms | 31.3% bf16 MFU | 123156 tok/s step 12516/19560 | loss 3.352641 (-0.15z)| norm 0.2476 (-0.25z)| lr 2.52e-04 | 4174.13 ms | 32.3% bf16 MFU | 123278 tok/s step 12517/19560 | loss 3.328630 (-0.77z)| norm 0.2368 (-0.97z)| lr 2.52e-04 | 4174.94 ms | 32.3% bf16 MFU | 123394 tok/s step 12518/19560 | loss 3.334337 (-0.61z)| norm 0.2445 (-0.44z)| lr 2.52e-04 | 4208.58 ms 
| 32.1% bf16 MFU | 123453 tok/s step 12519/19560 | loss 3.326475 (-0.81z)| norm 0.2428 (-0.56z)| lr 2.52e-04 | 4178.05 ms | 32.3% bf16 MFU | 123554 tok/s step 12520/19560 | loss 3.368331 (+0.30z)| norm 0.2519 (+0.06z)| lr 2.52e-04 | 4162.50 ms | 32.4% bf16 MFU | 123674 tok/s step 12521/19560 | loss 3.357421 (+0.01z)| norm 0.2479 (-0.21z)| lr 2.52e-04 | 4161.57 ms | 32.4% bf16 MFU | 123790 tok/s step 12522/19560 | loss 3.387991 (+0.82z)| norm 0.2499 (-0.09z)| lr 2.52e-04 | 4161.26 ms | 32.4% bf16 MFU | 123900 tok/s step 12523/19560 | loss 3.353314 (-0.09z)| norm 0.2477 (-0.24z)| lr 2.52e-04 | 4174.40 ms | 32.3% bf16 MFU | 123985 tok/s step 12524/19560 | loss 3.362910 (+0.17z)| norm 0.2609 (+0.65z)| lr 2.52e-04 | 4175.22 ms | 32.3% bf16 MFU | 124064 tok/s step 12525/19560 | loss 3.362024 (+0.16z)| norm 0.2436 (-0.55z)| lr 2.52e-04 | 4209.60 ms | 32.1% bf16 MFU | 124088 tok/s step 12526/19560 | loss 3.298802 (-1.52z)| norm 0.2350 (-1.15z)| lr 2.52e-04 | 4162.00 ms | 32.4% bf16 MFU | 124182 tok/s step 12527/19560 | loss 3.341943 (-0.36z)| norm 0.2429 (-0.61z)| lr 2.52e-04 | 4220.76 ms | 32.0% bf16 MFU | 124184 tok/s step 12528/19560 | loss 3.380430 (+0.65z)| norm 0.2350 (-1.16z)| lr 2.52e-04 | 4173.12 ms | 32.4% bf16 MFU | 124256 tok/s step 12529/19560 | loss 3.338798 (-0.46z)| norm 0.2501 (-0.10z)| lr 2.51e-04 | 4177.26 ms | 32.3% bf16 MFU | 124319 tok/s step 12530/19560 | loss 3.337418 (-0.49z)| norm 0.2511 (-0.04z)| lr 2.51e-04 | 4165.40 ms | 32.4% bf16 MFU | 124397 tok/s step 12531/19560 | loss 3.329547 (-0.69z)| norm 0.2520 (+0.02z)| lr 2.51e-04 | 4186.10 ms | 32.3% bf16 MFU | 124439 tok/s step 12532/19560 | loss 3.322177 (-0.88z)| norm 0.2372 (-1.01z)| lr 2.51e-04 | 4204.52 ms | 32.1% bf16 MFU | 124452 tok/s step 12533/19560 | loss 3.346806 (-0.21z)| norm 0.2372 (-0.99z)| lr 2.51e-04 | 4179.27 ms | 32.3% bf16 MFU | 124502 tok/s step 12534/19560 | loss 3.343564 (-0.29z)| norm 0.2484 (-0.21z)| lr 2.51e-04 | 4178.26 ms | 32.3% bf16 MFU | 124551 tok/s step 
12535/19560 | loss 3.267889 (-2.26z)| norm 0.2505 (-0.05z)| lr 2.51e-04 | 4170.89 ms | 32.4% bf16 MFU | 124608 tok/s step 12536/19560 | loss 3.333636 (-0.52z)| norm 0.2853 (+2.34z)| lr 2.51e-04 | 4177.80 ms | 32.3% bf16 MFU | 124652 tok/s step 12537/19560 | loss 3.352946 (-0.02z)| norm 0.2798 (+1.92z)| lr 2.51e-04 | 4167.91 ms | 32.4% bf16 MFU | 124709 tok/s step 12538/19560 | loss 3.382265 (+0.75z)| norm 0.2674 (+1.07z)| lr 2.51e-04 | 4164.46 ms | 32.4% bf16 MFU | 124769 tok/s step 12539/19560 | loss 3.368023 (+0.40z)| norm 0.2496 (-0.14z)| lr 2.51e-04 | 4171.45 ms | 32.4% bf16 MFU | 124815 tok/s step 12540/19560 | loss 3.391675 (+1.02z)| norm 0.2747 (+1.55z)| lr 2.51e-04 | 4225.34 ms | 32.0% bf16 MFU | 124778 tok/s step 12541/19560 | loss 3.429340 (+1.99z)| norm 0.2366 (-1.03z)| lr 2.51e-04 | 4177.92 ms | 32.3% bf16 MFU | 124814 tok/s step 12542/19560 | loss 3.355122 (+0.01z)| norm 0.2798 (+1.90z)| lr 2.51e-04 | 4172.85 ms | 32.4% bf16 MFU | 124855 tok/s step 12543/19560 | loss 3.433534 (+2.05z)| norm 0.2443 (-0.50z)| lr 2.51e-04 | 4172.74 ms | 32.4% bf16 MFU | 124894 tok/s step 12544/19560 | loss 3.363375 (+0.20z)| norm 0.2479 (-0.27z)| lr 2.50e-04 | 4166.72 ms | 32.4% bf16 MFU | 124941 tok/s step 12545/19560 | loss 3.309498 (-1.21z)| norm 0.2390 (-0.87z)| lr 2.50e-04 | 4181.53 ms | 32.3% bf16 MFU | 124963 tok/s step 12546/19560 | loss 3.361397 (+0.15z)| norm 0.2308 (-1.42z)| lr 2.50e-04 | 4172.37 ms | 32.4% bf16 MFU | 124998 tok/s step 12547/19560 | loss 3.339174 (-0.43z)| norm 0.2562 (+0.31z)| lr 2.50e-04 | 4174.12 ms | 32.3% bf16 MFU | 125028 tok/s step 12548/19560 | loss 3.334921 (-0.53z)| norm 0.2314 (-1.37z)| lr 2.50e-04 | 4181.43 ms | 32.3% bf16 MFU | 125046 tok/s step 12549/19560 | loss 3.368678 (+0.36z)| norm 0.2949 (+2.84z)| lr 2.50e-04 | 4218.00 ms | 32.0% bf16 MFU | 125009 tok/s step 12550/19560 | loss 3.355248 (+0.00z)| norm 0.2496 (-0.16z)| lr 2.50e-04 | 4180.69 ms | 32.3% bf16 MFU | 125029 tok/s step 12551/19560 | loss 3.473933 (+3.02z)| norm 
0.2477 (-0.29z)| lr 2.50e-04 | 4177.94 ms | 32.3% bf16 MFU | 125052 tok/s step 12552/19560 | loss 3.274014 (-2.06z)| norm 0.2555 (+0.22z)| lr 2.50e-04 | 4177.00 ms | 32.3% bf16 MFU | 125075 tok/s step 12553/19560 | loss 3.369324 (+0.35z)| norm 0.2427 (-0.65z)| lr 2.50e-04 | 4180.08 ms | 32.3% bf16 MFU | 125092 tok/s step 12554/19560 | loss 3.386782 (+0.79z)| norm 0.2513 (-0.07z)| lr 2.50e-04 | 4167.31 ms | 32.4% bf16 MFU | 125128 tok/s step 12555/19560 | loss 3.393587 (+0.96z)| norm 0.2586 (+0.41z)| lr 2.50e-04 | 4171.76 ms | 32.4% bf16 MFU | 125156 tok/s step 12556/19560 | loss 3.342152 (-0.34z)| norm 0.2545 (+0.13z)| lr 2.50e-04 | 4237.08 ms | 31.9% bf16 MFU | 125085 tok/s step 12557/19560 | loss 3.358332 (+0.06z)| norm 0.2396 (-0.88z)| lr 2.50e-04 | 4170.33 ms | 32.4% bf16 MFU | 125116 tok/s step 12558/19560 | loss 3.335648 (-0.52z)| norm 0.2486 (-0.28z)| lr 2.50e-04 | 4167.44 ms | 32.4% bf16 MFU | 125151 tok/s step 12559/19560 | loss 3.334455 (-0.55z)| norm 0.2557 (+0.20z)| lr 2.50e-04 | 4159.79 ms | 32.5% bf16 MFU | 125195 tok/s step 12560/19560 | loss 3.398013 (+1.06z)| norm 0.2524 (-0.03z)| lr 2.49e-04 | 4166.07 ms | 32.4% bf16 MFU | 125228 tok/s step 12561/19560 | loss 3.389229 (+0.83z)| norm 0.2404 (-0.87z)| lr 2.49e-04 | 4172.64 ms | 32.4% bf16 MFU | 125249 tok/s step 12562/19560 | loss 3.351641 (-0.12z)| norm 0.2447 (-0.58z)| lr 2.49e-04 | 4161.62 ms | 32.4% bf16 MFU | 125286 tok/s step 12563/19560 | loss 3.360680 (+0.12z)| norm 0.2582 (+0.35z)| lr 2.49e-04 | 4164.68 ms | 32.4% bf16 MFU | 125316 tok/s step 12564/19560 | loss 3.359473 (+0.09z)| norm 0.2382 (-1.04z)| lr 2.49e-04 | 4173.36 ms | 32.4% bf16 MFU | 125331 tok/s step 12565/19560 | loss 3.491352 (+3.30z)| norm 0.2907 (+2.54z)| lr 2.49e-04 | 4176.47 ms | 32.3% bf16 MFU | 125341 tok/s step 12566/19560 | loss 3.415752 (+1.43z)| norm 0.2380 (-1.05z)| lr 2.49e-04 | 4162.16 ms | 32.4% bf16 MFU | 125373 tok/s step 12567/19560 | loss 3.410702 (+1.29z)| norm 0.2436 (-0.69z)| lr 2.49e-04 | 4166.17 ms | 
32.4% bf16 MFU | 125396 tok/s step 12568/19560 | loss 3.373212 (+0.38z)| norm 0.2993 (+3.00z)| lr 2.49e-04 | 4165.06 ms | 32.4% bf16 MFU | 125420 tok/s step 12569/19560 | loss 3.344094 (-0.33z)| norm 0.2726 (+1.22z)| lr 2.49e-04 | 4168.53 ms | 32.4% bf16 MFU | 125438 tok/s step 12570/19560 | loss 3.340965 (-0.40z)| norm 0.2511 (-0.20z)| lr 2.49e-04 | 4162.77 ms | 32.4% bf16 MFU | 125463 tok/s step 12571/19560 | loss 3.382337 (+0.61z)| norm 0.2691 (+1.00z)| lr 2.49e-04 | 4163.92 ms | 32.4% bf16 MFU | 125486 tok/s step 12572/19560 | loss 3.410793 (+1.29z)| norm 0.2453 (-0.58z)| lr 2.49e-04 | 4170.66 ms | 32.4% bf16 MFU | 125497 tok/s step 12573/19560 | loss 3.363925 (+0.14z)| norm 0.2508 (-0.23z)| lr 2.49e-04 | 4157.31 ms | 32.5% bf16 MFU | 125528 tok/s step 12574/19560 | loss 3.342762 (-0.38z)| norm 0.2544 (+0.02z)| lr 2.49e-04 | 4175.09 ms | 32.3% bf16 MFU | 125530 tok/s step 12575/19560 | loss 3.279977 (-1.88z)| norm 0.2490 (-0.36z)| lr 2.49e-04 | 4178.99 ms | 32.3% bf16 MFU | 125526 tok/s step 12576/19560 | loss 3.318537 (-0.94z)| norm 0.2538 (-0.03z)| lr 2.48e-04 | 4172.64 ms | 32.4% bf16 MFU | 125533 tok/s step 12577/19560 | loss 3.321928 (-0.85z)| norm 0.2428 (-0.77z)| lr 2.48e-04 | 4160.26 ms | 32.5% bf16 MFU | 125557 tok/s step 12578/19560 | loss 3.365358 (+0.22z)| norm 0.2376 (-1.11z)| lr 2.48e-04 | 4168.51 ms | 32.4% bf16 MFU | 125568 tok/s step 12579/19560 | loss 3.390669 (+0.83z)| norm 0.2398 (-0.95z)| lr 2.48e-04 | 4176.94 ms | 32.3% bf16 MFU | 125565 tok/s step 12580/19560 | loss 3.389717 (+0.80z)| norm 0.2431 (-0.71z)| lr 2.48e-04 | 4170.54 ms | 32.4% bf16 MFU | 125573 tok/s step 12581/19560 | loss 3.384530 (+0.67z)| norm 0.2461 (-0.50z)| lr 2.48e-04 | 4167.22 ms | 32.4% bf16 MFU | 125585 tok/s step 12582/19560 | loss 3.354598 (-0.07z)| norm 0.2382 (-1.02z)| lr 2.48e-04 | 4174.25 ms | 32.3% bf16 MFU | 125586 tok/s step 12583/19560 | loss 3.394742 (+0.96z)| norm 0.2450 (-0.56z)| lr 2.48e-04 | 4771.60 ms | 28.3% bf16 MFU | 124800 tok/s step 12584/19560 
| loss 3.359467 (+0.07z)| norm 0.2318 (-1.44z)| lr 2.48e-04 | 4174.01 ms | 32.3% bf16 MFU | 124841 tok/s step 12585/19560 | loss 3.381530 (+0.62z)| norm 0.2418 (-0.76z)| lr 2.48e-04 | 4166.76 ms | 32.4% bf16 MFU | 124890 tok/s step 12586/19560 | loss 3.399084 (+1.05z)| norm 0.2413 (-0.79z)| lr 2.48e-04 | 4174.91 ms | 32.3% bf16 MFU | 124924 tok/s step 12587/19560 | loss 3.403536 (+1.14z)| norm 0.2310 (-1.51z)| lr 2.48e-04 | 4165.37 ms | 32.4% bf16 MFU | 124972 tok/s step 12588/19560 | loss 3.398195 (+1.00z)| norm 0.2374 (-1.03z)| lr 2.48e-04 | 4166.85 ms | 32.4% bf16 MFU | 125014 tok/s step 12589/19560 | loss 3.305286 (-1.33z)| norm 0.2548 (+0.21z)| lr 2.48e-04 | 4176.57 ms | 32.3% bf16 MFU | 125040 tok/s step 12590/19560 | loss 3.389998 (+0.80z)| norm 0.2417 (-0.72z)| lr 2.48e-04 | 4163.86 ms | 32.4% bf16 MFU | 125084 tok/s step 12591/19560 | loss 3.352155 (-0.16z)| norm 0.2510 (-0.06z)| lr 2.48e-04 | 4169.11 ms | 32.4% bf16 MFU | 125117 tok/s step 12592/19560 | loss 3.321658 (-0.92z)| norm 0.2360 (-1.11z)| lr 2.47e-04 | 4176.15 ms | 32.3% bf16 MFU | 125139 tok/s step 12593/19560 | loss 3.338683 (-0.48z)| norm 0.2527 (+0.09z)| lr 2.47e-04 | 4175.17 ms | 32.3% bf16 MFU | 125160 tok/s step 12594/19560 | loss 3.381989 (+0.59z)| norm 0.2525 (+0.08z)| lr 2.47e-04 | 4177.30 ms | 32.3% bf16 MFU | 125178 tok/s step 12595/19560 | loss 3.345641 (-0.32z)| norm 0.2571 (+0.41z)| lr 2.47e-04 | 4169.31 ms | 32.4% bf16 MFU | 125206 tok/s step 12596/19560 | loss 3.307907 (-1.25z)| norm 0.2241 (-1.92z)| lr 2.47e-04 | 4193.94 ms | 32.2% bf16 MFU | 125196 tok/s step 12597/19560 | loss 3.375851 (+0.45z)| norm 0.2386 (-0.87z)| lr 2.47e-04 | 4159.18 ms | 32.5% bf16 MFU | 125239 tok/s step 12598/19560 | loss 3.312076 (-1.13z)| norm 0.2470 (-0.27z)| lr 2.47e-04 | 4164.58 ms | 32.4% bf16 MFU | 125272 tok/s step 12599/19560 | loss 3.368241 (+0.26z)| norm 0.2300 (-1.47z)| lr 2.47e-04 | 4167.79 ms | 32.4% bf16 MFU | 125298 tok/s step 12600/19560 | loss 3.383038 (+0.62z)| norm 0.2479 (-0.17z)| 
lr 2.47e-04 | 4173.13 ms | 32.4% bf16 MFU | 125315 tok/s step 12601/19560 | loss 3.344196 (-0.35z)| norm 0.2349 (-1.12z)| lr 2.47e-04 | 4172.58 ms | 32.4% bf16 MFU | 125332 tok/s step 12602/19560 | loss 3.393254 (+0.86z)| norm 0.2349 (-1.10z)| lr 2.47e-04 | 4187.75 ms | 32.2% bf16 MFU | 125325 tok/s step 12603/19560 | loss 3.356476 (-0.07z)| norm 0.2661 (+1.20z)| lr 2.47e-04 | 4170.17 ms | 32.4% bf16 MFU | 125345 tok/s step 12604/19560 | loss 3.365599 (+0.16z)| norm 0.2340 (-1.15z)| lr 2.47e-04 | 4174.03 ms | 32.3% bf16 MFU | 125358 tok/s step 12605/19560 | loss 3.336746 (-0.57z)| norm 0.2550 (+0.41z)| lr 2.47e-04 | 4166.17 ms | 32.4% bf16 MFU | 125382 tok/s step 12606/19560 | loss 3.330790 (-0.72z)| norm 0.2579 (+0.62z)| lr 2.47e-04 | 4172.45 ms | 32.4% bf16 MFU | 125396 tok/s step 12607/19560 | loss 3.361537 (+0.06z)| norm 0.2350 (-1.07z)| lr 2.47e-04 | 4171.85 ms | 32.4% bf16 MFU | 125410 tok/s step 12608/19560 | loss 3.356194 (-0.08z)| norm 0.2548 (+0.41z)| lr 2.46e-04 | 4168.09 ms | 32.4% bf16 MFU | 125429 tok/s step 12609/19560 | loss 3.427474 (+1.69z)| norm 0.2630 (+1.00z)| lr 2.46e-04 | 4167.42 ms | 32.4% bf16 MFU | 125448 tok/s step 12610/19560 | loss 3.326705 (-0.84z)| norm 0.2531 (+0.27z)| lr 2.46e-04 | 4180.56 ms | 32.3% bf16 MFU | 125446 tok/s step 12611/19560 | loss 3.324045 (-0.90z)| norm 0.3073 (+4.01z)| lr 2.46e-04 | 4168.92 ms | 32.4% bf16 MFU | 125461 tok/s step 12612/19560 | loss 3.328660 (-0.78z)| norm 0.2419 (-0.56z)| lr 2.46e-04 | 4187.32 ms | 32.2% bf16 MFU | 125449 tok/s step 12613/19560 | loss 3.344279 (-0.40z)| norm 0.2518 (+0.13z)| lr 2.46e-04 | 4166.80 ms | 32.4% bf16 MFU | 125468 tok/s step 12614/19560 | loss 3.333900 (-0.65z)| norm 0.2493 (-0.04z)| lr 2.46e-04 | 4173.26 ms | 32.4% bf16 MFU | 125476 tok/s step 12615/19560 | loss 3.379239 (+0.48z)| norm 0.2593 (+0.65z)| lr 2.46e-04 | 4169.30 ms | 32.4% bf16 MFU | 125489 tok/s step 12616/19560 | loss 3.345136 (-0.36z)| norm 0.2489 (-0.07z)| lr 2.46e-04 | 4160.21 ms | 32.5% bf16 MFU | 
125516 tok/s step 12617/19560 | loss 3.406939 (+1.20z)| norm 0.2556 (+0.39z)| lr 2.46e-04 | 4169.51 ms | 32.4% bf16 MFU | 125528 tok/s step 12618/19560 | loss 3.378913 (+0.49z)| norm 0.2424 (-0.54z)| lr 2.46e-04 | 4172.76 ms | 32.4% bf16 MFU | 125533 tok/s step 12619/19560 | loss 3.392365 (+0.83z)| norm 0.2509 (+0.05z)| lr 2.46e-04 | 4167.14 ms | 32.4% bf16 MFU | 125548 tok/s step 12620/19560 | loss 3.439895 (+1.99z)| norm 0.2420 (-0.57z)| lr 2.46e-04 | 4169.10 ms | 32.4% bf16 MFU | 125558 tok/s step 12621/19560 | loss 3.361917 (+0.03z)| norm 0.2611 (+0.77z)| lr 2.46e-04 | 4181.79 ms | 32.3% bf16 MFU | 125549 tok/s step 12622/19560 | loss 3.274792 (-2.11z)| norm 0.2506 (+0.02z)| lr 2.46e-04 | 4161.44 ms | 32.4% bf16 MFU | 125571 tok/s step 12623/19560 | loss 3.365109 (+0.12z)| norm 0.2423 (-0.57z)| lr 2.46e-04 | 4166.63 ms | 32.4% bf16 MFU | 125584 tok/s step 12624/19560 | loss 3.427777 (+1.64z)| norm 0.2450 (-0.37z)| lr 2.45e-04 | 4168.67 ms | 32.4% bf16 MFU | 125593 tok/s step 12625/19560 | loss 3.306984 (-1.30z)| norm 0.2575 (+0.51z)| lr 2.45e-04 | 4165.96 ms | 32.4% bf16 MFU | 125606 tok/s step 12626/19560 | loss 3.354540 (-0.14z)| norm 0.2552 (+0.34z)| lr 2.45e-04 | 4174.72 ms | 32.3% bf16 MFU | 125605 tok/s step 12627/19560 | loss 3.335000 (-0.63z)| norm 0.2383 (-0.87z)| lr 2.45e-04 | 4168.38 ms | 32.4% bf16 MFU | 125613 tok/s step 12628/19560 | loss 3.348591 (-0.28z)| norm 0.2598 (+0.66z)| lr 2.45e-04 | 4173.26 ms | 32.4% bf16 MFU | 125614 tok/s step 12629/19560 | loss 3.332603 (-0.71z)| norm 0.2490 (-0.11z)| lr 2.45e-04 | 4163.23 ms | 32.4% bf16 MFU | 125630 tok/s step 12630/19560 | loss 3.381039 (+0.56z)| norm 0.2707 (+1.41z)| lr 2.45e-04 | 4166.38 ms | 32.4% bf16 MFU | 125641 tok/s step 12631/19560 | loss 3.468920 (+2.76z)| norm 0.2767 (+1.80z)| lr 2.45e-04 | 4175.38 ms | 32.3% bf16 MFU | 125637 tok/s step 12632/19560 | loss 3.381987 (+0.53z)| norm 0.2465 (-0.32z)| lr 2.45e-04 | 4171.62 ms | 32.4% bf16 MFU | 125639 tok/s step 12633/19560 | loss 3.340047 
(-0.55z)| norm 0.2717 (+1.44z)| lr 2.45e-04 | 4180.26 ms | 32.3% bf16 MFU | 125628 tok/s step 12634/19560 | loss 3.360814 (-0.03z)| norm 0.2653 (+1.00z)| lr 2.45e-04 | 4172.31 ms | 32.4% bf16 MFU | 125630 tok/s step 12635/19560 | loss 3.316909 (-1.15z)| norm 0.2424 (-0.60z)| lr 2.45e-04 | 4166.90 ms | 32.4% bf16 MFU | 125639 tok/s step 12636/19560 | loss 3.344450 (-0.43z)| norm 0.2901 (+2.66z)| lr 2.45e-04 | 4173.41 ms | 32.4% bf16 MFU | 125639 tok/s step 12637/19560 | loss 3.362683 (+0.05z)| norm 0.2423 (-0.61z)| lr 2.45e-04 | 4170.23 ms | 32.4% bf16 MFU | 125643 tok/s step 12638/19560 | loss 3.403963 (+1.12z)| norm 0.2513 (+0.02z)| lr 2.45e-04 | 4169.16 ms | 32.4% bf16 MFU | 125648 tok/s step 12639/19560 | loss 3.410318 (+1.26z)| norm 0.2465 (-0.30z)| lr 2.45e-04 | 4188.12 ms | 32.2% bf16 MFU | 125625 tok/s step 12640/19560 | loss 3.371068 (+0.24z)| norm 0.2616 (+0.73z)| lr 2.44e-04 | 4172.09 ms | 32.4% bf16 MFU | 125627 tok/s step 12641/19560 | loss 3.376199 (+0.37z)| norm 0.2586 (+0.53z)| lr 2.44e-04 | 4168.81 ms | 32.4% bf16 MFU | 125634 tok/s step 12642/19560 | loss 3.374669 (+0.32z)| norm 0.2555 (+0.32z)| lr 2.44e-04 | 4167.62 ms | 32.4% bf16 MFU | 125642 tok/s step 12643/19560 | loss 3.332680 (-0.76z)| norm 0.2493 (-0.11z)| lr 2.44e-04 | 4163.84 ms | 32.4% bf16 MFU | 125656 tok/s step 12644/19560 | loss 3.340118 (-0.56z)| norm 0.2540 (+0.21z)| lr 2.44e-04 | 4169.97 ms | 32.4% bf16 MFU | 125660 tok/s step 12645/19560 | loss 3.342832 (-0.49z)| norm 0.2425 (-0.59z)| lr 2.44e-04 | 4154.43 ms | 32.5% bf16 MFU | 125687 tok/s step 12646/19560 | loss 3.371093 (+0.25z)| norm 0.2480 (-0.21z)| lr 2.44e-04 | 4185.76 ms | 32.3% bf16 MFU | 125665 tok/s step 12647/19560 | loss 3.326408 (-0.94z)| norm 0.2525 (+0.10z)| lr 2.44e-04 | 4166.72 ms | 32.4% bf16 MFU | 125673 tok/s step 12648/19560 | loss 3.356720 (-0.13z)| norm 0.2464 (-0.33z)| lr 2.44e-04 | 4171.96 ms | 32.4% bf16 MFU | 125673 tok/s step 12649/19560 | loss 3.341547 (-0.53z)| norm 0.2618 (+0.74z)| lr 2.44e-04 | 
4171.07 ms | 32.4% bf16 MFU | 125674 tok/s step 12650/19560 | loss 3.376650 (+0.41z)| norm 0.2595 (+0.57z)| lr 2.44e-04 | 4168.76 ms | 32.4% bf16 MFU | 125679 tok/s step 12651/19560 | loss 3.362944 (+0.04z)| norm 0.2368 (-0.99z)| lr 2.44e-04 | 4167.42 ms | 32.4% bf16 MFU | 125685 tok/s step 12652/19560 | loss 3.370412 (+0.24z)| norm 0.2614 (+0.70z)| lr 2.44e-04 | 4171.55 ms | 32.4% bf16 MFU | 125685 tok/s step 12653/19560 | loss 3.358245 (-0.09z)| norm 0.2339 (-1.18z)| lr 2.44e-04 | 4164.70 ms | 32.4% bf16 MFU | 125695 tok/s step 12654/19560 | loss 3.379727 (+0.48z)| norm 0.2423 (-0.62z)| lr 2.44e-04 | 4164.92 ms | 32.4% bf16 MFU | 125704 tok/s step 12655/19560 | loss 3.354191 (-0.22z)| norm 0.2375 (-0.94z)| lr 2.44e-04 | 4157.33 ms | 32.5% bf16 MFU | 125725 tok/s step 12656/19560 | loss 3.351569 (-0.28z)| norm 0.2276 (-1.61z)| lr 2.43e-04 | 4171.64 ms | 32.4% bf16 MFU | 125722 tok/s step 12657/19560 | loss 3.354812 (-0.20z)| norm 0.2358 (-1.04z)| lr 2.43e-04 | 4178.58 ms | 32.3% bf16 MFU | 125710 tok/s step 12658/19560 | loss 3.282907 (-2.10z)| norm 0.2301 (-1.40z)| lr 2.43e-04 | 4172.97 ms | 32.4% bf16 MFU | 125706 tok/s step 12659/19560 | loss 3.350968 (-0.29z)| norm 0.2605 (+0.65z)| lr 2.43e-04 | 4166.14 ms | 32.4% bf16 MFU | 125713 tok/s step 12660/19560 | loss 3.368026 (+0.16z)| norm 0.2331 (-1.20z)| lr 2.43e-04 | 4166.89 ms | 32.4% bf16 MFU | 125719 tok/s step 12661/19560 | loss 3.350043 (-0.33z)| norm 0.2327 (-1.22z)| lr 2.43e-04 | 4172.63 ms | 32.4% bf16 MFU | 125715 tok/s step 12662/19560 | loss 3.337893 (-0.65z)| norm 0.2480 (-0.19z)| lr 2.43e-04 | 4166.97 ms | 32.4% bf16 MFU | 125720 tok/s step 12663/19560 | loss 3.363065 (+0.00z)| norm 0.2501 (-0.05z)| lr 2.43e-04 | 4172.22 ms | 32.4% bf16 MFU | 125718 tok/s step 12664/19560 | loss 3.339840 (-0.64z)| norm 0.2515 (+0.07z)| lr 2.43e-04 | 4163.55 ms | 32.4% bf16 MFU | 125728 tok/s step 12665/19560 | loss 3.431332 (+1.84z)| norm 0.2561 (+0.40z)| lr 2.43e-04 | 4162.87 ms | 32.4% bf16 MFU | 125739 tok/s step 
12666/19560 | loss 3.356845 (-0.18z)| norm 0.2415 (-0.61z)| lr 2.43e-04 | 4170.02 ms | 32.4% bf16 MFU | 125738 tok/s step 12667/19560 | loss 3.357965 (-0.14z)| norm 0.2492 (-0.06z)| lr 2.43e-04 | 4160.53 ms | 32.5% bf16 MFU | 125752 tok/s step 12668/19560 | loss 3.338423 (-0.67z)| norm 0.2505 (+0.04z)| lr 2.43e-04 | 4170.54 ms | 32.4% bf16 MFU | 125750 tok/s step 12669/19560 | loss 3.359157 (-0.09z)| norm 0.2537 (+0.25z)| lr 2.43e-04 | 4168.31 ms | 32.4% bf16 MFU | 125751 tok/s step 12670/19560 | loss 3.397627 (+0.96z)| norm 0.2350 (-1.07z)| lr 2.43e-04 | 4161.90 ms | 32.4% bf16 MFU | 125763 tok/s step 12671/19560 | loss 3.350359 (-0.32z)| norm 0.2520 (+0.16z)| lr 2.43e-04 | 4164.07 ms | 32.4% bf16 MFU | 125770 tok/s step 12672/19560 | loss 3.373098 (+0.31z)| norm 0.2489 (-0.07z)| lr 2.42e-04 | 4172.19 ms | 32.4% bf16 MFU | 125764 tok/s step 12673/19560 | loss 3.380164 (+0.49z)| norm 0.2472 (-0.19z)| lr 2.42e-04 | 4168.08 ms | 32.4% bf16 MFU | 125766 tok/s step 12674/19560 | loss 3.406507 (+1.22z)| norm 0.2588 (+0.63z)| lr 2.42e-04 | 4163.28 ms | 32.4% bf16 MFU | 125774 tok/s step 12675/19560 | loss 3.392093 (+0.80z)| norm 0.2669 (+1.21z)| lr 2.42e-04 | 4165.05 ms | 32.4% bf16 MFU | 125779 tok/s step 12676/19560 | loss 3.319211 (-1.23z)| norm 0.2422 (-0.59z)| lr 2.42e-04 | 4165.10 ms | 32.4% bf16 MFU | 125784 tok/s step 12677/19560 | loss 3.317940 (-1.24z)| norm 0.2680 (+1.36z)| lr 2.42e-04 | 4175.63 ms | 32.3% bf16 MFU | 125773 tok/s step 12678/19560 | loss 3.363712 (+0.02z)| norm 0.2448 (-0.39z)| lr 2.42e-04 | 4163.31 ms | 32.4% bf16 MFU | 125781 tok/s step 12679/19560 | loss 3.348835 (-0.38z)| norm 0.2672 (+1.28z)| lr 2.42e-04 | 4160.83 ms | 32.4% bf16 MFU | 125792 tok/s step 12680/19560 | loss 3.369635 (+0.20z)| norm 0.2553 (+0.38z)| lr 2.42e-04 | 4174.77 ms | 32.3% bf16 MFU | 125781 tok/s step 12681/19560 | loss 3.408321 (+1.33z)| norm 0.2722 (+1.62z)| lr 2.42e-04 | 4162.97 ms | 32.4% bf16 MFU | 125789 tok/s step 12682/19560 | loss 3.345523 (-0.50z)| norm 
0.2596 (+0.68z)| lr 2.42e-04 | 4164.02 ms | 32.4% bf16 MFU | 125795 tok/s step 12683/19560 | loss 3.372675 (+0.30z)| norm 0.2294 (-1.54z)| lr 2.42e-04 | 4162.32 ms | 32.4% bf16 MFU | 125804 tok/s step 12684/19560 | loss 3.375803 (+0.38z)| norm 0.2428 (-0.54z)| lr 2.42e-04 | 4163.67 ms | 32.4% bf16 MFU | 125809 tok/s step 12685/19560 | loss 3.392566 (+0.86z)| norm 0.2316 (-1.35z)| lr 2.42e-04 | 4158.77 ms | 32.5% bf16 MFU | 125822 tok/s step 12686/19560 | loss 3.359391 (-0.11z)| norm 0.2576 (+0.54z)| lr 2.42e-04 | 4165.60 ms | 32.4% bf16 MFU | 125824 tok/s step 12687/19560 | loss 3.321791 (-1.21z)| norm 0.2282 (-1.57z)| lr 2.42e-04 | 4157.96 ms | 32.5% bf16 MFU | 125838 tok/s step 12688/19560 | loss 3.404505 (+1.21z)| norm 0.2587 (+0.63z)| lr 2.42e-04 | 4174.79 ms | 32.3% bf16 MFU | 125825 tok/s step 12689/19560 | loss 3.394176 (+0.90z)| norm 0.2742 (+1.72z)| lr 2.41e-04 | 4162.61 ms | 32.4% bf16 MFU | 125831 tok/s step 12690/19560 | loss 3.362222 (-0.03z)| norm 0.2495 (-0.05z)| lr 2.41e-04 | 4158.09 ms | 32.5% bf16 MFU | 125844 tok/s step 12691/19560 | loss 3.508950 (+3.95z)| norm 0.2587 (+0.60z)| lr 2.41e-04 | 4327.98 ms | 31.2% bf16 MFU | 125609 tok/s step 12692/19560 | loss 3.307212 (-1.54z)| norm 0.2442 (-0.44z)| lr 2.41e-04 | 4549.33 ms | 29.7% bf16 MFU | 125091 tok/s step 12693/19560 | loss 3.396621 (+0.94z)| norm 0.2497 (-0.02z)| lr 2.41e-04 | 4632.69 ms | 29.1% bf16 MFU | 124495 tok/s step 12694/19560 | loss 3.328219 (-0.98z)| norm 0.2121 (-2.73z)| lr 2.41e-04 | 4297.31 ms | 31.4% bf16 MFU | 124370 tok/s step 12695/19560 | loss 3.405004 (+1.21z)| norm 0.2598 (+0.71z)| lr 2.41e-04 | 4427.53 ms | 30.5% bf16 MFU | 124073 tok/s step 12696/19560 | loss 3.312138 (-1.42z)| norm 0.2392 (-0.78z)| lr 2.41e-04 | 4504.90 ms | 30.0% bf16 MFU | 123688 tok/s step 12697/19560 | loss 3.318005 (-1.24z)| norm 0.2536 (+0.33z)| lr 2.41e-04 | 4235.41 ms | 31.9% bf16 MFU | 123693 tok/s step 12698/19560 | loss 3.346524 (-0.44z)| norm 0.2428 (-0.49z)| lr 2.41e-04 | 4268.36 ms | 
31.6% bf16 MFU | 123650 tok/s step 12699/19560 | loss 3.349192 (-0.35z)| norm 0.2571 (+0.61z)| lr 2.41e-04 | 4161.17 ms | 32.4% bf16 MFU | 123767 tok/s step 12700/19560 | loss 3.325027 (-1.02z)| norm 0.2465 (-0.20z)| lr 2.41e-04 | 4197.04 ms | 32.2% bf16 MFU | 123825 tok/s step 12701/19560 | loss 3.341268 (-0.56z)| norm 0.2477 (-0.11z)| lr 2.41e-04 | 4214.07 ms | 32.0% bf16 MFU | 123854 tok/s step 12702/19560 | loss 3.337033 (-0.67z)| norm 0.2639 (+1.13z)| lr 2.41e-04 | 4235.46 ms | 31.9% bf16 MFU | 123851 tok/s step 12703/19560 | loss 3.365947 (+0.13z)| norm 0.2498 (+0.04z)| lr 2.41e-04 | 4212.56 ms | 32.1% bf16 MFU | 123881 tok/s step 12704/19560 | loss 3.354546 (-0.21z)| norm 0.2535 (+0.33z)| lr 2.41e-04 | 4211.29 ms | 32.1% bf16 MFU | 123912 tok/s step 12705/19560 | loss 3.312515 (-1.43z)| norm 0.2547 (+0.42z)| lr 2.40e-04 | 4160.98 ms | 32.4% bf16 MFU | 124016 tok/s step 12706/19560 | loss 3.370203 (+0.25z)| norm 0.2421 (-0.56z)| lr 2.40e-04 | 4164.95 ms | 32.4% bf16 MFU | 124110 tok/s step 12707/19560 | loss 3.367471 (+0.17z)| norm 0.2632 (+1.05z)| lr 2.40e-04 | 4164.27 ms | 32.4% bf16 MFU | 124199 tok/s step 12708/19560 | loss 3.354002 (-0.21z)| norm 0.2567 (+0.55z)| lr 2.40e-04 | 4287.35 ms | 31.5% bf16 MFU | 124104 tok/s step 12709/19560 | loss 3.381529 (+0.59z)| norm 0.2409 (-0.67z)| lr 2.40e-04 | 4166.54 ms | 32.4% bf16 MFU | 124190 tok/s step 12710/19560 | loss 3.300509 (-1.74z)| norm 0.2625 (+0.98z)| lr 2.40e-04 | 4168.31 ms | 32.4% bf16 MFU | 124269 tok/s step 12711/19560 | loss 3.324229 (-1.04z)| norm 0.2426 (-0.55z)| lr 2.40e-04 | 4173.07 ms | 32.4% bf16 MFU | 124338 tok/s step 12712/19560 | loss 3.325546 (-0.99z)| norm 0.2280 (-1.67z)| lr 2.40e-04 | 4168.39 ms | 32.4% bf16 MFU | 124410 tok/s step 12713/19560 | loss 3.377615 (+0.51z)| norm 0.2723 (+1.69z)| lr 2.40e-04 | 4171.22 ms | 32.4% bf16 MFU | 124474 tok/s step 12714/19560 | loss 3.367071 (+0.21z)| norm 0.2380 (-0.90z)| lr 2.40e-04 | 4168.79 ms | 32.4% bf16 MFU | 124538 tok/s step 12715/19560 
| loss 3.298222 (-1.74z)| norm 0.2632 (+0.99z)| lr 2.40e-04 | 4168.28 ms | 32.4% bf16 MFU | 124600 tok/s step 12716/19560 | loss 3.269400 (-2.49z)| norm 0.2555 (+0.39z)| lr 2.40e-04 | 4173.19 ms | 32.4% bf16 MFU | 124652 tok/s step 12717/19560 | loss 3.370323 (+0.34z)| norm 0.2467 (-0.27z)| lr 2.40e-04 | 4168.74 ms | 32.4% bf16 MFU | 124708 tok/s step 12718/19560 | loss 3.331688 (-0.75z)| norm 0.2660 (+1.18z)| lr 2.40e-04 | 4162.32 ms | 32.4% bf16 MFU | 124770 tok/s step 12719/19560 | loss 3.403200 (+1.26z)| norm 0.2673 (+1.26z)| lr 2.40e-04 | 4158.80 ms | 32.5% bf16 MFU | 124835 tok/s step 12720/19560 | loss 3.354185 (-0.13z)| norm 0.2691 (+1.38z)| lr 2.40e-04 | 4172.28 ms | 32.4% bf16 MFU | 124877 tok/s step 12721/19560 | loss 3.321645 (-1.04z)| norm 0.2785 (+2.03z)| lr 2.39e-04 | 4168.53 ms | 32.4% bf16 MFU | 124921 tok/s step 12722/19560 | loss 3.316212 (-1.17z)| norm 0.2690 (+1.31z)| lr 2.39e-04 | 4162.76 ms | 32.4% bf16 MFU | 124973 tok/s step 12723/19560 | loss 3.332272 (-0.72z)| norm 0.2704 (+1.40z)| lr 2.39e-04 | 4155.20 ms | 32.5% bf16 MFU | 125033 tok/s step 12724/19560 | loss 3.407286 (+1.37z)| norm 0.2658 (+1.05z)| lr 2.39e-04 | 4159.45 ms | 32.5% bf16 MFU | 125084 tok/s step 12725/19560 | loss 3.377844 (+0.54z)| norm 0.2760 (+1.77z)| lr 2.39e-04 | 4169.12 ms | 32.4% bf16 MFU | 125117 tok/s step 12726/19560 | loss 3.316532 (-1.18z)| norm 0.2391 (-0.93z)| lr 2.39e-04 | 4166.70 ms | 32.4% bf16 MFU | 125153 tok/s step 12727/19560 | loss 3.377319 (+0.52z)| norm 0.2787 (+1.93z)| lr 2.39e-04 | 4160.72 ms | 32.5% bf16 MFU | 125195 tok/s step 12728/19560 | loss 3.330389 (-0.78z)| norm 0.2480 (-0.31z)| lr 2.39e-04 | 4167.50 ms | 32.4% bf16 MFU | 125226 tok/s step 12729/19560 | loss 3.379426 (+0.58z)| norm 0.2606 (+0.60z)| lr 2.39e-04 | 4223.94 ms | 32.0% bf16 MFU | 125171 tok/s step 12730/19560 | loss 3.294209 (-1.77z)| norm 0.2506 (-0.14z)| lr 2.39e-04 | 4178.31 ms | 32.3% bf16 MFU | 125186 tok/s step 12731/19560 | loss 3.344208 (-0.38z)| norm 0.2920 (+2.81z)| 
lr 2.39e-04 | 4172.67 ms | 32.4% bf16 MFU | 125209 tok/s step 12732/19560 | loss 3.335445 (-0.61z)| norm 0.2442 (-0.62z)| lr 2.39e-04 | 4169.93 ms | 32.4% bf16 MFU | 125235 tok/s step 12733/19560 | loss 3.322165 (-0.98z)| norm 0.2964 (+3.00z)| lr 2.39e-04 | 4176.20 ms | 32.3% bf16 MFU | 125251 tok/s step 12734/19560 | loss 3.387522 (+0.82z)| norm 0.2498 (-0.23z)| lr 2.39e-04 | 4165.22 ms | 32.4% bf16 MFU | 125282 tok/s step 12735/19560 | loss 3.496305 (+3.60z)| norm 0.2763 (+1.58z)| lr 2.39e-04 | 4167.84 ms | 32.4% bf16 MFU | 125307 tok/s step 12736/19560 | loss 3.315548 (-1.12z)| norm 0.2619 (+0.58z)| lr 2.39e-04 | 4161.76 ms | 32.4% bf16 MFU | 125341 tok/s step 12737/19560 | loss 3.343193 (-0.39z)| norm 0.3092 (+3.62z)| lr 2.38e-04 | 4163.10 ms | 32.4% bf16 MFU | 125371 tok/s step 12738/19560 | loss 3.334652 (-0.62z)| norm 0.2483 (-0.36z)| lr 2.38e-04 | 4166.07 ms | 32.4% bf16 MFU | 125394 tok/s step 12739/19560 | loss 3.341901 (-0.43z)| norm 0.2623 (+0.61z)| lr 2.38e-04 | 4178.73 ms | 32.3% bf16 MFU | 125398 tok/s step 12740/19560 | loss 3.411732 (+1.40z)| norm 0.2534 (-0.01z)| lr 2.38e-04 | 4170.49 ms | 32.4% bf16 MFU | 125414 tok/s step 12741/19560 | loss 3.340728 (-0.47z)| norm 0.2704 (+1.15z)| lr 2.38e-04 | 4212.63 ms | 32.1% bf16 MFU | 125366 tok/s step 12742/19560 | loss 3.368421 (+0.25z)| norm 0.2645 (+0.73z)| lr 2.38e-04 | 4158.13 ms | 32.5% bf16 MFU | 125402 tok/s step 12743/19560 | loss 3.374627 (+0.41z)| norm 0.2552 (+0.10z)| lr 2.38e-04 | 4165.41 ms | 32.4% bf16 MFU | 125425 tok/s step 12744/19560 | loss 3.462461 (+2.64z)| norm 0.3095 (+3.59z)| lr 2.38e-04 | 4188.97 ms | 32.2% bf16 MFU | 125412 tok/s step 12745/19560 | loss 3.346020 (-0.35z)| norm 0.2433 (-0.70z)| lr 2.38e-04 | 4180.22 ms | 32.3% bf16 MFU | 125412 tok/s step 12746/19560 | loss 3.305531 (-1.37z)| norm 0.2622 (+0.51z)| lr 2.38e-04 | 4161.91 ms | 32.4% bf16 MFU | 125440 tok/s step 12747/19560 | loss 3.376746 (+0.46z)| norm 0.3057 (+3.17z)| lr 2.38e-04 | 4164.73 ms | 32.4% bf16 MFU | 
125463 tok/s step 12748/19560 | loss 3.310899 (-1.22z)| norm 0.2491 (-0.36z)| lr 2.38e-04 | 4193.91 ms | 32.2% bf16 MFU | 125440 tok/s step 12749/19560 | loss 3.271367 (-2.19z)| norm 0.2369 (-1.10z)| lr 2.38e-04 | 4169.82 ms | 32.4% bf16 MFU | 125455 tok/s step 12750/19560 | loss 3.333879 (-0.61z)| norm 0.2610 (+0.39z)| lr 2.38e-04 | 4161.88 ms | 32.4% bf16 MFU | 125481 tok/s val loss 3.322140 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 
evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 
evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2967/10042 = 0.295459 step 12751/19560 | loss 3.328621 (-0.74z)| norm 0.2354 (-1.19z)| lr 2.38e-04 | 4263.75 ms | 31.7% bf16 MFU | 125355 tok/s step 12752/19560 | loss 3.387967 (+0.81z)| norm 0.2558 (+0.07z)| lr 2.38e-04 | 4199.80 ms | 32.1% bf16 MFU | 125329 tok/s step 12753/19560 | loss 3.304965 (-1.36z)| norm 0.2344 (-1.24z)| lr 2.37e-04 | 4170.89 ms | 32.4% bf16 MFU | 125348 tok/s step 12754/19560 | loss 3.315614 (-1.07z)| norm 0.2323 (-1.35z)| lr 2.37e-04 | 4195.09 ms | 32.2% bf16 MFU | 125329 tok/s step 12755/19560 | loss 3.368351 (+0.30z)| norm 0.2610 (+0.40z)| lr 2.37e-04 | 4203.05 ms | 32.1% bf16 MFU | 125300 tok/s step 12756/19560 | loss 3.377742 (+0.54z)| norm 0.2431 (-0.69z)| lr 2.37e-04 | 4203.50 ms | 32.1% bf16 MFU | 125271 tok/s step 12757/19560 | loss 3.352569 (-0.12z)| norm 0.2692 (+0.89z)| lr 2.37e-04 | 4282.08 ms | 31.5% bf16 MFU | 125129 tok/s step 12758/19560 | loss 3.368605 (+0.30z)| norm 0.2367 (-1.07z)| lr 2.37e-04 | 4161.24 ms | 32.4% bf16 MFU | 125173 tok/s step 12759/19560 | loss 3.301545 (-1.45z)| norm 0.2859 (+1.92z)| lr 2.37e-04 | 4215.71 ms | 32.0% bf16 MFU | 125132 tok/s step 12760/19560 | loss 3.301277 (-1.44z)| norm 0.2289 (-1.52z)| lr 2.37e-04 | 4324.76 ms | 31.2% bf16 MFU | 124937 tok/s step 12761/19560 | loss 3.315442 (-1.05z)| norm 0.2442 (-0.59z)| lr 2.37e-04 | 4173.78 ms | 32.3% bf16 MFU | 124971 tok/s step 12762/19560 | loss 3.327757 (-0.71z)| norm 0.2451 (-0.53z)| lr 2.37e-04 | 4171.43 ms | 32.4% bf16 MFU | 125007 tok/s step 12763/19560 | loss 3.364363 (+0.25z)| norm 0.2386 (-0.92z)| lr 2.37e-04 | 4212.73 ms | 32.0% bf16 MFU | 124979 tok/s step 12764/19560 | loss 3.356070 (+0.02z)| norm 0.2616 
(+0.49z)| lr 2.37e-04 | 4167.98 ms | 32.4% bf16 MFU | 125020 tok/s step 12765/19560 | loss 3.269490 (-2.22z)| norm 0.2332 (-1.24z)| lr 2.37e-04 | 4166.04 ms | 32.4% bf16 MFU | 125061 tok/s step 12766/19560 | loss 3.326566 (-0.71z)| norm 0.2417 (-0.72z)| lr 2.37e-04 | 4185.23 ms | 32.3% bf16 MFU | 125071 tok/s step 12767/19560 | loss 3.313203 (-1.05z)| norm 0.2622 (+0.53z)| lr 2.37e-04 | 4180.14 ms | 32.3% bf16 MFU | 125089 tok/s step 12768/19560 | loss 3.293473 (-1.54z)| norm 0.2602 (+0.40z)| lr 2.37e-04 | 4164.22 ms | 32.4% bf16 MFU | 125130 tok/s step 12769/19560 | loss 3.317387 (-0.90z)| norm 0.2469 (-0.40z)| lr 2.36e-04 | 4167.23 ms | 32.4% bf16 MFU | 125164 tok/s step 12770/19560 | loss 3.332337 (-0.50z)| norm 0.2487 (-0.29z)| lr 2.36e-04 | 4171.03 ms | 32.4% bf16 MFU | 125191 tok/s step 12771/19560 | loss 3.336377 (-0.40z)| norm 0.2476 (-0.36z)| lr 2.36e-04 | 4236.71 ms | 31.9% bf16 MFU | 125118 tok/s step 12772/19560 | loss 3.358997 (+0.19z)| norm 0.2307 (-1.37z)| lr 2.36e-04 | 4319.36 ms | 31.3% bf16 MFU | 124932 tok/s step 12773/19560 | loss 3.404358 (+1.35z)| norm 0.2545 (+0.07z)| lr 2.36e-04 | 4215.03 ms | 32.0% bf16 MFU | 124904 tok/s step 12774/19560 | loss 3.325277 (-0.69z)| norm 0.2389 (-0.87z)| lr 2.36e-04 | 4159.84 ms | 32.5% bf16 MFU | 124961 tok/s step 12775/19560 | loss 3.370860 (+0.48z)| norm 0.2406 (-0.76z)| lr 2.36e-04 | 4209.21 ms | 32.1% bf16 MFU | 124941 tok/s step 12776/19560 | loss 3.329238 (-0.59z)| norm 0.2447 (-0.51z)| lr 2.36e-04 | 4161.37 ms | 32.4% bf16 MFU | 124993 tok/s step 12777/19560 | loss 3.328001 (-0.62z)| norm 0.2595 (+0.38z)| lr 2.36e-04 | 4178.53 ms | 32.3% bf16 MFU | 125017 tok/s step 12778/19560 | loss 3.467623 (+2.87z)| norm 0.2576 (+0.27z)| lr 2.36e-04 | 4163.09 ms | 32.4% bf16 MFU | 125063 tok/s step 12779/19560 | loss 3.323176 (-0.73z)| norm 0.2437 (-0.57z)| lr 2.36e-04 | 4168.12 ms | 32.4% bf16 MFU | 125099 tok/s step 12780/19560 | loss 3.366674 (+0.36z)| norm 0.2285 (-1.46z)| lr 2.36e-04 | 4165.48 ms | 32.4% bf16 
MFU | 125137 tok/s step 12781/19560 | loss 3.353429 (+0.03z)| norm 0.2443 (-0.52z)| lr 2.36e-04 | 4171.37 ms | 32.4% bf16 MFU | 125165 tok/s step 12782/19560 | loss 3.316125 (-0.89z)| norm 0.2320 (-1.26z)| lr 2.36e-04 | 4169.43 ms | 32.4% bf16 MFU | 125194 tok/s step 12783/19560 | loss 3.322887 (-0.71z)| norm 0.2438 (-0.55z)| lr 2.36e-04 | 4165.42 ms | 32.4% bf16 MFU | 125228 tok/s step 12784/19560 | loss 3.318334 (-0.82z)| norm 0.2359 (-1.04z)| lr 2.36e-04 | 4171.04 ms | 32.4% bf16 MFU | 125251 tok/s step 12785/19560 | loss 3.288385 (-1.54z)| norm 0.2396 (-0.81z)| lr 2.35e-04 | 4157.94 ms | 32.5% bf16 MFU | 125293 tok/s step 12786/19560 | loss 3.404635 (+1.31z)| norm 0.2351 (-1.09z)| lr 2.35e-04 | 4158.30 ms | 32.5% bf16 MFU | 125333 tok/s step 12787/19560 | loss 3.300544 (-1.25z)| norm 0.2525 (-0.03z)| lr 2.35e-04 | 4167.96 ms | 32.4% bf16 MFU | 125356 tok/s step 12788/19560 | loss 3.392087 (+0.99z)| norm 0.2508 (-0.15z)| lr 2.35e-04 | 4168.86 ms | 32.4% bf16 MFU | 125376 tok/s step 12789/19560 | loss 3.266635 (-2.03z)| norm 0.2452 (-0.49z)| lr 2.35e-04 | 4174.58 ms | 32.3% bf16 MFU | 125387 tok/s step 12790/19560 | loss 3.341198 (-0.23z)| norm 0.2483 (-0.30z)| lr 2.35e-04 | 4166.03 ms | 32.4% bf16 MFU | 125410 tok/s step 12791/19560 | loss 3.364745 (+0.33z)| norm 0.2379 (-0.93z)| lr 2.35e-04 | 4167.97 ms | 32.4% bf16 MFU | 125429 tok/s step 12792/19560 | loss 3.317538 (-0.80z)| norm 0.2497 (-0.21z)| lr 2.35e-04 | 4173.62 ms | 32.4% bf16 MFU | 125438 tok/s step 12793/19560 | loss 3.423464 (+1.75z)| norm 0.2535 (+0.02z)| lr 2.35e-04 | 4175.34 ms | 32.3% bf16 MFU | 125445 tok/s step 12794/19560 | loss 3.363783 (+0.31z)| norm 0.2554 (+0.13z)| lr 2.35e-04 | 4175.18 ms | 32.3% bf16 MFU | 125451 tok/s step 12795/19560 | loss 3.326290 (-0.58z)| norm 0.2663 (+0.79z)| lr 2.35e-04 | 4160.82 ms | 32.4% bf16 MFU | 125479 tok/s step 12796/19560 | loss 3.336150 (-0.35z)| norm 0.2608 (+0.45z)| lr 2.35e-04 | 4169.83 ms | 32.4% bf16 MFU | 125492 tok/s step 12797/19560 | loss 
3.335623 (-0.35z)| norm 0.2592 (+0.35z)| lr 2.35e-04 | 4192.32 ms | 32.2% bf16 MFU | 125470 tok/s step 12798/19560 | loss 3.268466 (-1.92z)| norm 0.2566 (+0.18z)| lr 2.35e-04 | 4195.01 ms | 32.2% bf16 MFU | 125445 tok/s step 12799/19560 | loss 3.293201 (-1.32z)| norm 0.2446 (-0.55z)| lr 2.35e-04 | 4171.94 ms | 32.4% bf16 MFU | 125457 tok/s step 12800/19560 | loss 3.317990 (-0.72z)| norm 0.2476 (-0.37z)| lr 2.35e-04 | 4278.19 ms | 31.6% bf16 MFU | 125311 tok/s step 12801/19560 | loss 3.344039 (-0.10z)| norm 0.2552 (+0.10z)| lr 2.35e-04 | 4170.49 ms | 32.4% bf16 MFU | 125331 tok/s step 12802/19560 | loss 3.283472 (-1.51z)| norm 0.2391 (-0.88z)| lr 2.34e-04 | 4178.93 ms | 32.3% bf16 MFU | 125338 tok/s step 12803/19560 | loss 3.310153 (-0.86z)| norm 0.2458 (-0.47z)| lr 2.34e-04 | 4163.97 ms | 32.4% bf16 MFU | 125366 tok/s step 12804/19560 | loss 3.253083 (-2.16z)| norm 0.2338 (-1.19z)| lr 2.34e-04 | 4165.55 ms | 32.4% bf16 MFU | 125391 tok/s step 12805/19560 | loss 3.311902 (-0.79z)| norm 0.2332 (-1.21z)| lr 2.34e-04 | 4174.85 ms | 32.3% bf16 MFU | 125401 tok/s step 12806/19560 | loss 3.380800 (+0.80z)| norm 0.2598 (+0.40z)| lr 2.34e-04 | 4164.14 ms | 32.4% bf16 MFU | 125426 tok/s step 12807/19560 | loss 3.306457 (-0.91z)| norm 0.2452 (-0.48z)| lr 2.34e-04 | 4164.19 ms | 32.4% bf16 MFU | 125450 tok/s step 12808/19560 | loss 3.328912 (-0.38z)| norm 0.2526 (-0.02z)| lr 2.34e-04 | 4162.44 ms | 32.4% bf16 MFU | 125475 tok/s step 12809/19560 | loss 3.398247 (+1.23z)| norm 0.2384 (-0.88z)| lr 2.34e-04 | 4169.21 ms | 32.4% bf16 MFU | 125489 tok/s step 12810/19560 | loss 3.301458 (-1.01z)| norm 0.2798 (+1.63z)| lr 2.34e-04 | 4168.59 ms | 32.4% bf16 MFU | 125503 tok/s step 12811/19560 | loss 3.337523 (-0.17z)| norm 0.2411 (-0.72z)| lr 2.34e-04 | 4163.45 ms | 32.4% bf16 MFU | 125524 tok/s step 12812/19560 | loss 3.290594 (-1.23z)| norm 0.2647 (+0.71z)| lr 2.34e-04 | 4176.67 ms | 32.3% bf16 MFU | 125525 tok/s step 12813/19560 | loss 3.347306 (+0.08z)| norm 0.2613 (+0.49z)| lr 
2.34e-04 | 4174.39 ms | 32.3% bf16 MFU | 125528 tok/s step 12814/19560 | loss 3.325966 (-0.41z)| norm 0.2360 (-1.05z)| lr 2.34e-04 | 4169.10 ms | 32.4% bf16 MFU | 125540 tok/s step 12815/19560 | loss 3.300339 (-0.99z)| norm 0.2311 (-1.35z)| lr 2.34e-04 | 4164.20 ms | 32.4% bf16 MFU | 125558 tok/s step 12816/19560 | loss 3.353457 (+0.24z)| norm 0.2657 (+0.76z)| lr 2.34e-04 | 4176.44 ms | 32.3% bf16 MFU | 125557 tok/s step 12817/19560 | loss 3.302904 (-0.92z)| norm 0.2472 (-0.35z)| lr 2.34e-04 | 4167.71 ms | 32.4% bf16 MFU | 125569 tok/s step 12818/19560 | loss 3.387129 (+1.04z)| norm 0.2464 (-0.40z)| lr 2.33e-04 | 4165.87 ms | 32.4% bf16 MFU | 125583 tok/s step 12819/19560 | loss 3.333793 (-0.18z)| norm 0.2544 (+0.09z)| lr 2.33e-04 | 4163.57 ms | 32.4% bf16 MFU | 125600 tok/s step 12820/19560 | loss 3.333879 (-0.18z)| norm 0.2506 (-0.15z)| lr 2.33e-04 | 4179.62 ms | 32.3% bf16 MFU | 125592 tok/s step 12821/19560 | loss 3.306658 (-0.85z)| norm 0.2785 (+1.54z)| lr 2.33e-04 | 4163.23 ms | 32.4% bf16 MFU | 125609 tok/s step 12822/19560 | loss 3.314502 (-0.65z)| norm 0.2538 (+0.01z)| lr 2.33e-04 | 4165.91 ms | 32.4% bf16 MFU | 125621 tok/s step 12823/19560 | loss 3.342603 (+0.07z)| norm 0.2574 (+0.24z)| lr 2.33e-04 | 4162.71 ms | 32.4% bf16 MFU | 125637 tok/s step 12824/19560 | loss 3.332858 (-0.18z)| norm 0.2541 (+0.03z)| lr 2.33e-04 | 4169.31 ms | 32.4% bf16 MFU | 125643 tok/s step 12825/19560 | loss 3.314400 (-0.65z)| norm 0.2706 (+1.05z)| lr 2.33e-04 | 4193.19 ms | 32.2% bf16 MFU | 125613 tok/s step 12826/19560 | loss 3.349099 (+0.23z)| norm 0.2649 (+0.68z)| lr 2.33e-04 | 4166.57 ms | 32.4% bf16 MFU | 125623 tok/s step 12827/19560 | loss 3.323955 (-0.40z)| norm 0.2441 (-0.61z)| lr 2.33e-04 | 4172.20 ms | 32.4% bf16 MFU | 125625 tok/s step 12828/19560 | loss 3.350774 (+0.27z)| norm 0.2542 (+0.02z)| lr 2.33e-04 | 4169.94 ms | 32.4% bf16 MFU | 125631 tok/s step 12829/19560 | loss 3.347207 (+0.18z)| norm 0.2536 (-0.02z)| lr 2.33e-04 | 4170.61 ms | 32.4% bf16 MFU | 125635 
tok/s step 12830/19560 | loss 3.328940 (-0.28z)| norm 0.2709 (+1.05z)| lr 2.33e-04 | 4174.98 ms | 32.3% bf16 MFU | 125632 tok/s step 12831/19560 | loss 3.288366 (-1.28z)| norm 0.2465 (-0.47z)| lr 2.33e-04 | 4187.58 ms | 32.2% bf16 MFU | 125610 tok/s step 12832/19560 | loss 3.310919 (-0.71z)| norm 0.2482 (-0.36z)| lr 2.33e-04 | 4171.13 ms | 32.4% bf16 MFU | 125614 tok/s step 12833/19560 | loss 3.382489 (+1.07z)| norm 0.2757 (+1.33z)| lr 2.33e-04 | 4167.20 ms | 32.4% bf16 MFU | 125624 tok/s step 12834/19560 | loss 3.354495 (+0.37z)| norm 0.2547 (+0.03z)| lr 2.32e-04 | 4159.16 ms | 32.5% bf16 MFU | 125646 tok/s step 12835/19560 | loss 3.370781 (+0.78z)| norm 0.2527 (-0.09z)| lr 2.32e-04 | 4168.34 ms | 32.4% bf16 MFU | 125653 tok/s step 12836/19560 | loss 3.338991 (-0.01z)| norm 0.2629 (+0.54z)| lr 2.32e-04 | 4165.24 ms | 32.4% bf16 MFU | 125664 tok/s step 12837/19560 | loss 3.306023 (-0.82z)| norm 0.2535 (-0.05z)| lr 2.32e-04 | 4168.32 ms | 32.4% bf16 MFU | 125669 tok/s step 12838/19560 | loss 3.337578 (-0.04z)| norm 0.2878 (+2.04z)| lr 2.32e-04 | 4188.47 ms | 32.2% bf16 MFU | 125645 tok/s step 12839/19560 | loss 3.330031 (-0.23z)| norm 0.2526 (-0.12z)| lr 2.32e-04 | 4172.19 ms | 32.4% bf16 MFU | 125646 tok/s step 12840/19560 | loss 3.333595 (-0.14z)| norm 0.2902 (+2.14z)| lr 2.32e-04 | 4169.30 ms | 32.4% bf16 MFU | 125651 tok/s step 12841/19560 | loss 3.320331 (-0.47z)| norm 0.2597 (+0.29z)| lr 2.32e-04 | 4162.69 ms | 32.4% bf16 MFU | 125666 tok/s step 12842/19560 | loss 3.328705 (-0.25z)| norm 0.2729 (+1.08z)| lr 2.32e-04 | 4164.77 ms | 32.4% bf16 MFU | 125677 tok/s step 12843/19560 | loss 3.313229 (-0.64z)| norm 0.2530 (-0.13z)| lr 2.32e-04 | 4168.83 ms | 32.4% bf16 MFU | 125681 tok/s step 12844/19560 | loss 3.302998 (-0.92z)| norm 0.2601 (+0.30z)| lr 2.32e-04 | 4157.89 ms | 32.5% bf16 MFU | 125702 tok/s step 12845/19560 | loss 3.267268 (-1.79z)| norm 0.2526 (-0.16z)| lr 2.32e-04 | 4167.86 ms | 32.4% bf16 MFU | 125706 tok/s step 12846/19560 | loss 3.331696 
(-0.16z)| norm 0.2661 (+0.66z)| lr 2.32e-04 | 4161.29 ms | 32.4% bf16 MFU | 125721 tok/s step 12847/19560 | loss 3.275484 (-1.56z)| norm 0.2615 (+0.39z)| lr 2.32e-04 | 4164.84 ms | 32.4% bf16 MFU | 125729 tok/s step 12848/19560 | loss 3.335755 (-0.03z)| norm 0.2616 (+0.40z)| lr 2.32e-04 | 4163.18 ms | 32.4% bf16 MFU | 125739 tok/s step 12849/19560 | loss 3.301505 (-0.89z)| norm 0.2372 (-1.08z)| lr 2.32e-04 | 4163.29 ms | 32.4% bf16 MFU | 125749 tok/s step 12850/19560 | loss 3.322672 (-0.36z)| norm 0.2529 (-0.11z)| lr 2.31e-04 | 4175.89 ms | 32.3% bf16 MFU | 125739 tok/s step 12851/19560 | loss 3.363995 (+0.68z)| norm 0.2710 (+1.01z)| lr 2.31e-04 | 4155.59 ms | 32.5% bf16 MFU | 125760 tok/s step 12852/19560 | loss 3.301428 (-0.89z)| norm 0.2341 (-1.25z)| lr 2.31e-04 | 4165.74 ms | 32.4% bf16 MFU | 125765 tok/s step 12853/19560 | loss 3.282160 (-1.36z)| norm 0.2512 (-0.19z)| lr 2.31e-04 | 4205.62 ms | 32.1% bf16 MFU | 125710 tok/s step 12854/19560 | loss 3.346070 (+0.26z)| norm 0.2636 (+0.57z)| lr 2.31e-04 | 4168.02 ms | 32.4% bf16 MFU | 125714 tok/s step 12855/19560 | loss 3.371495 (+0.91z)| norm 0.2507 (-0.22z)| lr 2.31e-04 | 4167.13 ms | 32.4% bf16 MFU | 125719 tok/s step 12856/19560 | loss 3.348456 (+0.32z)| norm 0.2773 (+1.42z)| lr 2.31e-04 | 4164.62 ms | 32.4% bf16 MFU | 125727 tok/s step 12857/19560 | loss 3.323192 (-0.31z)| norm 0.2550 (+0.04z)| lr 2.31e-04 | 4158.92 ms | 32.5% bf16 MFU | 125744 tok/s step 12858/19560 | loss 3.332738 (-0.08z)| norm 0.2595 (+0.32z)| lr 2.31e-04 | 4165.01 ms | 32.4% bf16 MFU | 125751 tok/s step 12859/19560 | loss 3.341476 (+0.15z)| norm 0.2399 (-0.89z)| lr 2.31e-04 | 4157.37 ms | 32.5% bf16 MFU | 125769 tok/s step 12860/19560 | loss 3.343367 (+0.19z)| norm 0.2569 (+0.17z)| lr 2.31e-04 | 4165.61 ms | 32.4% bf16 MFU | 125774 tok/s step 12861/19560 | loss 3.302708 (-0.84z)| norm 0.2471 (-0.43z)| lr 2.31e-04 | 4160.78 ms | 32.5% bf16 MFU | 125785 tok/s step 12862/19560 | loss 3.331088 (-0.11z)| norm 0.2612 (+0.48z)| lr 2.31e-04 | 
4165.76 ms | 32.4% bf16 MFU | 125789 tok/s step 12863/19560 | loss 3.378878 (+1.23z)| norm 0.2618 (+0.53z)| lr 2.31e-04 | 4168.55 ms | 32.4% bf16 MFU | 125788 tok/s step 12864/19560 | loss 3.347016 (+0.35z)| norm 0.2684 (+0.96z)| lr 2.31e-04 | 4198.24 ms | 32.2% bf16 MFU | 125743 tok/s step 12865/19560 | loss 3.313450 (-0.57z)| norm 0.2446 (-0.60z)| lr 2.31e-04 | 4163.81 ms | 32.4% bf16 MFU | 125751 tok/s step 12866/19560 | loss 3.331908 (-0.06z)| norm 0.2613 (+0.54z)| lr 2.31e-04 | 4165.98 ms | 32.4% bf16 MFU | 125756 tok/s step 12867/19560 | loss 3.295753 (-1.05z)| norm 0.2467 (-0.45z)| lr 2.30e-04 | 4219.63 ms | 32.0% bf16 MFU | 125681 tok/s step 12868/19560 | loss 3.288160 (-1.24z)| norm 0.2503 (-0.21z)| lr 2.30e-04 | 4164.27 ms | 32.4% bf16 MFU | 125692 tok/s step 12869/19560 | loss 3.368382 (+0.98z)| norm 0.2448 (-0.57z)| lr 2.30e-04 | 4212.53 ms | 32.1% bf16 MFU | 125630 tok/s step 12870/19560 | loss 3.356107 (+0.64z)| norm 0.2743 (+1.46z)| lr 2.30e-04 | 4159.88 ms | 32.5% bf16 MFU | 125651 tok/s step 12871/19560 | loss 3.296183 (-1.01z)| norm 0.2361 (-1.16z)| lr 2.30e-04 | 4162.69 ms | 32.4% bf16 MFU | 125665 tok/s step 12872/19560 | loss 3.320815 (-0.31z)| norm 0.2752 (+1.63z)| lr 2.30e-04 | 4185.07 ms | 32.3% bf16 MFU | 125646 tok/s step 12873/19560 | loss 3.291467 (-1.15z)| norm 0.2667 (+1.00z)| lr 2.30e-04 | 4161.01 ms | 32.4% bf16 MFU | 125664 tok/s step 12874/19560 | loss 3.332370 (+0.04z)| norm 0.2608 (+0.57z)| lr 2.30e-04 | 4210.21 ms | 32.1% bf16 MFU | 125607 tok/s step 12875/19560 | loss 3.301799 (-0.84z)| norm 0.2750 (+1.70z)| lr 2.30e-04 | 4163.59 ms | 32.4% bf16 MFU | 125623 tok/s step 12876/19560 | loss 3.316203 (-0.42z)| norm 0.2623 (+0.73z)| lr 2.30e-04 | 4161.17 ms | 32.4% bf16 MFU | 125641 tok/s step 12877/19560 | loss 3.332018 (+0.03z)| norm 0.2453 (-0.57z)| lr 2.30e-04 | 4165.92 ms | 32.4% bf16 MFU | 125652 tok/s step 12878/19560 | loss 3.346048 (+0.44z)| norm 0.2803 (+2.05z)| lr 2.30e-04 | 4166.67 ms | 32.4% bf16 MFU | 125661 tok/s step 
12879/19560 | loss 3.301644 (-0.87z)| norm 0.2524 (-0.05z)| lr 2.30e-04 | 4184.18 ms | 32.3% bf16 MFU | 125643 tok/s step 12880/19560 | loss 3.291988 (-1.14z)| norm 0.2878 (+2.54z)| lr 2.30e-04 | 4170.15 ms | 32.4% bf16 MFU | 125647 tok/s step 12881/19560 | loss 3.293962 (-1.08z)| norm 0.2607 (+0.53z)| lr 2.30e-04 | 4171.45 ms | 32.4% bf16 MFU | 125649 tok/s step 12882/19560 | loss 3.313634 (-0.49z)| norm 0.2570 (+0.24z)| lr 2.30e-04 | 4641.38 ms | 29.1% bf16 MFU | 125014 tok/s step 12883/19560 | loss 3.355648 (+0.77z)| norm 0.2517 (-0.14z)| lr 2.29e-04 | 4497.99 ms | 30.0% bf16 MFU | 124592 tok/s step 12884/19560 | loss 3.326450 (-0.09z)| norm 0.2517 (-0.15z)| lr 2.29e-04 | 4411.43 ms | 30.6% bf16 MFU | 124304 tok/s step 12885/19560 | loss 3.328670 (-0.02z)| norm 0.2691 (+1.15z)| lr 2.29e-04 | 4345.96 ms | 31.1% bf16 MFU | 124121 tok/s step 12886/19560 | loss 3.353427 (+0.73z)| norm 0.2433 (-0.79z)| lr 2.29e-04 | 4270.34 ms | 31.6% bf16 MFU | 124054 tok/s step 12887/19560 | loss 3.365569 (+1.08z)| norm 0.2612 (+0.58z)| lr 2.29e-04 | 4155.28 ms | 32.5% bf16 MFU | 124160 tok/s step 12888/19560 | loss 3.324644 (-0.16z)| norm 0.2419 (-0.92z)| lr 2.29e-04 | 4185.96 ms | 32.3% bf16 MFU | 124214 tok/s step 12889/19560 | loss 3.321583 (-0.25z)| norm 0.2483 (-0.42z)| lr 2.29e-04 | 4375.73 ms | 30.9% bf16 MFU | 123994 tok/s step 12890/19560 | loss 3.362350 (+0.97z)| norm 0.2473 (-0.50z)| lr 2.29e-04 | 4176.83 ms | 32.3% bf16 MFU | 124071 tok/s step 12891/19560 | loss 3.373487 (+1.30z)| norm 0.2338 (-1.54z)| lr 2.29e-04 | 4218.72 ms | 32.0% bf16 MFU | 124081 tok/s step 12892/19560 | loss 3.352426 (+0.67z)| norm 0.2656 (+0.92z)| lr 2.29e-04 | 4157.15 ms | 32.5% bf16 MFU | 124183 tok/s step 12893/19560 | loss 3.332477 (+0.05z)| norm 0.2480 (-0.46z)| lr 2.29e-04 | 4160.06 ms | 32.5% bf16 MFU | 124275 tok/s step 12894/19560 | loss 3.317461 (-0.40z)| norm 0.2702 (+1.26z)| lr 2.29e-04 | 4202.72 ms | 32.1% bf16 MFU | 124299 tok/s step 12895/19560 | loss 3.343009 (+0.37z)| norm 
0.2458 (-0.63z)| lr 2.29e-04 | 4182.77 ms | 32.3% bf16 MFU | 124351 tok/s step 12896/19560 | loss 3.337538 (+0.19z)| norm 0.2504 (-0.27z)| lr 2.29e-04 | 4157.79 ms | 32.5% bf16 MFU | 124439 tok/s step 12897/19560 | loss 3.361440 (+0.91z)| norm 0.2619 (+0.62z)| lr 2.29e-04 | 4160.16 ms | 32.5% bf16 MFU | 124518 tok/s step 12898/19560 | loss 3.253286 (-2.33z)| norm 0.2568 (+0.22z)| lr 2.29e-04 | 4161.82 ms | 32.4% bf16 MFU | 124591 tok/s step 12899/19560 | loss 3.343013 (+0.36z)| norm 0.2497 (-0.34z)| lr 2.28e-04 | 4159.99 ms | 32.5% bf16 MFU | 124663 tok/s step 12900/19560 | loss 3.279570 (-1.51z)| norm 0.2602 (+0.47z)| lr 2.28e-04 | 4155.31 ms | 32.5% bf16 MFU | 124738 tok/s step 12901/19560 | loss 3.387242 (+1.71z)| norm 0.2340 (-1.58z)| lr 2.28e-04 | 4216.15 ms | 32.0% bf16 MFU | 124719 tok/s step 12902/19560 | loss 3.319161 (-0.33z)| norm 0.2604 (+0.48z)| lr 2.28e-04 | 4160.65 ms | 32.5% bf16 MFU | 124784 tok/s step 12903/19560 | loss 3.326556 (-0.10z)| norm 0.2500 (-0.35z)| lr 2.28e-04 | 4163.16 ms | 32.4% bf16 MFU | 124841 tok/s step 12904/19560 | loss 3.291100 (-1.16z)| norm 0.2262 (-2.18z)| lr 2.28e-04 | 4158.42 ms | 32.5% bf16 MFU | 124903 tok/s step 12905/19560 | loss 3.336924 (+0.22z)| norm 0.2417 (-0.96z)| lr 2.28e-04 | 4161.05 ms | 32.4% bf16 MFU | 124958 tok/s step 12906/19560 | loss 3.373481 (+1.42z)| norm 0.2406 (-1.03z)| lr 2.28e-04 | 4161.53 ms | 32.4% bf16 MFU | 125009 tok/s step 12907/19560 | loss 3.316652 (-0.39z)| norm 0.2499 (-0.31z)| lr 2.28e-04 | 4160.65 ms | 32.5% bf16 MFU | 125059 tok/s step 12908/19560 | loss 3.344065 (+0.49z)| norm 0.2485 (-0.44z)| lr 2.28e-04 | 4158.16 ms | 32.5% bf16 MFU | 125111 tok/s step 12909/19560 | loss 3.349246 (+0.66z)| norm 0.2305 (-1.83z)| lr 2.28e-04 | 4160.93 ms | 32.4% bf16 MFU | 125155 tok/s step 12910/19560 | loss 3.380101 (+1.62z)| norm 0.2613 (+0.55z)| lr 2.28e-04 | 4163.43 ms | 32.4% bf16 MFU | 125194 tok/s step 12911/19560 | loss 3.339455 (+0.32z)| norm 0.2409 (-1.05z)| lr 2.28e-04 | 4161.94 ms | 
32.4% bf16 MFU | 125233 tok/s step 12912/19560 | loss 3.299615 (-0.94z)| norm 0.2391 (-1.20z)| lr 2.28e-04 | 4159.23 ms | 32.5% bf16 MFU | 125274 tok/s step 12913/19560 | loss 3.287678 (-1.32z)| norm 0.2484 (-0.47z)| lr 2.28e-04 | 4159.39 ms | 32.5% bf16 MFU | 125313 tok/s step 12914/19560 | loss 3.322447 (-0.20z)| norm 0.2293 (-1.96z)| lr 2.28e-04 | 4160.86 ms | 32.4% bf16 MFU | 125347 tok/s step 12915/19560 | loss 3.321610 (-0.23z)| norm 0.2452 (-0.71z)| lr 2.28e-04 | 4159.09 ms | 32.5% bf16 MFU | 125383 tok/s step 12916/19560 | loss 3.212783 (-3.59z)| norm 0.2377 (-1.28z)| lr 2.27e-04 | 4158.31 ms | 32.5% bf16 MFU | 125418 tok/s step 12917/19560 | loss 3.346357 (+0.59z)| norm 0.2547 (+0.04z)| lr 2.27e-04 | 4166.18 ms | 32.4% bf16 MFU | 125439 tok/s step 12918/19560 | loss 3.319143 (-0.27z)| norm 0.2289 (-1.93z)| lr 2.27e-04 | 4158.74 ms | 32.5% bf16 MFU | 125470 tok/s step 12919/19560 | loss 3.311405 (-0.51z)| norm 0.2492 (-0.38z)| lr 2.27e-04 | 4162.55 ms | 32.4% bf16 MFU | 125495 tok/s step 12920/19560 | loss 3.380660 (+1.67z)| norm 0.2412 (-0.99z)| lr 2.27e-04 | 4156.56 ms | 32.5% bf16 MFU | 125527 tok/s step 12921/19560 | loss 3.294547 (-1.05z)| norm 0.2381 (-1.21z)| lr 2.27e-04 | 4158.96 ms | 32.5% bf16 MFU | 125553 tok/s step 12922/19560 | loss 3.315409 (-0.36z)| norm 0.2471 (-0.52z)| lr 2.27e-04 | 4162.49 ms | 32.4% bf16 MFU | 125574 tok/s step 12923/19560 | loss 3.232142 (-2.96z)| norm 0.2447 (-0.69z)| lr 2.27e-04 | 4158.57 ms | 32.5% bf16 MFU | 125599 tok/s step 12924/19560 | loss 3.311417 (-0.45z)| norm 0.2319 (-1.64z)| lr 2.27e-04 | 4159.65 ms | 32.5% bf16 MFU | 125621 tok/s step 12925/19560 | loss 3.332501 (+0.22z)| norm 0.2483 (-0.39z)| lr 2.27e-04 | 4159.88 ms | 32.5% bf16 MFU | 125641 tok/s step 12926/19560 | loss 3.316573 (-0.30z)| norm 0.2465 (-0.52z)| lr 2.27e-04 | 4161.35 ms | 32.4% bf16 MFU | 125659 tok/s step 12927/19560 | loss 3.331830 (+0.18z)| norm 0.2648 (+0.85z)| lr 2.27e-04 | 4160.86 ms | 32.4% bf16 MFU | 125676 tok/s step 12928/19560 
| loss 3.352783 (+0.85z)| norm 0.2653 (+0.88z)| lr 2.27e-04 | 4160.60 ms | 32.5% bf16 MFU | 125693 tok/s step 12929/19560 | loss 3.320778 (-0.18z)| norm 0.2597 (+0.46z)| lr 2.27e-04 | 4159.92 ms | 32.5% bf16 MFU | 125710 tok/s step 12930/19560 | loss 3.332815 (+0.20z)| norm 0.2605 (+0.51z)| lr 2.27e-04 | 4162.21 ms | 32.4% bf16 MFU | 125723 tok/s step 12931/19560 | loss 3.327083 (+0.01z)| norm 0.2560 (+0.16z)| lr 2.27e-04 | 4156.35 ms | 32.5% bf16 MFU | 125744 tok/s step 12932/19560 | loss 3.345191 (+0.59z)| norm 0.2575 (+0.26z)| lr 2.26e-04 | 4158.15 ms | 32.5% bf16 MFU | 125761 tok/s step 12933/19560 | loss 3.353526 (+0.85z)| norm 0.2451 (-0.71z)| lr 2.26e-04 | 4164.22 ms | 32.4% bf16 MFU | 125768 tok/s step 12934/19560 | loss 3.322553 (-0.16z)| norm 0.2729 (+1.42z)| lr 2.26e-04 | 4156.69 ms | 32.5% bf16 MFU | 125786 tok/s step 12935/19560 | loss 3.359632 (+1.06z)| norm 0.2340 (-1.53z)| lr 2.26e-04 | 4161.84 ms | 32.4% bf16 MFU | 125795 tok/s step 12936/19560 | loss 3.356828 (+0.96z)| norm 0.2490 (-0.40z)| lr 2.26e-04 | 4157.44 ms | 32.5% bf16 MFU | 125811 tok/s step 12937/19560 | loss 3.393181 (+2.18z)| norm 0.2574 (+0.23z)| lr 2.26e-04 | 4162.20 ms | 32.4% bf16 MFU | 125819 tok/s step 12938/19560 | loss 3.295887 (-1.07z)| norm 0.2411 (-1.00z)| lr 2.26e-04 | 4158.50 ms | 32.5% bf16 MFU | 125832 tok/s step 12939/19560 | loss 3.399714 (+2.33z)| norm 0.2671 (+0.99z)| lr 2.26e-04 | 4158.40 ms | 32.5% bf16 MFU | 125844 tok/s step 12940/19560 | loss 3.336492 (+0.25z)| norm 0.2415 (-0.97z)| lr 2.26e-04 | 4163.75 ms | 32.4% bf16 MFU | 125848 tok/s step 12941/19560 | loss 3.278139 (-1.63z)| norm 0.2571 (+0.24z)| lr 2.26e-04 | 4157.28 ms | 32.5% bf16 MFU | 125861 tok/s step 12942/19560 | loss 3.349341 (+0.68z)| norm 0.2458 (-0.64z)| lr 2.26e-04 | 4159.43 ms | 32.5% bf16 MFU | 125870 tok/s step 12943/19560 | loss 3.335456 (+0.22z)| norm 0.2355 (-1.45z)| lr 2.26e-04 | 4158.40 ms | 32.5% bf16 MFU | 125881 tok/s step 12944/19560 | loss 3.295148 (-1.07z)| norm 0.2557 (+0.13z)| 
lr 2.26e-04 | 4196.60 ms | 32.2% bf16 MFU | 125833 tok/s step 12945/19560 | loss 3.358262 (+0.96z)| norm 0.2576 (+0.27z)| lr 2.26e-04 | 4182.66 ms | 32.3% bf16 MFU | 125809 tok/s step 12946/19560 | loss 3.222168 (-3.31z)| norm 0.2552 (+0.08z)| lr 2.26e-04 | 4155.46 ms | 32.5% bf16 MFU | 125827 tok/s step 12947/19560 | loss 3.311384 (-0.50z)| norm 0.2622 (+0.62z)| lr 2.26e-04 | 4158.06 ms | 32.5% bf16 MFU | 125840 tok/s step 12948/19560 | loss 3.296667 (-0.95z)| norm 0.2532 (-0.08z)| lr 2.25e-04 | 4156.80 ms | 32.5% bf16 MFU | 125854 tok/s step 12949/19560 | loss 3.350863 (+0.74z)| norm 0.2391 (-1.18z)| lr 2.25e-04 | 4158.81 ms | 32.5% bf16 MFU | 125865 tok/s step 12950/19560 | loss 3.388787 (+1.88z)| norm 0.2417 (-0.96z)| lr 2.25e-04 | 4159.14 ms | 32.5% bf16 MFU | 125875 tok/s step 12951/19560 | loss 3.402257 (+2.24z)| norm 0.2566 (+0.22z)| lr 2.25e-04 | 4161.93 ms | 32.4% bf16 MFU | 125880 tok/s step 12952/19560 | loss 3.317646 (-0.32z)| norm 0.2487 (-0.41z)| lr 2.25e-04 | 4160.18 ms | 32.5% bf16 MFU | 125887 tok/s step 12953/19560 | loss 3.343352 (+0.45z)| norm 0.2488 (-0.39z)| lr 2.25e-04 | 4159.69 ms | 32.5% bf16 MFU | 125894 tok/s step 12954/19560 | loss 3.309735 (-0.56z)| norm 0.2634 (+0.77z)| lr 2.25e-04 | 4159.28 ms | 32.5% bf16 MFU | 125902 tok/s step 12955/19560 | loss 3.325905 (-0.07z)| norm 0.2425 (-0.88z)| lr 2.25e-04 | 4160.19 ms | 32.5% bf16 MFU | 125909 tok/s step 12956/19560 | loss 3.303531 (-0.73z)| norm 0.2641 (+0.82z)| lr 2.25e-04 | 4160.27 ms | 32.5% bf16 MFU | 125914 tok/s step 12957/19560 | loss 3.344751 (+0.52z)| norm 0.2436 (-0.79z)| lr 2.25e-04 | 4172.31 ms | 32.4% bf16 MFU | 125901 tok/s step 12958/19560 | loss 3.342080 (+0.43z)| norm 0.2520 (-0.12z)| lr 2.25e-04 | 4160.39 ms | 32.5% bf16 MFU | 125907 tok/s step 12959/19560 | loss 3.292541 (-1.07z)| norm 0.2581 (+0.36z)| lr 2.25e-04 | 4161.55 ms | 32.4% bf16 MFU | 125911 tok/s step 12960/19560 | loss 3.330405 (+0.07z)| norm 0.2424 (-0.88z)| lr 2.25e-04 | 4159.05 ms | 32.5% bf16 MFU | 
125919 tok/s step 12961/19560 | loss 3.354094 (+0.80z)| norm 0.2432 (-0.81z)| lr 2.25e-04 | 4160.31 ms | 32.5% bf16 MFU | 125924 tok/s step 12962/19560 | loss 3.305206 (-0.68z)| norm 0.2476 (-0.45z)| lr 2.25e-04 | 4143.39 ms | 32.6% bf16 MFU | 125954 tok/s step 12963/19560 | loss 3.427086 (+2.95z)| norm 0.2788 (+2.00z)| lr 2.25e-04 | 4148.25 ms | 32.5% bf16 MFU | 125976 tok/s step 12964/19560 | loss 3.326681 (-0.03z)| norm 0.2451 (-0.65z)| lr 2.25e-04 | 4152.29 ms | 32.5% bf16 MFU | 125990 tok/s step 12965/19560 | loss 3.324736 (-0.09z)| norm 0.2482 (-0.40z)| lr 2.24e-04 | 4151.96 ms | 32.5% bf16 MFU | 126005 tok/s step 12966/19560 | loss 3.302119 (-0.76z)| norm 0.2584 (+0.44z)| lr 2.24e-04 | 4156.62 ms | 32.5% bf16 MFU | 126011 tok/s step 12967/19560 | loss 3.383937 (+1.65z)| norm 0.2565 (+0.28z)| lr 2.24e-04 | 4155.72 ms | 32.5% bf16 MFU | 126019 tok/s step 12968/19560 | loss 3.323457 (-0.13z)| norm 0.2510 (-0.15z)| lr 2.24e-04 | 4154.19 ms | 32.5% bf16 MFU | 126028 tok/s step 12969/19560 | loss 3.401835 (+2.12z)| norm 0.2521 (-0.05z)| lr 2.24e-04 | 4156.34 ms | 32.5% bf16 MFU | 126034 tok/s step 12970/19560 | loss 3.278983 (-1.41z)| norm 0.2517 (-0.07z)| lr 2.24e-04 | 4151.33 ms | 32.5% bf16 MFU | 126047 tok/s step 12971/19560 | loss 3.270557 (-1.63z)| norm 0.2430 (-0.80z)| lr 2.24e-04 | 4152.81 ms | 32.5% bf16 MFU | 126057 tok/s step 12972/19560 | loss 3.319545 (-0.24z)| norm 0.2505 (-0.16z)| lr 2.24e-04 | 4152.60 ms | 32.5% bf16 MFU | 126067 tok/s step 12973/19560 | loss 3.315320 (-0.38z)| norm 0.2611 (+0.74z)| lr 2.24e-04 | 4154.95 ms | 32.5% bf16 MFU | 126073 tok/s step 12974/19560 | loss 3.221724 (-2.94z)| norm 0.2388 (-1.14z)| lr 2.24e-04 | 4153.92 ms | 32.5% bf16 MFU | 126080 tok/s step 12975/19560 | loss 3.380340 (+1.45z)| norm 0.2500 (-0.19z)| lr 2.24e-04 | 4158.87 ms | 32.5% bf16 MFU | 126079 tok/s step 12976/19560 | loss 3.379082 (+1.39z)| norm 0.2450 (-0.59z)| lr 2.24e-04 | 4153.31 ms | 32.5% bf16 MFU | 126087 tok/s step 12977/19560 | loss 3.383060 
(+1.47z)| norm 0.2438 (-0.70z)| lr 2.24e-04 | 4158.73 ms | 32.5% bf16 MFU | 126086 tok/s step 12978/19560 | loss 3.317613 (-0.32z)| norm 0.2824 (+2.52z)| lr 2.24e-04 | 4153.12 ms | 32.5% bf16 MFU | 126094 tok/s step 12979/19560 | loss 3.309970 (-0.52z)| norm 0.2377 (-1.21z)| lr 2.24e-04 | 4157.83 ms | 32.5% bf16 MFU | 126094 tok/s step 12980/19560 | loss 3.386518 (+1.55z)| norm 0.2976 (+3.61z)| lr 2.24e-04 | 4156.82 ms | 32.5% bf16 MFU | 126095 tok/s step 12981/19560 | loss 3.338624 (+0.24z)| norm 0.2370 (-1.22z)| lr 2.23e-04 | 4156.00 ms | 32.5% bf16 MFU | 126098 tok/s step 12982/19560 | loss 3.337224 (+0.20z)| norm 0.2730 (+1.62z)| lr 2.23e-04 | 4155.27 ms | 32.5% bf16 MFU | 126102 tok/s step 12983/19560 | loss 3.308034 (-0.59z)| norm 0.2374 (-1.18z)| lr 2.23e-04 | 4158.99 ms | 32.5% bf16 MFU | 126100 tok/s step 12984/19560 | loss 3.341859 (+0.35z)| norm 0.2688 (+1.31z)| lr 2.23e-04 | 4157.72 ms | 32.5% bf16 MFU | 126100 tok/s step 12985/19560 | loss 3.340554 (+0.31z)| norm 0.2786 (+2.04z)| lr 2.23e-04 | 4155.19 ms | 32.5% bf16 MFU | 126104 tok/s step 12986/19560 | loss 3.343287 (+0.38z)| norm 0.2891 (+2.76z)| lr 2.23e-04 | 4155.20 ms | 32.5% bf16 MFU | 126107 tok/s step 12987/19560 | loss 3.355167 (+0.70z)| norm 0.2610 (+0.62z)| lr 2.23e-04 | 4159.72 ms | 32.5% bf16 MFU | 126104 tok/s step 12988/19560 | loss 3.397430 (+1.83z)| norm 0.2879 (+2.57z)| lr 2.23e-04 | 4156.77 ms | 32.5% bf16 MFU | 126105 tok/s step 12989/19560 | loss 3.290539 (-1.07z)| norm 0.2392 (-1.02z)| lr 2.23e-04 | 4160.42 ms | 32.5% bf16 MFU | 126101 tok/s step 12990/19560 | loss 3.362343 (+0.87z)| norm 0.2540 (+0.07z)| lr 2.23e-04 | 4155.19 ms | 32.5% bf16 MFU | 126105 tok/s step 12991/19560 | loss 3.362796 (+0.89z)| norm 0.2301 (-1.66z)| lr 2.23e-04 | 4156.73 ms | 32.5% bf16 MFU | 126106 tok/s step 12992/19560 | loss 3.349378 (+0.52z)| norm 0.2574 (+0.34z)| lr 2.23e-04 | 4164.01 ms | 32.4% bf16 MFU | 126096 tok/s step 12993/19560 | loss 3.314187 (-0.43z)| norm 0.2504 (-0.17z)| lr 2.23e-04 | 
4159.02 ms | 32.5% bf16 MFU | 126094 tok/s step 12994/19560 | loss 3.305845 (-0.65z)| norm 0.2504 (-0.17z)| lr 2.23e-04 | 4157.62 ms | 32.5% bf16 MFU | 126095 tok/s step 12995/19560 | loss 3.320960 (-0.25z)| norm 0.2444 (-0.61z)| lr 2.23e-04 | 4256.41 ms | 31.7% bf16 MFU | 125949 tok/s step 12996/19560 | loss 3.306605 (-0.65z)| norm 0.2592 (+0.48z)| lr 2.23e-04 | 4285.13 ms | 31.5% bf16 MFU | 125769 tok/s step 12997/19560 | loss 3.317368 (-0.34z)| norm 0.2411 (-0.85z)| lr 2.23e-04 | 4324.28 ms | 31.2% bf16 MFU | 125543 tok/s step 12998/19560 | loss 3.260284 (-1.86z)| norm 0.2561 (+0.27z)| lr 2.22e-04 | 4193.30 ms | 32.2% bf16 MFU | 125517 tok/s step 12999/19560 | loss 3.369322 (+1.07z)| norm 0.2474 (-0.39z)| lr 2.22e-04 | 4158.26 ms | 32.5% bf16 MFU | 125545 tok/s step 13000/19560 | loss 3.304787 (-0.67z)| norm 0.2429 (-0.71z)| lr 2.22e-04 | 4184.75 ms | 32.3% bf16 MFU | 125532 tok/s val loss 3.317390
HellaSwag: 2953/10042 = 0.294065 step 13001/19560 | loss 3.302956 (-0.72z)| norm 0.2485 (-0.28z)| lr 2.22e-04 | 4159.94 ms | 32.5% bf16 MFU | 125557 tok/s step 13002/19560 | loss 3.325162 (-0.12z)| norm 0.2558 (+0.27z)| lr 2.22e-04 | 4198.46 ms | 32.2% bf16 MFU | 125523 tok/s step 13003/19560 | loss 3.310920 (-0.51z)| norm 0.2445 (-0.57z)| lr 2.22e-04 | 4157.82 ms | 32.5% bf16 MFU | 125552 tok/s step 13004/19560 | loss 3.336902 (+0.19z)| norm 0.2588 (+0.53z)| lr 2.22e-04 | 4159.03 ms | 32.5% bf16 MFU | 125577 tok/s step 13005/19560 | loss 3.403985 (+1.96z)| norm 0.2717 (+1.48z)| lr 2.22e-04 | 4171.54 ms | 32.4% bf16 MFU | 125583 tok/s step 13006/19560 | loss 3.377223 (+1.23z)| norm 0.2416 (-0.79z)| lr 2.22e-04 | 4163.06 ms | 32.4% bf16 MFU | 125600 tok/s step 13007/19560 | loss 3.324961 (-0.16z)| norm 0.2555 (+0.28z)| lr 2.22e-04 | 4161.20 ms | 32.4% bf16 MFU | 125620 tok/s step 13008/19560 | loss 3.286806 (-1.17z)| norm 0.2385 (-1.03z)| lr 2.22e-04 | 4161.21 ms | 32.4% bf16 MFU | 125639 tok/s step 13009/19560 | loss 3.329833 (-0.03z)| norm 0.2608 (+0.74z)| lr 2.22e-04 | 4161.76 ms | 32.4% bf16 MFU | 125656 tok/s step 13010/19560 | loss
3.308141 (-0.61z)| norm 0.2510 (-0.03z)| lr 2.22e-04 | 4161.42 ms | 32.4% bf16 MFU | 125672 tok/s step 13011/19560 | loss 3.300211 (-0.81z)| norm 0.2407 (-0.84z)| lr 2.22e-04 | 4186.60 ms | 32.2% bf16 MFU | 125650 tok/s step 13012/19560 | loss 3.289881 (-1.07z)| norm 0.2417 (-0.75z)| lr 2.22e-04 | 4162.53 ms | 32.4% bf16 MFU | 125665 tok/s step 13013/19560 | loss 3.268056 (-1.62z)| norm 0.2595 (+0.67z)| lr 2.22e-04 | 4155.73 ms | 32.5% bf16 MFU | 125690 tok/s step 13014/19560 | loss 3.313453 (-0.42z)| norm 0.2458 (-0.43z)| lr 2.21e-04 | 4162.63 ms | 32.4% bf16 MFU | 125703 tok/s step 13015/19560 | loss 3.318947 (-0.27z)| norm 0.2481 (-0.24z)| lr 2.21e-04 | 4159.80 ms | 32.5% bf16 MFU | 125720 tok/s step 13016/19560 | loss 3.319887 (-0.24z)| norm 0.2518 (+0.05z)| lr 2.21e-04 | 4168.42 ms | 32.4% bf16 MFU | 125723 tok/s step 13017/19560 | loss 3.308132 (-0.55z)| norm 0.2441 (-0.56z)| lr 2.21e-04 | 4157.67 ms | 32.5% bf16 MFU | 125742 tok/s step 13018/19560 | loss 3.339776 (+0.29z)| norm 0.2513 (+0.01z)| lr 2.21e-04 | 4160.86 ms | 32.4% bf16 MFU | 125755 tok/s step 13019/19560 | loss 3.334543 (+0.16z)| norm 0.2690 (+1.40z)| lr 2.21e-04 | 4159.76 ms | 32.5% bf16 MFU | 125769 tok/s step 13020/19560 | loss 3.341201 (+0.34z)| norm 0.2406 (-0.85z)| lr 2.21e-04 | 4156.82 ms | 32.5% bf16 MFU | 125787 tok/s step 13021/19560 | loss 3.390179 (+1.61z)| norm 0.2580 (+0.54z)| lr 2.21e-04 | 4157.22 ms | 32.5% bf16 MFU | 125803 tok/s step 13022/19560 | loss 3.326674 (-0.06z)| norm 0.2831 (+2.49z)| lr 2.21e-04 | 4160.47 ms | 32.5% bf16 MFU | 125814 tok/s step 13023/19560 | loss 3.330512 (+0.04z)| norm 0.2521 (+0.05z)| lr 2.21e-04 | 4157.14 ms | 32.5% bf16 MFU | 125829 tok/s step 13024/19560 | loss 3.288512 (-1.05z)| norm 0.2481 (-0.26z)| lr 2.21e-04 | 4163.62 ms | 32.4% bf16 MFU | 125834 tok/s step 13025/19560 | loss 3.337405 (+0.24z)| norm 0.2554 (+0.31z)| lr 2.21e-04 | 4158.67 ms | 32.5% bf16 MFU | 125846 tok/s step 13026/19560 | loss 3.369593 (+1.07z)| norm 0.2652 (+1.08z)| lr 
2.21e-04 | 4158.23 ms | 32.5% bf16 MFU | 125858 tok/s step 13027/19560 | loss 3.286843 (-1.11z)| norm 0.2644 (+1.00z)| lr 2.21e-04 | 4160.10 ms | 32.5% bf16 MFU | 125866 tok/s step 13028/19560 | loss 3.344977 (+0.42z)| norm 0.2547 (+0.25z)| lr 2.21e-04 | 4157.24 ms | 32.5% bf16 MFU | 125878 tok/s step 13029/19560 | loss 3.267631 (-1.61z)| norm 0.2455 (-0.49z)| lr 2.21e-04 | 4160.00 ms | 32.5% bf16 MFU | 125886 tok/s step 13030/19560 | loss 3.315565 (-0.34z)| norm 0.2413 (-0.80z)| lr 2.21e-04 | 4158.27 ms | 32.5% bf16 MFU | 125896 tok/s step 13031/19560 | loss 3.368099 (+1.04z)| norm 0.2380 (-1.05z)| lr 2.20e-04 | 4157.60 ms | 32.5% bf16 MFU | 125906 tok/s step 13032/19560 | loss 3.316175 (-0.34z)| norm 0.2544 (+0.22z)| lr 2.20e-04 | 4159.28 ms | 32.5% bf16 MFU | 125914 tok/s step 13033/19560 | loss 3.379867 (+1.33z)| norm 0.2505 (-0.09z)| lr 2.20e-04 | 4155.38 ms | 32.5% bf16 MFU | 125926 tok/s step 13034/19560 | loss 3.274874 (-1.41z)| norm 0.2415 (-0.82z)| lr 2.20e-04 | 4159.87 ms | 32.5% bf16 MFU | 125932 tok/s step 13035/19560 | loss 3.288278 (-1.04z)| norm 0.2447 (-0.56z)| lr 2.20e-04 | 4155.91 ms | 32.5% bf16 MFU | 125943 tok/s step 13036/19560 | loss 3.329066 (+0.03z)| norm 0.2432 (-0.67z)| lr 2.20e-04 | 4158.23 ms | 32.5% bf16 MFU | 125950 tok/s step 13037/19560 | loss 3.307395 (-0.53z)| norm 0.2594 (+0.61z)| lr 2.20e-04 | 4158.39 ms | 32.5% bf16 MFU | 125957 tok/s step 13038/19560 | loss 3.354187 (+0.70z)| norm 0.2551 (+0.26z)| lr 2.20e-04 | 4159.44 ms | 32.5% bf16 MFU | 125961 tok/s step 13039/19560 | loss 3.318354 (-0.24z)| norm 0.2477 (-0.33z)| lr 2.20e-04 | 4156.13 ms | 32.5% bf16 MFU | 125970 tok/s step 13040/19560 | loss 3.348926 (+0.56z)| norm 0.2387 (-1.07z)| lr 2.20e-04 | 4159.14 ms | 32.5% bf16 MFU | 125975 tok/s step 13041/19560 | loss 3.290446 (-0.98z)| norm 0.2341 (-1.41z)| lr 2.20e-04 | 4157.79 ms | 32.5% bf16 MFU | 125981 tok/s step 13042/19560 | loss 3.304450 (-0.61z)| norm 0.2408 (-0.89z)| lr 2.20e-04 | 4158.04 ms | 32.5% bf16 MFU | 125986 
tok/s step 13043/19560 | loss 3.287796 (-1.04z)| norm 0.2334 (-1.47z)| lr 2.20e-04 | 4156.40 ms | 32.5% bf16 MFU | 125994 tok/s step 13044/19560 | loss 3.282328 (-1.23z)| norm 0.2391 (-1.02z)| lr 2.20e-04 | 4156.35 ms | 32.5% bf16 MFU | 126001 tok/s step 13045/19560 | loss 3.335172 (+0.20z)| norm 0.2418 (-0.79z)| lr 2.20e-04 | 4158.52 ms | 32.5% bf16 MFU | 126005 tok/s step 13046/19560 | loss 3.301470 (-0.71z)| norm 0.2510 (-0.07z)| lr 2.20e-04 | 4159.58 ms | 32.5% bf16 MFU | 126007 tok/s step 13047/19560 | loss 3.421670 (+2.46z)| norm 0.2379 (-1.12z)| lr 2.19e-04 | 4156.49 ms | 32.5% bf16 MFU | 126014 tok/s step 13048/19560 | loss 3.277256 (-1.33z)| norm 0.2368 (-1.20z)| lr 2.19e-04 | 4158.57 ms | 32.5% bf16 MFU | 126017 tok/s step 13049/19560 | loss 3.309964 (-0.47z)| norm 0.2594 (+0.61z)| lr 2.19e-04 | 4156.73 ms | 32.5% bf16 MFU | 126022 tok/s step 13050/19560 | loss 3.330405 (+0.06z)| norm 0.2597 (+0.63z)| lr 2.19e-04 | 4156.99 ms | 32.5% bf16 MFU | 126027 tok/s step 13051/19560 | loss 3.279205 (-1.33z)| norm 0.2377 (-1.14z)| lr 2.19e-04 | 4156.91 ms | 32.5% bf16 MFU | 126032 tok/s step 13052/19560 | loss 3.314825 (-0.37z)| norm 0.2597 (+0.62z)| lr 2.19e-04 | 4159.61 ms | 32.5% bf16 MFU | 126033 tok/s step 13053/19560 | loss 3.287787 (-1.08z)| norm 0.2562 (+0.33z)| lr 2.19e-04 | 4156.83 ms | 32.5% bf16 MFU | 126037 tok/s step 13054/19560 | loss 3.303635 (-0.65z)| norm 0.2282 (-1.92z)| lr 2.19e-04 | 4156.22 ms | 32.5% bf16 MFU | 126043 tok/s step 13055/19560 | loss 3.282000 (-1.22z)| norm 0.2570 (+0.41z)| lr 2.19e-04 | 4158.86 ms | 32.5% bf16 MFU | 126044 tok/s step 13056/19560 | loss 3.319339 (-0.21z)| norm 0.2368 (-1.20z)| lr 2.19e-04 | 4159.37 ms | 32.5% bf16 MFU | 126044 tok/s step 13057/19560 | loss 3.290664 (-0.97z)| norm 0.2417 (-0.80z)| lr 2.19e-04 | 4162.39 ms | 32.4% bf16 MFU | 126040 tok/s step 13058/19560 | loss 3.352703 (+0.68z)| norm 0.2433 (-0.66z)| lr 2.19e-04 | 4156.43 ms | 32.5% bf16 MFU | 126045 tok/s step 13059/19560 | loss 3.364055 
(+0.97z)| norm 0.2601 (+0.69z)| lr 2.19e-04 | 4156.58 ms | 32.5% bf16 MFU | 126049 tok/s step 13060/19560 | loss 3.320478 (-0.18z)| norm 0.2496 (-0.15z)| lr 2.19e-04 | 4160.33 ms | 32.5% bf16 MFU | 126048 tok/s step 13061/19560 | loss 3.322060 (-0.13z)| norm 0.2424 (-0.73z)| lr 2.19e-04 | 4156.15 ms | 32.5% bf16 MFU | 126053 tok/s step 13062/19560 | loss 3.391405 (+1.68z)| norm 0.2440 (-0.59z)| lr 2.19e-04 | 4156.60 ms | 32.5% bf16 MFU | 126057 tok/s step 13063/19560 | loss 3.383746 (+1.46z)| norm 0.2570 (+0.46z)| lr 2.19e-04 | 4157.34 ms | 32.5% bf16 MFU | 126060 tok/s step 13064/19560 | loss 3.342406 (+0.39z)| norm 0.2555 (+0.34z)| lr 2.18e-04 | 4154.53 ms | 32.5% bf16 MFU | 126066 tok/s step 13065/19560 | loss 3.371342 (+1.15z)| norm 0.2550 (+0.29z)| lr 2.18e-04 | 4159.94 ms | 32.5% bf16 MFU | 126065 tok/s step 13066/19560 | loss 3.376162 (+1.26z)| norm 0.2693 (+1.44z)| lr 2.18e-04 | 4156.69 ms | 32.5% bf16 MFU | 126068 tok/s step 13067/19560 | loss 3.268483 (-1.54z)| norm 0.2396 (-0.96z)| lr 2.18e-04 | 4157.47 ms | 32.5% bf16 MFU | 126070 tok/s step 13068/19560 | loss 3.358721 (+0.83z)| norm 0.2830 (+2.49z)| lr 2.18e-04 | 4155.98 ms | 32.5% bf16 MFU | 126074 tok/s step 13069/19560 | loss 3.327929 (+0.01z)| norm 0.2465 (-0.41z)| lr 2.18e-04 | 4157.46 ms | 32.5% bf16 MFU | 126076 tok/s step 13070/19560 | loss 3.329994 (+0.07z)| norm 0.2654 (+1.08z)| lr 2.18e-04 | 4155.57 ms | 32.5% bf16 MFU | 126080 tok/s step 13071/19560 | loss 3.313721 (-0.36z)| norm 0.2662 (+1.13z)| lr 2.18e-04 | 4158.69 ms | 32.5% bf16 MFU | 126080 tok/s step 13072/19560 | loss 3.317876 (-0.26z)| norm 0.2524 (+0.03z)| lr 2.18e-04 | 4157.92 ms | 32.5% bf16 MFU | 126081 tok/s step 13073/19560 | loss 3.369771 (+1.12z)| norm 0.2595 (+0.60z)| lr 2.18e-04 | 4763.88 ms | 28.3% bf16 MFU | 125279 tok/s step 13074/19560 | loss 3.382391 (+1.45z)| norm 0.2490 (-0.24z)| lr 2.18e-04 | 4582.00 ms | 29.5% bf16 MFU | 124736 tok/s step 13075/19560 | loss 3.301090 (-0.75z)| norm 0.2478 (-0.33z)| lr 2.18e-04 | 
4515.64 ms | 29.9% bf16 MFU | 124305 tok/s step 13076/19560 | loss 3.392982 (+1.70z)| norm 0.2653 (+1.06z)| lr 2.18e-04 | 4394.42 ms | 30.7% bf16 MFU | 124055 tok/s step 13077/19560 | loss 3.364695 (+0.94z)| norm 0.2346 (-1.37z)| lr 2.18e-04 | 4378.04 ms | 30.8% bf16 MFU | 123840 tok/s step 13078/19560 | loss 3.293067 (-0.97z)| norm 0.2615 (+0.74z)| lr 2.18e-04 | 4283.50 ms | 31.5% bf16 MFU | 123768 tok/s step 13079/19560 | loss 3.361829 (+0.91z)| norm 0.2549 (+0.22z)| lr 2.18e-04 | 4332.30 ms | 31.2% bf16 MFU | 123630 tok/s step 13080/19560 | loss 3.306696 (-0.59z)| norm 0.2312 (-1.63z)| lr 2.17e-04 | 4356.24 ms | 31.0% bf16 MFU | 123467 tok/s step 13081/19560 | loss 3.325269 (-0.08z)| norm 0.2610 (+0.70z)| lr 2.17e-04 | 4522.58 ms | 29.9% bf16 MFU | 123090 tok/s step 13082/19560 | loss 3.322969 (-0.15z)| norm 0.2512 (-0.06z)| lr 2.17e-04 | 4262.94 ms | 31.7% bf16 MFU | 123084 tok/s step 13083/19560 | loss 3.348581 (+0.54z)| norm 0.2550 (+0.23z)| lr 2.17e-04 | 4164.06 ms | 32.4% bf16 MFU | 123226 tok/s step 13084/19560 | loss 3.281278 (-1.28z)| norm 0.2501 (-0.14z)| lr 2.17e-04 | 4208.15 ms | 32.1% bf16 MFU | 123294 tok/s step 13085/19560 | loss 3.292264 (-0.97z)| norm 0.2663 (+1.12z)| lr 2.17e-04 | 4280.08 ms | 31.5% bf16 MFU | 123254 tok/s step 13086/19560 | loss 3.393935 (+1.75z)| norm 0.2588 (+0.52z)| lr 2.17e-04 | 4198.44 ms | 32.2% bf16 MFU | 123335 tok/s step 13087/19560 | loss 3.400168 (+1.88z)| norm 0.2484 (-0.29z)| lr 2.17e-04 | 4205.43 ms | 32.1% bf16 MFU | 123402 tok/s step 13088/19560 | loss 3.316102 (-0.35z)| norm 0.2783 (+2.01z)| lr 2.17e-04 | 4239.93 ms | 31.8% bf16 MFU | 123414 tok/s step 13089/19560 | loss 3.342231 (+0.35z)| norm 0.2414 (-0.85z)| lr 2.17e-04 | 4164.91 ms | 32.4% bf16 MFU | 123538 tok/s step 13090/19560 | loss 3.321602 (-0.20z)| norm 0.2567 (+0.33z)| lr 2.17e-04 | 4175.86 ms | 32.3% bf16 MFU | 123638 tok/s step 13091/19560 | loss 3.310176 (-0.49z)| norm 0.2454 (-0.53z)| lr 2.17e-04 | 4349.37 ms | 31.0% bf16 MFU | 123484 tok/s step 
13092/19560 | loss 3.388927 (+1.62z)| norm 0.2511 (-0.09z)| lr 2.17e-04 | 4160.25 ms | 32.5% bf16 MFU | 123611 tok/s step 13093/19560 | loss 3.305608 (-0.62z)| norm 0.2605 (+0.65z)| lr 2.17e-04 | 4156.70 ms | 32.5% bf16 MFU | 123737 tok/s step 13094/19560 | loss 3.390503 (+1.63z)| norm 0.2519 (-0.03z)| lr 2.17e-04 | 4168.00 ms | 32.4% bf16 MFU | 123839 tok/s step 13095/19560 | loss 3.352077 (+0.62z)| norm 0.2594 (+0.56z)| lr 2.17e-04 | 4180.71 ms | 32.3% bf16 MFU | 123918 tok/s step 13096/19560 | loss 3.346587 (+0.47z)| norm 0.2467 (-0.44z)| lr 2.17e-04 | 4827.83 ms | 28.0% bf16 MFU | 123152 tok/s step 13097/19560 | loss 3.298840 (-0.81z)| norm 0.2431 (-0.71z)| lr 2.16e-04 | 4159.32 ms | 32.5% bf16 MFU | 123297 tok/s step 13098/19560 | loss 3.243438 (-2.27z)| norm 0.2481 (-0.31z)| lr 2.16e-04 | 4223.60 ms | 32.0% bf16 MFU | 123338 tok/s step 13099/19560 | loss 3.399921 (+1.89z)| norm 0.2597 (+0.58z)| lr 2.16e-04 | 4167.67 ms | 32.4% bf16 MFU | 123461 tok/s step 13100/19560 | loss 3.300297 (-0.77z)| norm 0.2316 (-1.60z)| lr 2.16e-04 | 4178.31 ms | 32.3% bf16 MFU | 123562 tok/s step 13101/19560 | loss 3.387231 (+1.52z)| norm 0.2607 (+0.66z)| lr 2.16e-04 | 4166.00 ms | 32.4% bf16 MFU | 123677 tok/s step 13102/19560 | loss 3.268127 (-1.67z)| norm 0.2245 (-2.11z)| lr 2.16e-04 | 4167.72 ms | 32.4% bf16 MFU | 123783 tok/s step 13103/19560 | loss 3.334998 (+0.15z)| norm 0.2463 (-0.44z)| lr 2.16e-04 | 4166.32 ms | 32.4% bf16 MFU | 123885 tok/s step 13104/19560 | loss 3.298337 (-0.84z)| norm 0.2334 (-1.41z)| lr 2.16e-04 | 4163.64 ms | 32.4% bf16 MFU | 123987 tok/s step 13105/19560 | loss 3.356126 (+0.75z)| norm 0.2413 (-0.80z)| lr 2.16e-04 | 4158.68 ms | 32.5% bf16 MFU | 124091 tok/s step 13106/19560 | loss 3.312891 (-0.44z)| norm 0.2513 (-0.03z)| lr 2.16e-04 | 4178.21 ms | 32.3% bf16 MFU | 124161 tok/s step 13107/19560 | loss 3.366740 (+1.03z)| norm 0.2310 (-1.58z)| lr 2.16e-04 | 4204.75 ms | 32.1% bf16 MFU | 124187 tok/s step 13108/19560 | loss 3.333368 (+0.13z)| norm 
0.2528 (+0.13z)| lr 2.16e-04 | 4164.16 ms | 32.4% bf16 MFU | 124273 tok/s step 13109/19560 | loss 3.380746 (+1.42z)| norm 0.2501 (-0.10z)| lr 2.16e-04 | 4168.40 ms | 32.4% bf16 MFU | 124348 tok/s step 13110/19560 | loss 3.344697 (+0.43z)| norm 0.2398 (-0.93z)| lr 2.16e-04 | 4169.87 ms | 32.4% bf16 MFU | 124418 tok/s step 13111/19560 | loss 3.361498 (+0.88z)| norm 0.2575 (+0.52z)| lr 2.16e-04 | 4202.32 ms | 32.1% bf16 MFU | 124435 tok/s step 13112/19560 | loss 3.314570 (-0.40z)| norm 0.2478 (-0.27z)| lr 2.16e-04 | 4183.77 ms | 32.3% bf16 MFU | 124479 tok/s step 13113/19560 | loss 3.346501 (+0.47z)| norm 0.2465 (-0.37z)| lr 2.16e-04 | 4228.40 ms | 31.9% bf16 MFU | 124454 tok/s step 13114/19560 | loss 3.360029 (+0.83z)| norm 0.2457 (-0.43z)| lr 2.15e-04 | 4166.16 ms | 32.4% bf16 MFU | 124524 tok/s step 13115/19560 | loss 3.321034 (-0.23z)| norm 0.2334 (-1.49z)| lr 2.15e-04 | 4167.27 ms | 32.4% bf16 MFU | 124588 tok/s step 13116/19560 | loss 3.353266 (+0.67z)| norm 0.2713 (+1.93z)| lr 2.15e-04 | 4163.59 ms | 32.4% bf16 MFU | 124655 tok/s step 13117/19560 | loss 3.334202 (+0.14z)| norm 0.2406 (-0.87z)| lr 2.15e-04 | 4164.26 ms | 32.4% bf16 MFU | 124717 tok/s step 13118/19560 | loss 3.333853 (+0.13z)| norm 0.2477 (-0.22z)| lr 2.15e-04 | 4183.19 ms | 32.3% bf16 MFU | 124748 tok/s step 13119/19560 | loss 3.324431 (-0.12z)| norm 0.2397 (-0.96z)| lr 2.15e-04 | 4170.11 ms | 32.4% bf16 MFU | 124797 tok/s step 13120/19560 | loss 3.360089 (+0.87z)| norm 0.2404 (-0.88z)| lr 2.15e-04 | 4171.41 ms | 32.4% bf16 MFU | 124841 tok/s step 13121/19560 | loss 3.306920 (-0.61z)| norm 0.2408 (-0.84z)| lr 2.15e-04 | 4167.11 ms | 32.4% bf16 MFU | 124890 tok/s step 13122/19560 | loss 3.329284 (+0.01z)| norm 0.2459 (-0.37z)| lr 2.15e-04 | 4166.88 ms | 32.4% bf16 MFU | 124937 tok/s step 13123/19560 | loss 3.343423 (+0.40z)| norm 0.2213 (-2.55z)| lr 2.15e-04 | 4171.95 ms | 32.4% bf16 MFU | 124973 tok/s step 13124/19560 | loss 3.368791 (+1.09z)| norm 0.2464 (-0.29z)| lr 2.15e-04 | 4179.50 ms | 
32.3% bf16 MFU | 124997 tok/s step 13125/19560 | loss 3.299891 (-0.82z)| norm 0.2318 (-1.58z)| lr 2.15e-04 | 4169.23 ms | 32.4% bf16 MFU | 125035 tok/s step 13126/19560 | loss 3.320637 (-0.26z)| norm 0.2335 (-1.41z)| lr 2.15e-04 | 4173.57 ms | 32.4% bf16 MFU | 125064 tok/s step 13127/19560 | loss 3.300722 (-0.81z)| norm 0.2295 (-1.73z)| lr 2.15e-04 | 4182.56 ms | 32.3% bf16 MFU | 125078 tok/s step 13128/19560 | loss 3.322023 (-0.21z)| norm 0.2463 (-0.26z)| lr 2.15e-04 | 4214.60 ms | 32.0% bf16 MFU | 125044 tok/s step 13129/19560 | loss 3.327008 (-0.08z)| norm 0.2445 (-0.42z)| lr 2.15e-04 | 4177.27 ms | 32.3% bf16 MFU | 125068 tok/s step 13130/19560 | loss 3.384814 (+1.54z)| norm 0.2570 (+0.68z)| lr 2.14e-04 | 4171.40 ms | 32.4% bf16 MFU | 125098 tok/s step 13131/19560 | loss 3.320041 (-0.29z)| norm 0.2571 (+0.68z)| lr 2.14e-04 | 4195.55 ms | 32.2% bf16 MFU | 125092 tok/s step 13132/19560 | loss 3.387113 (+1.57z)| norm 0.2666 (+1.49z)| lr 2.14e-04 | 4169.93 ms | 32.4% bf16 MFU | 125124 tok/s step 13133/19560 | loss 3.292997 (-1.04z)| norm 0.2566 (+0.64z)| lr 2.14e-04 | 4195.38 ms | 32.2% bf16 MFU | 125116 tok/s step 13134/19560 | loss 3.323710 (-0.16z)| norm 0.2720 (+1.95z)| lr 2.14e-04 | 4180.26 ms | 32.3% bf16 MFU | 125131 tok/s step 13135/19560 | loss 3.308548 (-0.59z)| norm 0.2464 (-0.26z)| lr 2.14e-04 | 4159.50 ms | 32.5% bf16 MFU | 125177 tok/s step 13136/19560 | loss 3.374199 (+1.26z)| norm 0.2499 (+0.03z)| lr 2.14e-04 | 4165.94 ms | 32.4% bf16 MFU | 125211 tok/s step 13137/19560 | loss 3.372217 (+1.18z)| norm 0.2790 (+2.50z)| lr 2.14e-04 | 4167.99 ms | 32.4% bf16 MFU | 125239 tok/s step 13138/19560 | loss 3.296429 (-0.95z)| norm 0.2601 (+0.88z)| lr 2.14e-04 | 4161.08 ms | 32.4% bf16 MFU | 125277 tok/s step 13139/19560 | loss 3.308960 (-0.60z)| norm 0.2484 (-0.13z)| lr 2.14e-04 | 4167.68 ms | 32.4% bf16 MFU | 125303 tok/s step 13140/19560 | loss 3.276092 (-1.52z)| norm 0.2558 (+0.50z)| lr 2.14e-04 | 4165.11 ms | 32.4% bf16 MFU | 125332 tok/s step 13141/19560 
| loss 3.388636 (+1.62z)| norm 0.2622 (+1.04z)| lr 2.14e-04 | 4171.57 ms | 32.4% bf16 MFU | 125350 tok/s step 13142/19560 | loss 3.289967 (-1.15z)| norm 0.2631 (+1.10z)| lr 2.14e-04 | 4177.22 ms | 32.3% bf16 MFU | 125358 tok/s step 13143/19560 | loss 3.349326 (+0.51z)| norm 0.2481 (-0.17z)| lr 2.14e-04 | 4165.89 ms | 32.4% bf16 MFU | 125382 tok/s step 13144/19560 | loss 3.322050 (-0.25z)| norm 0.2586 (+0.72z)| lr 2.14e-04 | 4164.91 ms | 32.4% bf16 MFU | 125407 tok/s step 13145/19560 | loss 3.337686 (+0.18z)| norm 0.2279 (-1.85z)| lr 2.14e-04 | 4163.15 ms | 32.4% bf16 MFU | 125434 tok/s step 13146/19560 | loss 3.273125 (-1.60z)| norm 0.2605 (+0.87z)| lr 2.14e-04 | 4163.41 ms | 32.4% bf16 MFU | 125458 tok/s step 13147/19560 | loss 3.360274 (+0.81z)| norm 0.2432 (-0.57z)| lr 2.13e-04 | 4167.69 ms | 32.4% bf16 MFU | 125475 tok/s step 13148/19560 | loss 3.296596 (-0.94z)| norm 0.2660 (+1.33z)| lr 2.13e-04 | 4180.06 ms | 32.3% bf16 MFU | 125473 tok/s step 13149/19560 | loss 3.367285 (+1.02z)| norm 0.2608 (+0.90z)| lr 2.13e-04 | 4170.00 ms | 32.4% bf16 MFU | 125486 tok/s step 13150/19560 | loss 3.363624 (+0.91z)| norm 0.2464 (-0.29z)| lr 2.13e-04 | 4175.66 ms | 32.3% bf16 MFU | 125489 tok/s step 13151/19560 | loss 3.317388 (-0.37z)| norm 0.2644 (+1.24z)| lr 2.13e-04 | 4179.62 ms | 32.3% bf16 MFU | 125487 tok/s step 13152/19560 | loss 3.326029 (-0.14z)| norm 0.2572 (+0.62z)| lr 2.13e-04 | 4168.84 ms | 32.4% bf16 MFU | 125501 tok/s step 13153/19560 | loss 3.362350 (+0.87z)| norm 0.2499 (-0.00z)| lr 2.13e-04 | 4165.77 ms | 32.4% bf16 MFU | 125518 tok/s step 13154/19560 | loss 3.357995 (+0.75z)| norm 0.2558 (+0.51z)| lr 2.13e-04 | 4174.00 ms | 32.3% bf16 MFU | 125523 tok/s step 13155/19560 | loss 3.358282 (+0.74z)| norm 0.2414 (-0.72z)| lr 2.13e-04 | 4172.00 ms | 32.4% bf16 MFU | 125530 tok/s step 13156/19560 | loss 3.343544 (+0.33z)| norm 0.2423 (-0.63z)| lr 2.13e-04 | 4162.79 ms | 32.4% bf16 MFU | 125551 tok/s step 13157/19560 | loss 3.307016 (-0.70z)| norm 0.2595 (+0.84z)| 
lr 2.13e-04 | 4169.31 ms | 32.4% bf16 MFU | 125561 tok/s step 13158/19560 | loss 3.325859 (-0.17z)| norm 0.2458 (-0.34z)| lr 2.13e-04 | 4168.01 ms | 32.4% bf16 MFU | 125572 tok/s step 13159/19560 | loss 3.297380 (-0.96z)| norm 0.2432 (-0.57z)| lr 2.13e-04 | 4164.45 ms | 32.4% bf16 MFU | 125588 tok/s step 13160/19560 | loss 3.340317 (+0.24z)| norm 0.2575 (+0.66z)| lr 2.13e-04 | 4162.93 ms | 32.4% bf16 MFU | 125606 tok/s step 13161/19560 | loss 3.312092 (-0.54z)| norm 0.2338 (-1.36z)| lr 2.13e-04 | 4163.71 ms | 32.4% bf16 MFU | 125622 tok/s step 13162/19560 | loss 3.359159 (+0.78z)| norm 0.2699 (+1.71z)| lr 2.13e-04 | 4162.36 ms | 32.4% bf16 MFU | 125639 tok/s step 13163/19560 | loss 3.394166 (+1.76z)| norm 0.2396 (-0.87z)| lr 2.13e-04 | 4169.46 ms | 32.4% bf16 MFU | 125644 tok/s step 13164/19560 | loss 3.346471 (+0.39z)| norm 0.2632 (+1.12z)| lr 2.12e-04 | 4208.02 ms | 32.1% bf16 MFU | 125591 tok/s step 13165/19560 | loss 3.306294 (-0.75z)| norm 0.2541 (+0.35z)| lr 2.12e-04 | 4166.64 ms | 32.4% bf16 MFU | 125603 tok/s step 13166/19560 | loss 3.316092 (-0.47z)| norm 0.2627 (+1.07z)| lr 2.12e-04 | 4171.67 ms | 32.4% bf16 MFU | 125607 tok/s step 13167/19560 | loss 3.358219 (+0.72z)| norm 0.2969 (+3.71z)| lr 2.12e-04 | 4170.17 ms | 32.4% bf16 MFU | 125613 tok/s step 13168/19560 | loss 3.360985 (+0.80z)| norm 0.2494 (-0.09z)| lr 2.12e-04 | 4178.87 ms | 32.3% bf16 MFU | 125605 tok/s step 13169/19560 | loss 3.365523 (+0.91z)| norm 0.2854 (+2.70z)| lr 2.12e-04 | 4164.94 ms | 32.4% bf16 MFU | 125619 tok/s step 13170/19560 | loss 3.367141 (+0.95z)| norm 0.2295 (-1.65z)| lr 2.12e-04 | 4199.70 ms | 32.1% bf16 MFU | 125580 tok/s step 13171/19560 | loss 3.287529 (-1.32z)| norm 0.2692 (+1.40z)| lr 2.12e-04 | 4175.22 ms | 32.3% bf16 MFU | 125580 tok/s step 13172/19560 | loss 3.294963 (-1.12z)| norm 0.2459 (-0.41z)| lr 2.12e-04 | 4184.73 ms | 32.3% bf16 MFU | 125565 tok/s step 13173/19560 | loss 3.331163 (-0.08z)| norm 0.2449 (-0.49z)| lr 2.12e-04 | 4163.60 ms | 32.4% bf16 MFU | 
125583 tok/s step 13174/19560 | loss 3.348970 (+0.42z)| norm 0.2518 (+0.05z)| lr 2.12e-04 | 4166.33 ms | 32.4% bf16 MFU | 125596 tok/s step 13175/19560 | loss 3.341280 (+0.22z)| norm 0.2493 (-0.15z)| lr 2.12e-04 | 4177.16 ms | 32.3% bf16 MFU | 125592 tok/s step 13176/19560 | loss 3.338633 (+0.13z)| norm 0.2573 (+0.46z)| lr 2.12e-04 | 4179.01 ms | 32.3% bf16 MFU | 125585 tok/s step 13177/19560 | loss 3.345863 (+0.34z)| norm 0.2401 (-0.87z)| lr 2.12e-04 | 4174.91 ms | 32.3% bf16 MFU | 125585 tok/s step 13178/19560 | loss 3.324536 (-0.30z)| norm 0.2696 (+1.42z)| lr 2.12e-04 | 4185.53 ms | 32.3% bf16 MFU | 125569 tok/s step 13179/19560 | loss 3.297659 (-1.11z)| norm 0.2481 (-0.26z)| lr 2.12e-04 | 4175.29 ms | 32.3% bf16 MFU | 125569 tok/s step 13180/19560 | loss 3.310714 (-0.71z)| norm 0.2486 (-0.22z)| lr 2.12e-04 | 4164.61 ms | 32.4% bf16 MFU | 125585 tok/s step 13181/19560 | loss 3.392776 (+1.71z)| norm 0.2438 (-0.59z)| lr 2.11e-04 | 4168.83 ms | 32.4% bf16 MFU | 125594 tok/s step 13182/19560 | loss 3.357034 (+0.63z)| norm 0.2471 (-0.34z)| lr 2.11e-04 | 4170.97 ms | 32.4% bf16 MFU | 125599 tok/s step 13183/19560 | loss 3.343930 (+0.23z)| norm 0.2653 (+1.09z)| lr 2.11e-04 | 4162.02 ms | 32.4% bf16 MFU | 125617 tok/s step 13184/19560 | loss 3.370035 (+1.00z)| norm 0.2602 (+0.67z)| lr 2.11e-04 | 4173.40 ms | 32.4% bf16 MFU | 125618 tok/s step 13185/19560 | loss 3.241160 (-2.79z)| norm 0.2482 (-0.27z)| lr 2.11e-04 | 4167.79 ms | 32.4% bf16 MFU | 125627 tok/s step 13186/19560 | loss 3.305787 (-0.88z)| norm 0.2451 (-0.52z)| lr 2.11e-04 | 4160.64 ms | 32.5% bf16 MFU | 125646 tok/s step 13187/19560 | loss 3.319073 (-0.48z)| norm 0.2410 (-0.84z)| lr 2.11e-04 | 4177.09 ms | 32.3% bf16 MFU | 125639 tok/s step 13188/19560 | loss 3.421194 (+2.43z)| norm 0.2611 (+0.75z)| lr 2.11e-04 | 4179.01 ms | 32.3% bf16 MFU | 125630 tok/s step 13189/19560 | loss 3.400779 (+1.81z)| norm 0.2458 (-0.46z)| lr 2.11e-04 | 4170.25 ms | 32.4% bf16 MFU | 125635 tok/s step 13190/19560 | loss 3.348115 
(+0.33z)| norm 0.2599 (+0.64z)| lr 2.11e-04 | 4170.97 ms | 32.4% bf16 MFU | 125638 tok/s step 13191/19560 | loss 3.274473 (-1.74z)| norm 0.2484 (-0.27z)| lr 2.11e-04 | 4171.26 ms | 32.4% bf16 MFU | 125641 tok/s step 13192/19560 | loss 3.353236 (+0.49z)| norm 0.2533 (+0.12z)| lr 2.11e-04 | 4177.91 ms | 32.3% bf16 MFU | 125633 tok/s step 13193/19560 | loss 3.363099 (+0.78z)| norm 0.2482 (-0.27z)| lr 2.11e-04 | 4166.26 ms | 32.4% bf16 MFU | 125644 tok/s step 13194/19560 | loss 3.303281 (-0.91z)| norm 0.2485 (-0.24z)| lr 2.11e-04 | 4170.34 ms | 32.4% bf16 MFU | 125647 tok/s step 13195/19560 | loss 3.425015 (+2.50z)| norm 0.2545 (+0.24z)| lr 2.11e-04 | 4164.34 ms | 32.4% bf16 MFU | 125660 tok/s step 13196/19560 | loss 3.353877 (+0.49z)| norm 0.3007 (+3.79z)| lr 2.11e-04 | 4183.03 ms | 32.3% bf16 MFU | 125644 tok/s step 13197/19560 | loss 3.309291 (-0.76z)| norm 0.2635 (+0.90z)| lr 2.10e-04 | 4172.15 ms | 32.4% bf16 MFU | 125645 tok/s step 13198/19560 | loss 3.267663 (-1.89z)| norm 0.2658 (+1.08z)| lr 2.10e-04 | 4187.27 ms | 32.2% bf16 MFU | 125623 tok/s step 13199/19560 | loss 3.325649 (-0.28z)| norm 0.2603 (+0.66z)| lr 2.10e-04 | 4164.48 ms | 32.4% bf16 MFU | 125637 tok/s step 13200/19560 | loss 3.329026 (-0.19z)| norm 0.2483 (-0.27z)| lr 2.10e-04 | 4165.52 ms | 32.4% bf16 MFU | 125648 tok/s step 13201/19560 | loss 3.326257 (-0.26z)| norm 0.2393 (-0.96z)| lr 2.10e-04 | 4164.28 ms | 32.4% bf16 MFU | 125661 tok/s step 13202/19560 | loss 3.306863 (-0.79z)| norm 0.2542 (+0.19z)| lr 2.10e-04 | 4177.27 ms | 32.3% bf16 MFU | 125653 tok/s step 13203/19560 | loss 3.359528 (+0.67z)| norm 0.2287 (-1.74z)| lr 2.10e-04 | 4159.93 ms | 32.5% bf16 MFU | 125672 tok/s step 13204/19560 | loss 3.322679 (-0.35z)| norm 0.2709 (+1.47z)| lr 2.10e-04 | 4167.49 ms | 32.4% bf16 MFU | 125679 tok/s step 13205/19560 | loss 3.288312 (-1.30z)| norm 0.2398 (-0.91z)| lr 2.10e-04 | 4172.65 ms | 32.4% bf16 MFU | 125677 tok/s step 13206/19560 | loss 3.326962 (-0.22z)| norm 0.2463 (-0.39z)| lr 2.10e-04 | 
4170.28 ms | 32.4% bf16 MFU | 125679 tok/s step 13207/19560 | loss 3.325574 (-0.25z)| norm 0.2312 (-1.52z)| lr 2.10e-04 | 4171.73 ms | 32.4% bf16 MFU | 125679 tok/s step 13208/19560 | loss 3.425451 (+2.51z)| norm 0.2594 (+0.60z)| lr 2.10e-04 | 4165.64 ms | 32.4% bf16 MFU | 125688 tok/s step 13209/19560 | loss 3.301365 (-0.93z)| norm 0.2513 (-0.01z)| lr 2.10e-04 | 4179.42 ms | 32.3% bf16 MFU | 125676 tok/s step 13210/19560 | loss 3.300569 (-0.95z)| norm 0.2447 (-0.52z)| lr 2.10e-04 | 4174.82 ms | 32.3% bf16 MFU | 125671 tok/s step 13211/19560 | loss 3.327146 (-0.21z)| norm 0.2473 (-0.31z)| lr 2.10e-04 | 4172.78 ms | 32.4% bf16 MFU | 125670 tok/s step 13212/19560 | loss 3.278037 (-1.57z)| norm 0.2424 (-0.68z)| lr 2.10e-04 | 4177.21 ms | 32.3% bf16 MFU | 125662 tok/s step 13213/19560 | loss 3.324339 (-0.30z)| norm 0.2418 (-0.71z)| lr 2.10e-04 | 4167.42 ms | 32.4% bf16 MFU | 125669 tok/s step 13214/19560 | loss 3.381620 (+1.30z)| norm 0.2552 (+0.32z)| lr 2.09e-04 | 4169.49 ms | 32.4% bf16 MFU | 125673 tok/s step 13215/19560 | loss 3.364522 (+0.84z)| norm 0.2495 (-0.12z)| lr 2.09e-04 | 4173.94 ms | 32.3% bf16 MFU | 125670 tok/s step 13216/19560 | loss 3.398855 (+1.77z)| norm 0.2516 (+0.06z)| lr 2.09e-04 | 4168.04 ms | 32.4% bf16 MFU | 125676 tok/s step 13217/19560 | loss 3.306565 (-0.79z)| norm 0.2534 (+0.19z)| lr 2.09e-04 | 4161.80 ms | 32.4% bf16 MFU | 125691 tok/s step 13218/19560 | loss 3.338562 (+0.10z)| norm 0.2592 (+0.64z)| lr 2.09e-04 | 4162.22 ms | 32.4% bf16 MFU | 125704 tok/s step 13219/19560 | loss 3.354254 (+0.52z)| norm 0.2406 (-0.81z)| lr 2.09e-04 | 4173.57 ms | 32.4% bf16 MFU | 125700 tok/s step 13220/19560 | loss 3.410643 (+2.07z)| norm 0.2582 (+0.56z)| lr 2.09e-04 | 4172.40 ms | 32.4% bf16 MFU | 125698 tok/s step 13221/19560 | loss 3.299236 (-1.00z)| norm 0.2569 (+0.46z)| lr 2.09e-04 | 4165.12 ms | 32.4% bf16 MFU | 125707 tok/s step 13222/19560 | loss 3.335927 (+0.02z)| norm 0.2492 (-0.14z)| lr 2.09e-04 | 4173.61 ms | 32.4% bf16 MFU | 125703 tok/s step 
13223/19560 | loss 3.314054 (-0.58z)| norm 0.2728 (+1.68z)| lr 2.09e-04 | 4178.86 ms | 32.3% bf16 MFU | 125691 tok/s step 13224/19560 | loss 3.359504 (+0.68z)| norm 0.2368 (-1.09z)| lr 2.09e-04 | 4164.37 ms | 32.4% bf16 MFU | 125701 tok/s step 13225/19560 | loss 3.283349 (-1.42z)| norm 0.2484 (-0.20z)| lr 2.09e-04 | 4170.28 ms | 32.4% bf16 MFU | 125702 tok/s step 13226/19560 | loss 3.332183 (-0.09z)| norm 0.2384 (-0.96z)| lr 2.09e-04 | 4169.03 ms | 32.4% bf16 MFU | 125705 tok/s step 13227/19560 | loss 3.312856 (-0.63z)| norm 0.2443 (-0.50z)| lr 2.09e-04 | 4169.77 ms | 32.4% bf16 MFU | 125706 tok/s step 13228/19560 | loss 3.295752 (-1.12z)| norm 0.2336 (-1.33z)| lr 2.09e-04 | 4166.99 ms | 32.4% bf16 MFU | 125712 tok/s step 13229/19560 | loss 3.380074 (+1.30z)| norm 0.2501 (-0.05z)| lr 2.09e-04 | 4166.64 ms | 32.4% bf16 MFU | 125718 tok/s step 13230/19560 | loss 3.419355 (+2.38z)| norm 0.2348 (-1.25z)| lr 2.09e-04 | 4172.68 ms | 32.4% bf16 MFU | 125714 tok/s step 13231/19560 | loss 3.325571 (-0.29z)| norm 0.2401 (-0.83z)| lr 2.08e-04 | 4173.62 ms | 32.4% bf16 MFU | 125710 tok/s step 13232/19560 | loss 3.407959 (+2.01z)| norm 0.2601 (+0.71z)| lr 2.08e-04 | 4168.00 ms | 32.4% bf16 MFU | 125714 tok/s step 13233/19560 | loss 3.389894 (+1.48z)| norm 0.2376 (-1.04z)| lr 2.08e-04 | 4176.14 ms | 32.3% bf16 MFU | 125705 tok/s step 13234/19560 | loss 3.370224 (+0.92z)| norm 0.2453 (-0.44z)| lr 2.08e-04 | 4191.47 ms | 32.2% bf16 MFU | 125674 tok/s step 13235/19560 | loss 3.314340 (-0.63z)| norm 0.2445 (-0.51z)| lr 2.08e-04 | 4161.65 ms | 32.4% bf16 MFU | 125689 tok/s step 13236/19560 | loss 3.420920 (+2.28z)| norm 0.2488 (-0.17z)| lr 2.08e-04 | 4163.77 ms | 32.4% bf16 MFU | 125701 tok/s step 13237/19560 | loss 3.315182 (-0.60z)| norm 0.2402 (-0.84z)| lr 2.08e-04 | 4162.90 ms | 32.4% bf16 MFU | 125713 tok/s step 13238/19560 | loss 3.325238 (-0.32z)| norm 0.2373 (-1.06z)| lr 2.08e-04 | 4180.51 ms | 32.3% bf16 MFU | 125698 tok/s step 13239/19560 | loss 3.326927 (-0.27z)| norm 
0.2536 (+0.22z)| lr 2.08e-04 | 4168.47 ms | 32.4% bf16 MFU | 125702 tok/s step 13240/19560 | loss 3.280772 (-1.52z)| norm 0.2349 (-1.24z)| lr 2.08e-04 | 4169.50 ms | 32.4% bf16 MFU | 125704 tok/s step 13241/19560 | loss 3.307606 (-0.78z)| norm 0.2502 (-0.05z)| lr 2.08e-04 | 4166.31 ms | 32.4% bf16 MFU | 125711 tok/s step 13242/19560 | loss 3.280680 (-1.48z)| norm 0.2638 (+1.00z)| lr 2.08e-04 | 4161.32 ms | 32.4% bf16 MFU | 125725 tok/s step 13243/19560 | loss 3.316927 (-0.50z)| norm 0.2656 (+1.12z)| lr 2.08e-04 | 4178.17 ms | 32.3% bf16 MFU | 125713 tok/s step 13244/19560 | loss 3.308981 (-0.71z)| norm 0.2605 (+0.73z)| lr 2.08e-04 | 4159.05 ms | 32.5% bf16 MFU | 125730 tok/s step 13245/19560 | loss 3.328628 (-0.18z)| norm 0.2639 (+0.99z)| lr 2.08e-04 | 4170.30 ms | 32.4% bf16 MFU | 125729 tok/s step 13246/19560 | loss 3.445821 (+2.87z)| norm 0.2655 (+1.10z)| lr 2.08e-04 | 4175.39 ms | 32.3% bf16 MFU | 125721 tok/s step 13247/19560 | loss 3.306508 (-0.77z)| norm 0.2678 (+1.26z)| lr 2.08e-04 | 4163.71 ms | 32.4% bf16 MFU | 125731 tok/s step 13248/19560 | loss 3.306583 (-0.75z)| norm 0.2475 (-0.32z)| lr 2.07e-04 | 4159.25 ms | 32.5% bf16 MFU | 125747 tok/s step 13249/19560 | loss 3.376175 (+1.04z)| norm 0.2489 (-0.22z)| lr 2.07e-04 | 4168.36 ms | 32.4% bf16 MFU | 125749 tok/s step 13250/19560 | loss 3.337575 (+0.04z)| norm 0.2907 (+2.91z)| lr 2.07e-04 | 4165.61 ms | 32.4% bf16 MFU | 125754 tok/s val loss 3.313047 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating 
HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 
820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2967/10042 = 0.295459 step 13251/19560 | loss 3.338703 (+0.07z)| norm 0.2646 (+0.94z)| lr 2.07e-04 | 4172.13 ms | 32.4% bf16 MFU | 125750 tok/s step 13252/19560 | loss 3.294741 (-1.06z)| norm 0.2655 (+0.99z)| lr 2.07e-04 | 4171.45 ms | 32.4% bf16 MFU | 125747 tok/s step 13253/19560 | loss 3.368728 (+0.85z)| norm 0.2650 (+0.94z)| lr 2.07e-04 | 4195.86 ms | 32.2% bf16 MFU | 125707 tok/s step 13254/19560 | loss 3.396973 (+1.55z)| norm 0.2653 (+0.95z)| lr 2.07e-04 | 4166.82 ms | 32.4% bf16 MFU | 125713 tok/s step 13255/19560 | loss 3.368223 (+0.80z)| norm 0.2704 (+1.33z)| lr 2.07e-04 | 4178.57 ms | 32.3% bf16 MFU | 
125701 tok/s step 13256/19560 | loss 3.287843 (-1.25z)| norm 0.2549 (+0.11z)| lr 2.07e-04 | 4157.66 ms | 32.5% bf16 MFU | 125721 tok/s step 13257/19560 | loss 3.314670 (-0.56z)| norm 0.2820 (+2.17z)| lr 2.07e-04 | 4168.76 ms | 32.4% bf16 MFU | 125723 tok/s step 13258/19560 | loss 3.315712 (-0.53z)| norm 0.2482 (-0.43z)| lr 2.07e-04 | 4164.48 ms | 32.4% bf16 MFU | 125732 tok/s step 13259/19560 | loss 3.360839 (+0.62z)| norm 0.2511 (-0.20z)| lr 2.07e-04 | 4161.20 ms | 32.4% bf16 MFU | 125745 tok/s step 13260/19560 | loss 3.343801 (+0.20z)| norm 0.2770 (+1.77z)| lr 2.07e-04 | 4206.84 ms | 32.1% bf16 MFU | 125689 tok/s step 13261/19560 | loss 3.407502 (+1.81z)| norm 0.2781 (+1.82z)| lr 2.07e-04 | 4164.61 ms | 32.4% bf16 MFU | 125699 tok/s step 13262/19560 | loss 3.303893 (-0.85z)| norm 0.2721 (+1.37z)| lr 2.07e-04 | 4158.38 ms | 32.5% bf16 MFU | 125718 tok/s step 13263/19560 | loss 3.327029 (-0.26z)| norm 0.2477 (-0.47z)| lr 2.07e-04 | 4296.52 ms | 31.4% bf16 MFU | 125533 tok/s step 13264/19560 | loss 3.366609 (+0.76z)| norm 0.2423 (-0.87z)| lr 2.07e-04 | 4813.58 ms | 28.0% bf16 MFU | 124703 tok/s step 13265/19560 | loss 3.269076 (-1.71z)| norm 0.2639 (+0.77z)| lr 2.06e-04 | 4563.26 ms | 29.6% bf16 MFU | 124212 tok/s step 13266/19560 | loss 3.310220 (-0.67z)| norm 0.2358 (-1.35z)| lr 2.06e-04 | 4370.96 ms | 30.9% bf16 MFU | 123999 tok/s step 13267/19560 | loss 3.261984 (-1.86z)| norm 0.2510 (-0.20z)| lr 2.06e-04 | 4261.02 ms | 31.7% bf16 MFU | 123951 tok/s step 13268/19560 | loss 3.328870 (-0.19z)| norm 0.2383 (-1.14z)| lr 2.06e-04 | 4211.04 ms | 32.1% bf16 MFU | 123979 tok/s step 13269/19560 | loss 3.336293 (+0.01z)| norm 0.2351 (-1.36z)| lr 2.06e-04 | 4334.57 ms | 31.1% bf16 MFU | 123828 tok/s step 13270/19560 | loss 3.351092 (+0.38z)| norm 0.2462 (-0.52z)| lr 2.06e-04 | 4517.58 ms | 29.9% bf16 MFU | 123439 tok/s step 13271/19560 | loss 3.357448 (+0.54z)| norm 0.2327 (-1.50z)| lr 2.06e-04 | 4542.25 ms | 29.7% bf16 MFU | 123038 tok/s step 13272/19560 | loss 3.291592 
(-1.14z)| norm 0.2497 (-0.24z)| lr 2.06e-04 | 4164.01 ms | 32.4% bf16 MFU | 123182 tok/s step 13273/19560 | loss 3.301924 (-0.87z)| norm 0.2280 (-1.85z)| lr 2.06e-04 | 4224.21 ms | 32.0% bf16 MFU | 123229 tok/s step 13274/19560 | loss 3.308192 (-0.72z)| norm 0.2539 (+0.07z)| lr 2.06e-04 | 4209.17 ms | 32.1% bf16 MFU | 123295 tok/s step 13275/19560 | loss 3.354388 (+0.47z)| norm 0.2360 (-1.24z)| lr 2.06e-04 | 4163.61 ms | 32.4% bf16 MFU | 123426 tok/s step 13276/19560 | loss 3.290775 (-1.17z)| norm 0.2515 (-0.09z)| lr 2.06e-04 | 4203.39 ms | 32.1% bf16 MFU | 123492 tok/s step 13277/19560 | loss 3.331114 (-0.12z)| norm 0.2381 (-1.07z)| lr 2.06e-04 | 4214.66 ms | 32.0% bf16 MFU | 123537 tok/s step 13278/19560 | loss 3.376605 (+1.04z)| norm 0.2529 (+0.02z)| lr 2.06e-04 | 4168.59 ms | 32.4% bf16 MFU | 123648 tok/s step 13279/19560 | loss 3.385354 (+1.25z)| norm 0.2462 (-0.46z)| lr 2.06e-04 | 4174.32 ms | 32.3% bf16 MFU | 123746 tok/s step 13280/19560 | loss 3.346349 (+0.25z)| norm 0.2596 (+0.53z)| lr 2.06e-04 | 4172.00 ms | 32.4% bf16 MFU | 123842 tok/s step 13281/19560 | loss 3.306852 (-0.75z)| norm 0.2366 (-1.16z)| lr 2.05e-04 | 4174.65 ms | 32.3% bf16 MFU | 123929 tok/s step 13282/19560 | loss 3.302029 (-0.86z)| norm 0.2433 (-0.66z)| lr 2.05e-04 | 4213.58 ms | 32.0% bf16 MFU | 123954 tok/s step 13283/19560 | loss 3.268294 (-1.69z)| norm 0.2611 (+0.64z)| lr 2.05e-04 | 4169.93 ms | 32.4% bf16 MFU | 124043 tok/s step 13284/19560 | loss 3.301089 (-0.85z)| norm 0.2343 (-1.33z)| lr 2.05e-04 | 4232.11 ms | 31.9% bf16 MFU | 124035 tok/s step 13285/19560 | loss 3.285898 (-1.22z)| norm 0.2362 (-1.16z)| lr 2.05e-04 | 4229.95 ms | 31.9% bf16 MFU | 124031 tok/s step 13286/19560 | loss 3.313832 (-0.52z)| norm 0.2450 (-0.52z)| lr 2.05e-04 | 4175.89 ms | 32.3% bf16 MFU | 124107 tok/s step 13287/19560 | loss 3.309636 (-0.63z)| norm 0.2343 (-1.29z)| lr 2.05e-04 | 4165.65 ms | 32.4% bf16 MFU | 124194 tok/s step 13288/19560 | loss 3.312228 (-0.56z)| norm 0.2364 (-1.13z)| lr 2.05e-04 | 
4175.08 ms | 32.3% bf16 MFU | 124263 tok/s step 13289/19560 | loss 3.313048 (-0.54z)| norm 0.2307 (-1.53z)| lr 2.05e-04 | 4165.75 ms | 32.4% bf16 MFU | 124343 tok/s step 13290/19560 | loss 3.294366 (-0.99z)| norm 0.2454 (-0.46z)| lr 2.05e-04 | 4177.68 ms | 32.3% bf16 MFU | 124401 tok/s step 13291/19560 | loss 3.268972 (-1.60z)| norm 0.2377 (-1.02z)| lr 2.05e-04 | 4171.94 ms | 32.4% bf16 MFU | 124464 tok/s step 13292/19560 | loss 3.322882 (-0.25z)| norm 0.2448 (-0.49z)| lr 2.05e-04 | 4167.23 ms | 32.4% bf16 MFU | 124532 tok/s step 13293/19560 | loss 3.267704 (-1.61z)| norm 0.2333 (-1.31z)| lr 2.05e-04 | 4198.40 ms | 32.2% bf16 MFU | 124549 tok/s step 13294/19560 | loss 3.309519 (-0.57z)| norm 0.2512 (-0.00z)| lr 2.05e-04 | 4160.43 ms | 32.5% bf16 MFU | 124622 tok/s step 13295/19560 | loss 3.314303 (-0.44z)| norm 0.2420 (-0.67z)| lr 2.05e-04 | 4165.44 ms | 32.4% bf16 MFU | 124685 tok/s step 13296/19560 | loss 3.265783 (-1.61z)| norm 0.2606 (+0.73z)| lr 2.05e-04 | 4159.94 ms | 32.5% bf16 MFU | 124752 tok/s step 13297/19560 | loss 3.348165 (+0.42z)| norm 0.2477 (-0.23z)| lr 2.05e-04 | 4166.36 ms | 32.4% bf16 MFU | 124806 tok/s step 13298/19560 | loss 3.327105 (-0.09z)| norm 0.2613 (+0.82z)| lr 2.04e-04 | 4168.85 ms | 32.4% bf16 MFU | 124854 tok/s step 13299/19560 | loss 3.334766 (+0.09z)| norm 0.2552 (+0.35z)| lr 2.04e-04 | 4162.43 ms | 32.4% bf16 MFU | 124909 tok/s step 13300/19560 | loss 3.280872 (-1.24z)| norm 0.2328 (-1.40z)| lr 2.04e-04 | 4165.27 ms | 32.4% bf16 MFU | 124957 tok/s step 13301/19560 | loss 3.325064 (-0.15z)| norm 0.2561 (+0.42z)| lr 2.04e-04 | 4175.98 ms | 32.3% bf16 MFU | 124987 tok/s step 13302/19560 | loss 3.315162 (-0.39z)| norm 0.2241 (-2.04z)| lr 2.04e-04 | 4165.43 ms | 32.4% bf16 MFU | 125031 tok/s step 13303/19560 | loss 3.304216 (-0.65z)| norm 0.2382 (-0.94z)| lr 2.04e-04 | 4178.83 ms | 32.3% bf16 MFU | 125053 tok/s step 13304/19560 | loss 3.356222 (+0.63z)| norm 0.2518 (+0.10z)| lr 2.04e-04 | 4170.95 ms | 32.4% bf16 MFU | 125085 tok/s step 
13305/19560 | loss 3.267919 (-1.52z)| norm 0.2620 (+0.88z)| lr 2.04e-04 | 4171.99 ms | 32.4% bf16 MFU | 125114 tok/s step 13306/19560 | loss 3.311160 (-0.46z)| norm 0.2722 (+1.65z)| lr 2.04e-04 | 4169.28 ms | 32.4% bf16 MFU | 125146 tok/s step 13307/19560 | loss 3.365944 (+0.87z)| norm 0.2537 (+0.23z)| lr 2.04e-04 | 4209.36 ms | 32.1% bf16 MFU | 125116 tok/s step 13308/19560 | loss 3.316389 (-0.35z)| norm 0.3002 (+3.57z)| lr 2.04e-04 | 4167.98 ms | 32.4% bf16 MFU | 125150 tok/s step 13309/19560 | loss 3.299366 (-0.75z)| norm 0.2502 (-0.07z)| lr 2.04e-04 | 4172.86 ms | 32.4% bf16 MFU | 125175 tok/s step 13310/19560 | loss 3.293219 (-0.89z)| norm 0.2867 (+2.50z)| lr 2.04e-04 | 4191.09 ms | 32.2% bf16 MFU | 125171 tok/s step 13311/19560 | loss 3.319160 (-0.24z)| norm 0.2802 (+2.01z)| lr 2.04e-04 | 4177.49 ms | 32.3% bf16 MFU | 125187 tok/s step 13312/19560 | loss 3.363848 (+0.86z)| norm 0.2830 (+2.15z)| lr 2.04e-04 | 4161.75 ms | 32.4% bf16 MFU | 125227 tok/s step 13313/19560 | loss 3.369908 (+1.00z)| norm 0.2728 (+1.43z)| lr 2.04e-04 | 4171.33 ms | 32.4% bf16 MFU | 125250 tok/s step 13314/19560 | loss 3.298058 (-0.80z)| norm 0.2741 (+1.49z)| lr 2.04e-04 | 4173.63 ms | 32.4% bf16 MFU | 125268 tok/s step 13315/19560 | loss 3.303413 (-0.66z)| norm 0.2642 (+0.80z)| lr 2.03e-04 | 4166.67 ms | 32.4% bf16 MFU | 125296 tok/s step 13316/19560 | loss 3.356962 (+0.70z)| norm 0.2725 (+1.35z)| lr 2.03e-04 | 4168.35 ms | 32.4% bf16 MFU | 125320 tok/s step 13317/19560 | loss 3.365700 (+0.94z)| norm 0.2725 (+1.33z)| lr 2.03e-04 | 4175.58 ms | 32.3% bf16 MFU | 125332 tok/s step 13318/19560 | loss 3.297697 (-0.79z)| norm 0.2495 (-0.21z)| lr 2.03e-04 | 4169.11 ms | 32.4% bf16 MFU | 125354 tok/s step 13319/19560 | loss 3.291935 (-0.95z)| norm 0.2534 (+0.05z)| lr 2.03e-04 | 4189.31 ms | 32.2% bf16 MFU | 125343 tok/s step 13320/19560 | loss 3.328553 (-0.00z)| norm 0.2611 (+0.57z)| lr 2.03e-04 | 4170.77 ms | 32.4% bf16 MFU | 125361 tok/s step 13321/19560 | loss 3.295914 (-0.83z)| norm 
0.2471 (-0.37z)| lr 2.03e-04 | 4164.62 ms | 32.4% bf16 MFU | 125388 tok/s step 13322/19560 | loss 3.321616 (-0.17z)| norm 0.2492 (-0.23z)| lr 2.03e-04 | 4190.39 ms | 32.2% bf16 MFU | 125374 tok/s step 13323/19560 | loss 3.311143 (-0.43z)| norm 0.2575 (+0.32z)| lr 2.03e-04 | 4167.09 ms | 32.4% bf16 MFU | 125396 tok/s step 13324/19560 | loss 3.285008 (-1.10z)| norm 0.2415 (-0.75z)| lr 2.03e-04 | 4170.80 ms | 32.4% bf16 MFU | 125412 tok/s step 13325/19560 | loss 3.315208 (-0.31z)| norm 0.2584 (+0.43z)| lr 2.03e-04 | 4175.01 ms | 32.3% bf16 MFU | 125420 tok/s step 13326/19560 | loss 3.351225 (+0.63z)| norm 0.2580 (+0.41z)| lr 2.03e-04 | 4179.09 ms | 32.3% bf16 MFU | 125422 tok/s step 13327/19560 | loss 3.359679 (+0.85z)| norm 0.2482 (-0.27z)| lr 2.03e-04 | 4163.53 ms | 32.4% bf16 MFU | 125447 tok/s step 13328/19560 | loss 3.328012 (+0.01z)| norm 0.2521 (+0.01z)| lr 2.03e-04 | 4170.20 ms | 32.4% bf16 MFU | 125461 tok/s step 13329/19560 | loss 3.257061 (-1.84z)| norm 0.2417 (-0.73z)| lr 2.03e-04 | 4164.89 ms | 32.4% bf16 MFU | 125482 tok/s step 13330/19560 | loss 3.311185 (-0.42z)| norm 0.2470 (-0.35z)| lr 2.03e-04 | 4161.49 ms | 32.4% bf16 MFU | 125507 tok/s step 13331/19560 | loss 3.384250 (+1.48z)| norm 0.2531 (+0.06z)| lr 2.03e-04 | 4172.19 ms | 32.4% bf16 MFU | 125515 tok/s step 13332/19560 | loss 3.300029 (-0.71z)| norm 0.2444 (-0.54z)| lr 2.02e-04 | 4165.58 ms | 32.4% bf16 MFU | 125532 tok/s step 13333/19560 | loss 3.294203 (-0.86z)| norm 0.2382 (-0.98z)| lr 2.02e-04 | 4234.27 ms | 31.9% bf16 MFU | 125447 tok/s step 13334/19560 | loss 3.322176 (-0.13z)| norm 0.2555 (+0.24z)| lr 2.02e-04 | 4162.54 ms | 32.4% bf16 MFU | 125472 tok/s step 13335/19560 | loss 3.355108 (+0.72z)| norm 0.2518 (-0.03z)| lr 2.02e-04 | 4174.68 ms | 32.3% bf16 MFU | 125478 tok/s step 13336/19560 | loss 3.306754 (-0.53z)| norm 0.2498 (-0.17z)| lr 2.02e-04 | 4175.32 ms | 32.3% bf16 MFU | 125482 tok/s step 13337/19560 | loss 3.342796 (+0.42z)| norm 0.2860 (+2.36z)| lr 2.02e-04 | 4175.08 ms | 
32.3% bf16 MFU | 125487 tok/s step 13338/19560 | loss 3.367306 (+1.06z)| norm 0.2422 (-0.71z)| lr 2.02e-04 | 4166.21 ms | 32.4% bf16 MFU | 125505 tok/s step 13339/19560 | loss 3.313692 (-0.36z)| norm 0.2838 (+2.15z)| lr 2.02e-04 | 4162.20 ms | 32.4% bf16 MFU | 125528 tok/s step 13340/19560 | loss 3.352059 (+0.65z)| norm 0.2510 (-0.12z)| lr 2.02e-04 | 4163.93 ms | 32.4% bf16 MFU | 125547 tok/s step 13341/19560 | loss 3.330903 (+0.08z)| norm 0.2851 (+2.18z)| lr 2.02e-04 | 4169.72 ms | 32.4% bf16 MFU | 125556 tok/s step 13342/19560 | loss 3.300211 (-0.73z)| norm 0.2459 (-0.48z)| lr 2.02e-04 | 4171.39 ms | 32.4% bf16 MFU | 125563 tok/s step 13343/19560 | loss 3.290108 (-0.99z)| norm 0.2971 (+2.87z)| lr 2.02e-04 | 4166.09 ms | 32.4% bf16 MFU | 125577 tok/s step 13344/19560 | loss 3.309035 (-0.46z)| norm 0.2520 (-0.09z)| lr 2.02e-04 | 4165.15 ms | 32.4% bf16 MFU | 125592 tok/s step 13345/19560 | loss 3.305237 (-0.57z)| norm 0.2747 (+1.38z)| lr 2.02e-04 | 4168.84 ms | 32.4% bf16 MFU | 125601 tok/s step 13346/19560 | loss 3.278398 (-1.28z)| norm 0.2489 (-0.30z)| lr 2.02e-04 | 4190.68 ms | 32.2% bf16 MFU | 125576 tok/s step 13347/19560 | loss 3.341599 (+0.44z)| norm 0.2608 (+0.47z)| lr 2.02e-04 | 4181.87 ms | 32.3% bf16 MFU | 125566 tok/s step 13348/19560 | loss 3.293515 (-0.86z)| norm 0.2786 (+1.61z)| lr 2.02e-04 | 4170.52 ms | 32.4% bf16 MFU | 125573 tok/s step 13349/19560 | loss 3.317332 (-0.20z)| norm 0.2520 (-0.11z)| lr 2.01e-04 | 4164.41 ms | 32.4% bf16 MFU | 125589 tok/s step 13350/19560 | loss 3.355772 (+0.86z)| norm 0.3069 (+3.27z)| lr 2.01e-04 | 4165.76 ms | 32.4% bf16 MFU | 125603 tok/s step 13351/19560 | loss 3.313805 (-0.31z)| norm 0.2699 (+0.98z)| lr 2.01e-04 | 4168.64 ms | 32.4% bf16 MFU | 125611 tok/s step 13352/19560 | loss 3.327815 (+0.09z)| norm 0.2919 (+2.28z)| lr 2.01e-04 | 4171.47 ms | 32.4% bf16 MFU | 125615 tok/s step 13353/19560 | loss 3.307110 (-0.49z)| norm 0.2662 (+0.70z)| lr 2.01e-04 | 4170.61 ms | 32.4% bf16 MFU | 125619 tok/s step 13354/19560 
| loss 3.288099 (-1.01z)| norm 0.2577 (+0.17z)| lr 2.01e-04 | 4166.53 ms | 32.4% bf16 MFU | 125630 tok/s step 13355/19560 | loss 3.371501 (+1.29z)| norm 0.2828 (+1.67z)| lr 2.01e-04 | 4154.97 ms | 32.5% bf16 MFU | 125658 tok/s step 13356/19560 | loss 3.276264 (-1.33z)| norm 0.2798 (+1.47z)| lr 2.01e-04 | 4165.66 ms | 32.4% bf16 MFU | 125668 tok/s step 13357/19560 | loss 3.309713 (-0.40z)| norm 0.2604 (+0.29z)| lr 2.01e-04 | 4174.22 ms | 32.3% bf16 MFU | 125665 tok/s step 13358/19560 | loss 3.371257 (+1.34z)| norm 0.2618 (+0.37z)| lr 2.01e-04 | 4167.74 ms | 32.4% bf16 MFU | 125671 tok/s step 13359/19560 | loss 3.327358 (+0.10z)| norm 0.2664 (+0.63z)| lr 2.01e-04 | 4177.43 ms | 32.3% bf16 MFU | 125663 tok/s step 13360/19560 | loss 3.332675 (+0.27z)| norm 0.2674 (+0.69z)| lr 2.01e-04 | 4170.51 ms | 32.4% bf16 MFU | 125665 tok/s step 13361/19560 | loss 3.333686 (+0.32z)| norm 0.2880 (+1.90z)| lr 2.01e-04 | 4164.69 ms | 32.4% bf16 MFU | 125677 tok/s step 13362/19560 | loss 3.321235 (-0.04z)| norm 0.2659 (+0.56z)| lr 2.01e-04 | 4190.87 ms | 32.2% bf16 MFU | 125648 tok/s step 13363/19560 | loss 3.375563 (+1.54z)| norm 0.2764 (+1.17z)| lr 2.01e-04 | 4170.57 ms | 32.4% bf16 MFU | 125651 tok/s step 13364/19560 | loss 3.401045 (+2.32z)| norm 0.2570 (+0.00z)| lr 2.01e-04 | 4169.09 ms | 32.4% bf16 MFU | 125656 tok/s step 13365/19560 | loss 3.351766 (+0.85z)| norm 0.2760 (+1.13z)| lr 2.01e-04 | 4168.82 ms | 32.4% bf16 MFU | 125662 tok/s step 13366/19560 | loss 3.311184 (-0.35z)| norm 0.2350 (-1.33z)| lr 2.00e-04 | 4188.47 ms | 32.2% bf16 MFU | 125637 tok/s step 13367/19560 | loss 3.348517 (+0.75z)| norm 0.2819 (+1.46z)| lr 2.00e-04 | 4166.27 ms | 32.4% bf16 MFU | 125647 tok/s step 13368/19560 | loss 3.279395 (-1.29z)| norm 0.2382 (-1.15z)| lr 2.00e-04 | 4166.65 ms | 32.4% bf16 MFU | 125657 tok/s step 13369/19560 | loss 3.352483 (+0.86z)| norm 0.2762 (+1.10z)| lr 2.00e-04 | 4174.05 ms | 32.3% bf16 MFU | 125654 tok/s step 13370/19560 | loss 3.368364 (+1.30z)| norm 0.2549 (-0.16z)| 
lr 2.00e-04 | 4173.80 ms | 32.3% bf16 MFU | 125652 tok/s step 13371/19560 | loss 3.310133 (-0.41z)| norm 0.2642 (+0.39z)| lr 2.00e-04 | 4168.97 ms | 32.4% bf16 MFU | 125657 tok/s step 13372/19560 | loss 3.318067 (-0.18z)| norm 0.2719 (+0.84z)| lr 2.00e-04 | 4167.70 ms | 32.4% bf16 MFU | 125664 tok/s step 13373/19560 | loss 3.281010 (-1.25z)| norm 0.2737 (+0.94z)| lr 2.00e-04 | 4170.28 ms | 32.4% bf16 MFU | 125667 tok/s step 13374/19560 | loss 3.272036 (-1.54z)| norm 0.2455 (-0.71z)| lr 2.00e-04 | 4164.04 ms | 32.4% bf16 MFU | 125679 tok/s step 13375/19560 | loss 3.344252 (+0.66z)| norm 0.2608 (+0.19z)| lr 2.00e-04 | 4164.71 ms | 32.4% bf16 MFU | 125690 tok/s step 13376/19560 | loss 3.277356 (-1.37z)| norm 0.2386 (-1.12z)| lr 2.00e-04 | 4161.45 ms | 32.4% bf16 MFU | 125705 tok/s step 13377/19560 | loss 3.305836 (-0.49z)| norm 0.2394 (-1.06z)| lr 2.00e-04 | 4169.56 ms | 32.4% bf16 MFU | 125706 tok/s step 13378/19560 | loss 3.317133 (-0.14z)| norm 0.2480 (-0.54z)| lr 2.00e-04 | 4170.14 ms | 32.4% bf16 MFU | 125707 tok/s step 13379/19560 | loss 3.283255 (-1.16z)| norm 0.2548 (-0.13z)| lr 2.00e-04 | 4169.23 ms | 32.4% bf16 MFU | 125710 tok/s step 13380/19560 | loss 3.340051 (+0.56z)| norm 0.2526 (-0.25z)| lr 2.00e-04 | 4173.00 ms | 32.4% bf16 MFU | 125706 tok/s step 13381/19560 | loss 3.298544 (-0.69z)| norm 0.2506 (-0.37z)| lr 2.00e-04 | 4181.61 ms | 32.3% bf16 MFU | 125690 tok/s step 13382/19560 | loss 3.281104 (-1.22z)| norm 0.2547 (-0.12z)| lr 2.00e-04 | 4177.43 ms | 32.3% bf16 MFU | 125680 tok/s step 13383/19560 | loss 3.311923 (-0.25z)| norm 0.2660 (+0.56z)| lr 1.99e-04 | 4168.84 ms | 32.4% bf16 MFU | 125685 tok/s step 13384/19560 | loss 3.320531 (+0.02z)| norm 0.2592 (+0.15z)| lr 1.99e-04 | 4171.55 ms | 32.4% bf16 MFU | 125684 tok/s step 13385/19560 | loss 3.376111 (+1.74z)| norm 0.2581 (+0.09z)| lr 1.99e-04 | 4174.19 ms | 32.3% bf16 MFU | 125680 tok/s step 13386/19560 | loss 3.254838 (-2.01z)| norm 0.2402 (-0.98z)| lr 1.99e-04 | 4165.47 ms | 32.4% bf16 MFU | 
125690 tok/s step 13387/19560 | loss 3.302788 (-0.52z)| norm 0.2649 (+0.50z)| lr 1.99e-04 | 4166.59 ms | 32.4% bf16 MFU | 125697 tok/s step 13388/19560 | loss 3.374803 (+1.68z)| norm 0.2543 (-0.13z)| lr 1.99e-04 | 4169.06 ms | 32.4% bf16 MFU | 125700 tok/s step 13389/19560 | loss 3.291539 (-0.86z)| norm 0.2369 (-1.16z)| lr 1.99e-04 | 4170.95 ms | 32.4% bf16 MFU | 125700 tok/s step 13390/19560 | loss 3.304636 (-0.45z)| norm 0.2504 (-0.33z)| lr 1.99e-04 | 4163.41 ms | 32.4% bf16 MFU | 125711 tok/s step 13391/19560 | loss 3.339593 (+0.65z)| norm 0.2458 (-0.61z)| lr 1.99e-04 | 4164.95 ms | 32.4% bf16 MFU | 125720 tok/s step 13392/19560 | loss 3.335977 (+0.54z)| norm 0.2386 (-1.04z)| lr 1.99e-04 | 4168.54 ms | 32.4% bf16 MFU | 125722 tok/s step 13393/19560 | loss 3.271948 (-1.49z)| norm 0.2509 (-0.29z)| lr 1.99e-04 | 4175.92 ms | 32.3% bf16 MFU | 125714 tok/s step 13394/19560 | loss 3.351727 (+1.03z)| norm 0.2482 (-0.47z)| lr 1.99e-04 | 4172.63 ms | 32.4% bf16 MFU | 125710 tok/s step 13395/19560 | loss 3.358121 (+1.22z)| norm 0.2363 (-1.18z)| lr 1.99e-04 | 4166.55 ms | 32.4% bf16 MFU | 125717 tok/s step 13396/19560 | loss 3.346295 (+0.83z)| norm 0.2575 (+0.10z)| lr 1.99e-04 | 4171.96 ms | 32.4% bf16 MFU | 125714 tok/s step 13397/19560 | loss 3.332447 (+0.40z)| norm 0.2353 (-1.26z)| lr 1.99e-04 | 4172.93 ms | 32.4% bf16 MFU | 125710 tok/s step 13398/19560 | loss 3.323352 (+0.11z)| norm 0.2477 (-0.50z)| lr 1.99e-04 | 4163.96 ms | 32.4% bf16 MFU | 125720 tok/s step 13399/19560 | loss 3.302243 (-0.55z)| norm 0.2499 (-0.38z)| lr 1.99e-04 | 4168.28 ms | 32.4% bf16 MFU | 125724 tok/s step 13400/19560 | loss 3.279574 (-1.27z)| norm 0.2306 (-1.54z)| lr 1.98e-04 | 4168.68 ms | 32.4% bf16 MFU | 125726 tok/s step 13401/19560 | loss 3.318531 (-0.03z)| norm 0.2430 (-0.80z)| lr 1.98e-04 | 4166.42 ms | 32.4% bf16 MFU | 125731 tok/s step 13402/19560 | loss 3.276581 (-1.35z)| norm 0.2417 (-0.87z)| lr 1.98e-04 | 4172.91 ms | 32.4% bf16 MFU | 125727 tok/s step 13403/19560 | loss 3.335133 
(+0.51z)| norm 0.2375 (-1.13z)| lr 1.98e-04 | 4166.16 ms | 32.4% bf16 MFU | 125733 tok/s step 13404/19560 | loss 3.362195 (+1.35z)| norm 0.2569 (+0.06z)| lr 1.98e-04 | 4175.35 ms | 32.3% bf16 MFU | 125724 tok/s step 13405/19560 | loss 3.416134 (+2.94z)| norm 0.2739 (+1.08z)| lr 1.98e-04 | 4165.35 ms | 32.4% bf16 MFU | 125732 tok/s step 13406/19560 | loss 3.333414 (+0.42z)| norm 0.2390 (-1.05z)| lr 1.98e-04 | 4171.58 ms | 32.4% bf16 MFU | 125729 tok/s step 13407/19560 | loss 3.285641 (-1.05z)| norm 0.2504 (-0.35z)| lr 1.98e-04 | 4165.44 ms | 32.4% bf16 MFU | 125736 tok/s step 13408/19560 | loss 3.264663 (-1.67z)| norm 0.2382 (-1.08z)| lr 1.98e-04 | 4161.98 ms | 32.4% bf16 MFU | 125748 tok/s step 13409/19560 | loss 3.309566 (-0.28z)| norm 0.2571 (+0.06z)| lr 1.98e-04 | 4176.54 ms | 32.3% bf16 MFU | 125737 tok/s step 13410/19560 | loss 3.366403 (+1.46z)| norm 0.2452 (-0.67z)| lr 1.98e-04 | 4164.81 ms | 32.4% bf16 MFU | 125744 tok/s step 13411/19560 | loss 3.316671 (-0.08z)| norm 0.2491 (-0.43z)| lr 1.98e-04 | 4165.93 ms | 32.4% bf16 MFU | 125750 tok/s step 13412/19560 | loss 3.302170 (-0.54z)| norm 0.2533 (-0.18z)| lr 1.98e-04 | 4173.96 ms | 32.3% bf16 MFU | 125743 tok/s step 13413/19560 | loss 3.382164 (+1.91z)| norm 0.2466 (-0.60z)| lr 1.98e-04 | 4164.08 ms | 32.4% bf16 MFU | 125751 tok/s step 13414/19560 | loss 3.310251 (-0.30z)| norm 0.2574 (+0.06z)| lr 1.98e-04 | 4165.35 ms | 32.4% bf16 MFU | 125757 tok/s step 13415/19560 | loss 3.280612 (-1.20z)| norm 0.2553 (-0.08z)| lr 1.98e-04 | 4165.11 ms | 32.4% bf16 MFU | 125763 tok/s step 13416/19560 | loss 3.350275 (+0.92z)| norm 0.2416 (-0.94z)| lr 1.98e-04 | 4168.55 ms | 32.4% bf16 MFU | 125763 tok/s step 13417/19560 | loss 3.428724 (+3.16z)| norm 0.2532 (-0.23z)| lr 1.98e-04 | 4164.12 ms | 32.4% bf16 MFU | 125770 tok/s step 13418/19560 | loss 3.341085 (+0.58z)| norm 0.2568 (-0.00z)| lr 1.97e-04 | 4180.93 ms | 32.3% bf16 MFU | 125752 tok/s step 13419/19560 | loss 3.330639 (+0.26z)| norm 0.2470 (-0.63z)| lr 1.97e-04 | 
4174.96 ms | 32.3% bf16 MFU | 125743 tok/s step 13420/19560 | loss 3.292416 (-0.87z)| norm 0.2360 (-1.33z)| lr 1.97e-04 | 4168.22 ms | 32.4% bf16 MFU | 125745 tok/s step 13421/19560 | loss 3.274685 (-1.40z)| norm 0.2627 (+0.36z)| lr 1.97e-04 | 4166.40 ms | 32.4% bf16 MFU | 125750 tok/s step 13422/19560 | loss 3.298161 (-0.70z)| norm 0.2326 (-1.54z)| lr 1.97e-04 | 4174.90 ms | 32.3% bf16 MFU | 125741 tok/s step 13423/19560 | loss 3.358361 (+1.07z)| norm 0.2474 (-0.61z)| lr 1.97e-04 | 4167.10 ms | 32.4% bf16 MFU | 125745 tok/s step 13424/19560 | loss 3.332807 (+0.31z)| norm 0.2255 (-1.95z)| lr 1.97e-04 | 4158.51 ms | 32.5% bf16 MFU | 125762 tok/s step 13425/19560 | loss 3.274045 (-1.42z)| norm 0.2444 (-0.77z)| lr 1.97e-04 | 4165.79 ms | 32.4% bf16 MFU | 125766 tok/s step 13426/19560 | loss 3.372064 (+1.46z)| norm 0.2578 (+0.07z)| lr 1.97e-04 | 4166.51 ms | 32.4% bf16 MFU | 125770 tok/s step 13427/19560 | loss 3.346287 (+0.70z)| norm 0.2482 (-0.53z)| lr 1.97e-04 | 4173.42 ms | 32.4% bf16 MFU | 125762 tok/s step 13428/19560 | loss 3.367320 (+1.30z)| norm 0.2561 (-0.04z)| lr 1.97e-04 | 4170.55 ms | 32.4% bf16 MFU | 125760 tok/s step 13429/19560 | loss 3.321675 (-0.04z)| norm 0.2407 (-1.01z)| lr 1.97e-04 | 4160.83 ms | 32.4% bf16 MFU | 125772 tok/s step 13430/19560 | loss 3.265275 (-1.66z)| norm 0.2580 (+0.06z)| lr 1.97e-04 | 4163.82 ms | 32.4% bf16 MFU | 125779 tok/s step 13431/19560 | loss 3.288861 (-0.97z)| norm 0.2466 (-0.67z)| lr 1.97e-04 | 4167.97 ms | 32.4% bf16 MFU | 125780 tok/s step 13432/19560 | loss 3.283766 (-1.10z)| norm 0.2658 (+0.56z)| lr 1.97e-04 | 4165.38 ms | 32.4% bf16 MFU | 125784 tok/s step 13433/19560 | loss 3.326087 (+0.11z)| norm 0.2346 (-1.42z)| lr 1.97e-04 | 4165.70 ms | 32.4% bf16 MFU | 125788 tok/s step 13434/19560 | loss 3.351915 (+0.85z)| norm 0.2577 (+0.06z)| lr 1.97e-04 | 4165.14 ms | 32.4% bf16 MFU | 125792 tok/s step 13435/19560 | loss 3.259079 (-1.82z)| norm 0.2370 (-1.25z)| lr 1.96e-04 | 4173.52 ms | 32.4% bf16 MFU | 125784 tok/s step 
13436/19560 | loss 3.305081 (-0.48z)| norm 0.2677 (+0.74z)| lr 1.96e-04 | 4172.24 ms | 32.4% bf16 MFU | 125778 tok/s step 13437/19560 | loss 3.303357 (-0.53z)| norm 0.2407 (-1.02z)| lr 1.96e-04 | 4160.54 ms | 32.5% bf16 MFU | 125790 tok/s step 13438/19560 | loss 3.357537 (+1.02z)| norm 0.2490 (-0.47z)| lr 1.96e-04 | 4165.45 ms | 32.4% bf16 MFU | 125793 tok/s step 13439/19560 | loss 3.319748 (-0.07z)| norm 0.2531 (-0.18z)| lr 1.96e-04 | 4163.50 ms | 32.4% bf16 MFU | 125800 tok/s step 13440/19560 | loss 3.297217 (-0.71z)| norm 0.2474 (-0.55z)| lr 1.96e-04 | 4163.80 ms | 32.4% bf16 MFU | 125806 tok/s step 13441/19560 | loss 3.315463 (-0.17z)| norm 0.2781 (+1.51z)| lr 1.96e-04 | 4160.03 ms | 32.5% bf16 MFU | 125817 tok/s step 13442/19560 | loss 3.270776 (-1.46z)| norm 0.2324 (-1.53z)| lr 1.96e-04 | 4180.22 ms | 32.3% bf16 MFU | 125797 tok/s step 13443/19560 | loss 3.302310 (-0.55z)| norm 0.2697 (+0.96z)| lr 1.96e-04 | 4168.63 ms | 32.4% bf16 MFU | 125796 tok/s step 13444/19560 | loss 3.313188 (-0.22z)| norm 0.2417 (-0.89z)| lr 1.96e-04 | 4175.27 ms | 32.3% bf16 MFU | 125784 tok/s step 13445/19560 | loss 3.311026 (-0.27z)| norm 0.2642 (+0.62z)| lr 1.96e-04 | 4166.75 ms | 32.4% bf16 MFU | 125787 tok/s step 13446/19560 | loss 3.325657 (+0.15z)| norm 0.2535 (-0.10z)| lr 1.96e-04 | 4172.07 ms | 32.4% bf16 MFU | 125781 tok/s step 13447/19560 | loss 3.344723 (+0.70z)| norm 0.2768 (+1.44z)| lr 1.96e-04 | 4172.42 ms | 32.4% bf16 MFU | 125774 tok/s step 13448/19560 | loss 3.323272 (+0.07z)| norm 0.2503 (-0.33z)| lr 1.96e-04 | 4163.85 ms | 32.4% bf16 MFU | 125781 tok/s step 13449/19560 | loss 3.293228 (-0.81z)| norm 0.2586 (+0.23z)| lr 1.96e-04 | 4165.83 ms | 32.4% bf16 MFU | 125785 tok/s step 13450/19560 | loss 3.271978 (-1.42z)| norm 0.2433 (-0.79z)| lr 1.96e-04 | 4163.70 ms | 32.4% bf16 MFU | 125792 tok/s step 13451/19560 | loss 3.270184 (-1.45z)| norm 0.2607 (+0.37z)| lr 1.96e-04 | 4163.08 ms | 32.4% bf16 MFU | 125799 tok/s step 13452/19560 | loss 3.335638 (+0.43z)| norm 
0.2662 (+0.72z)| lr 1.95e-04 | 4170.33 ms | 32.4% bf16 MFU | 125795 tok/s step 13453/19560 | loss 3.306832 (-0.40z)| norm 0.2537 (-0.11z)| lr 1.95e-04 | 4166.38 ms | 32.4% bf16 MFU | 125797 tok/s step 13454/19560 | loss 3.331750 (+0.33z)| norm 0.2474 (-0.52z)| lr 1.95e-04 | 4640.17 ms | 29.1% bf16 MFU | 125157 tok/s step 13455/19560 | loss 3.285716 (-0.99z)| norm 0.2355 (-1.31z)| lr 1.95e-04 | 4768.17 ms | 28.3% bf16 MFU | 124397 tok/s step 13456/19560 | loss 3.344379 (+0.71z)| norm 0.2532 (-0.13z)| lr 1.95e-04 | 4463.36 ms | 30.3% bf16 MFU | 124050 tok/s step 13457/19560 | loss 3.281895 (-1.12z)| norm 0.2339 (-1.40z)| lr 1.95e-04 | 4484.73 ms | 30.1% bf16 MFU | 123693 tok/s step 13458/19560 | loss 3.310262 (-0.29z)| norm 0.2615 (+0.41z)| lr 1.95e-04 | 4384.22 ms | 30.8% bf16 MFU | 123487 tok/s step 13459/19560 | loss 3.340468 (+0.61z)| norm 0.2355 (-1.29z)| lr 1.95e-04 | 4618.25 ms | 29.2% bf16 MFU | 122989 tok/s step 13460/19560 | loss 3.365623 (+1.34z)| norm 0.2473 (-0.51z)| lr 1.95e-04 | 4340.62 ms | 31.1% bf16 MFU | 122879 tok/s step 13461/19560 | loss 3.281226 (-1.15z)| norm 0.2669 (+0.76z)| lr 1.95e-04 | 4480.93 ms | 30.1% bf16 MFU | 122585 tok/s step 13462/19560 | loss 3.306472 (-0.40z)| norm 0.2229 (-2.08z)| lr 1.95e-04 | 4544.65 ms | 29.7% bf16 MFU | 122224 tok/s step 13463/19560 | loss 3.349070 (+0.85z)| norm 0.2837 (+1.81z)| lr 1.95e-04 | 4376.03 ms | 30.9% bf16 MFU | 122104 tok/s step 13464/19560 | loss 3.343988 (+0.69z)| norm 0.2348 (-1.30z)| lr 1.95e-04 | 4198.17 ms | 32.2% bf16 MFU | 122243 tok/s step 13465/19560 | loss 3.308762 (-0.33z)| norm 0.2692 (+0.91z)| lr 1.95e-04 | 4286.79 ms | 31.5% bf16 MFU | 122246 tok/s step 13466/19560 | loss 3.319734 (+0.00z)| norm 0.2369 (-1.16z)| lr 1.95e-04 | 4242.28 ms | 31.8% bf16 MFU | 122313 tok/s step 13467/19560 | loss 3.295444 (-0.71z)| norm 0.2432 (-0.75z)| lr 1.95e-04 | 4203.79 ms | 32.1% bf16 MFU | 122433 tok/s step 13468/19560 | loss 3.395581 (+2.20z)| norm 0.2557 (+0.06z)| lr 1.95e-04 | 4170.37 ms | 
32.4% bf16 MFU | 122597 tok/s step 13469/19560 | loss 3.273609 (-1.33z)| norm 0.2754 (+1.36z)| lr 1.94e-04 | 4167.59 ms | 32.4% bf16 MFU | 122757 tok/s step 13470/19560 | loss 3.313895 (-0.16z)| norm 0.2558 (+0.07z)| lr 1.94e-04 | 4205.21 ms | 32.1% bf16 MFU | 122853 tok/s step 13471/19560 | loss 3.247223 (-2.06z)| norm 0.2640 (+0.64z)| lr 1.94e-04 | 4225.57 ms | 32.0% bf16 MFU | 122914 tok/s step 13472/19560 | loss 3.362718 (+1.22z)| norm 0.2640 (+0.63z)| lr 1.94e-04 | 4232.66 ms | 31.9% bf16 MFU | 122962 tok/s step 13473/19560 | loss 3.340173 (+0.57z)| norm 0.2311 (-1.56z)| lr 1.94e-04 | 4168.41 ms | 32.4% bf16 MFU | 123103 tok/s step 13474/19560 | loss 3.279832 (-1.14z)| norm 0.2590 (+0.32z)| lr 1.94e-04 | 4211.34 ms | 32.1% bf16 MFU | 123172 tok/s step 13475/19560 | loss 3.301253 (-0.52z)| norm 0.2438 (-0.70z)| lr 1.94e-04 | 4175.11 ms | 32.3% bf16 MFU | 123292 tok/s step 13476/19560 | loss 3.298801 (-0.59z)| norm 0.2533 (-0.05z)| lr 1.94e-04 | 4166.54 ms | 32.4% bf16 MFU | 123419 tok/s step 13477/19560 | loss 3.319755 (+0.00z)| norm 0.2680 (+0.94z)| lr 1.94e-04 | 4189.97 ms | 32.2% bf16 MFU | 123505 tok/s step 13478/19560 | loss 3.356093 (+1.03z)| norm 0.2533 (-0.03z)| lr 1.94e-04 | 4167.70 ms | 32.4% bf16 MFU | 123620 tok/s step 13479/19560 | loss 3.332930 (+0.37z)| norm 0.2702 (+1.17z)| lr 1.94e-04 | 4174.84 ms | 32.3% bf16 MFU | 123718 tok/s step 13480/19560 | loss 3.290685 (-0.82z)| norm 0.2531 (-0.03z)| lr 1.94e-04 | 4250.07 ms | 31.8% bf16 MFU | 123700 tok/s step 13481/19560 | loss 3.338628 (+0.53z)| norm 0.2457 (-0.56z)| lr 1.94e-04 | 4177.51 ms | 32.3% bf16 MFU | 123790 tok/s step 13482/19560 | loss 3.319794 (-0.01z)| norm 0.2511 (-0.15z)| lr 1.94e-04 | 4171.86 ms | 32.4% bf16 MFU | 123884 tok/s step 13483/19560 | loss 3.345222 (+0.73z)| norm 0.2412 (-0.88z)| lr 1.94e-04 | 4179.95 ms | 32.3% bf16 MFU | 123961 tok/s step 13484/19560 | loss 3.408763 (+2.47z)| norm 0.2709 (+1.37z)| lr 1.94e-04 | 4175.14 ms | 32.3% bf16 MFU | 124042 tok/s step 13485/19560 
| loss 3.380280 (+1.64z)| norm 0.2338 (-1.41z)| lr 1.94e-04 | 4175.13 ms | 32.3% bf16 MFU | 124119 tok/s step 13486/19560 | loss 3.342137 (+0.59z)| norm 0.2556 (+0.23z)| lr 1.93e-04 | 4176.27 ms | 32.3% bf16 MFU | 124190 tok/s step 13487/19560 | loss 3.339060 (+0.50z)| norm 0.2586 (+0.46z)| lr 1.93e-04 | 4217.76 ms | 32.0% bf16 MFU | 124195 tok/s step 13488/19560 | loss 3.349233 (+0.78z)| norm 0.2322 (-1.50z)| lr 1.93e-04 | 4175.96 ms | 32.3% bf16 MFU | 124263 tok/s step 13489/19560 | loss 3.377440 (+1.54z)| norm 0.2789 (+2.03z)| lr 1.93e-04 | 4172.59 ms | 32.4% bf16 MFU | 124332 tok/s step 13490/19560 | loss 3.325911 (+0.11z)| norm 0.2437 (-0.63z)| lr 1.93e-04 | 4166.92 ms | 32.4% bf16 MFU | 124407 tok/s step 13491/19560 | loss 3.363309 (+1.15z)| norm 0.2671 (+1.17z)| lr 1.93e-04 | 4177.63 ms | 32.3% bf16 MFU | 124462 tok/s step 13492/19560 | loss 3.330520 (+0.26z)| norm 0.2408 (-0.84z)| lr 1.93e-04 | 4157.76 ms | 32.5% bf16 MFU | 124543 tok/s step 13493/19560 | loss 3.301064 (-0.56z)| norm 0.2592 (+0.58z)| lr 1.93e-04 | 4163.18 ms | 32.4% bf16 MFU | 124613 tok/s step 13494/19560 | loss 3.369832 (+1.37z)| norm 0.2479 (-0.30z)| lr 1.93e-04 | 4359.52 ms | 31.0% bf16 MFU | 124395 tok/s step 13495/19560 | loss 3.333947 (+0.36z)| norm 0.2661 (+1.15z)| lr 1.93e-04 | 4220.60 ms | 32.0% bf16 MFU | 124387 tok/s step 13496/19560 | loss 3.306315 (-0.42z)| norm 0.2937 (+3.20z)| lr 1.93e-04 | 4223.11 ms | 32.0% bf16 MFU | 124375 tok/s step 13497/19560 | loss 3.315751 (-0.15z)| norm 0.2644 (+0.96z)| lr 1.93e-04 | 4174.22 ms | 32.3% bf16 MFU | 124436 tok/s step 13498/19560 | loss 3.374751 (+1.52z)| norm 0.2702 (+1.39z)| lr 1.93e-04 | 4173.03 ms | 32.4% bf16 MFU | 124496 tok/s step 13499/19560 | loss 3.361451 (+1.13z)| norm 0.2692 (+1.31z)| lr 1.93e-04 | 4176.09 ms | 32.3% bf16 MFU | 124549 tok/s step 13500/19560 | loss 3.339004 (+0.49z)| norm 0.3028 (+3.68z)| lr 1.93e-04 | 4272.51 ms | 31.6% bf16 MFU | 124457 tok/s val loss 3.309352 evaluating HellaSwag: 0/1256 evaluating 
HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 
evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2999/10042 = 0.298646 step 13501/19560 | loss 3.283669 (-1.07z)| norm 0.2514 (-0.06z)| lr 1.93e-04 | 
4208.97 ms | 32.1% bf16 MFU | 124462 tok/s step 13502/19560 | loss 3.347617 (+0.72z)| norm 0.2681 (+1.16z)| lr 1.93e-04 | 4223.32 ms | 32.0% bf16 MFU | 124446 tok/s step 13503/19560 | loss 3.394301 (+2.00z)| norm 0.2723 (+1.45z)| lr 1.93e-04 | 4228.38 ms | 31.9% bf16 MFU | 124423 tok/s step 13504/19560 | loss 3.300346 (-0.63z)| norm 0.2555 (+0.21z)| lr 1.92e-04 | 4171.10 ms | 32.4% bf16 MFU | 124487 tok/s step 13505/19560 | loss 3.293214 (-0.83z)| norm 0.2955 (+3.01z)| lr 1.92e-04 | 4189.39 ms | 32.2% bf16 MFU | 124520 tok/s step 13506/19560 | loss 3.316815 (-0.17z)| norm 0.2572 (+0.29z)| lr 1.92e-04 | 4229.89 ms | 31.9% bf16 MFU | 124491 tok/s step 13507/19560 | loss 3.370177 (+1.31z)| norm 0.2717 (+1.30z)| lr 1.92e-04 | 4173.55 ms | 32.4% bf16 MFU | 124548 tok/s step 13508/19560 | loss 3.321474 (-0.05z)| norm 0.2637 (+0.73z)| lr 1.92e-04 | 4167.60 ms | 32.4% bf16 MFU | 124611 tok/s step 13509/19560 | loss 3.263819 (-1.64z)| norm 0.2639 (+0.73z)| lr 1.92e-04 | 4172.39 ms | 32.4% bf16 MFU | 124663 tok/s step 13510/19560 | loss 3.283380 (-1.10z)| norm 0.2336 (-1.37z)| lr 1.92e-04 | 4257.51 ms | 31.7% bf16 MFU | 124587 tok/s step 13511/19560 | loss 3.353812 (+0.84z)| norm 0.2579 (+0.33z)| lr 1.92e-04 | 4210.90 ms | 32.1% bf16 MFU | 124583 tok/s step 13512/19560 | loss 3.317442 (-0.16z)| norm 0.2425 (-0.73z)| lr 1.92e-04 | 4187.74 ms | 32.2% bf16 MFU | 124614 tok/s step 13513/19560 | loss 3.361931 (+1.08z)| norm 0.2383 (-1.02z)| lr 1.92e-04 | 4263.86 ms | 31.7% bf16 MFU | 124531 tok/s step 13514/19560 | loss 3.295460 (-0.79z)| norm 0.2448 (-0.57z)| lr 1.92e-04 | 4185.87 ms | 32.3% bf16 MFU | 124567 tok/s step 13515/19560 | loss 3.290314 (-0.93z)| norm 0.2258 (-1.84z)| lr 1.92e-04 | 4194.31 ms | 32.2% bf16 MFU | 124589 tok/s step 13516/19560 | loss 3.319012 (-0.11z)| norm 0.2389 (-0.93z)| lr 1.92e-04 | 4179.13 ms | 32.3% bf16 MFU | 124632 tok/s step 13517/19560 | loss 3.360900 (+1.06z)| norm 0.2594 (+0.46z)| lr 1.92e-04 | 4177.28 ms | 32.3% bf16 MFU | 124676 tok/s step 
13518/19560 | loss 3.349649 (+0.73z)| norm 0.2555 (+0.19z)| lr 1.92e-04 | 4171.64 ms | 32.4% bf16 MFU | 124726 tok/s step 13519/19560 | loss 3.293775 (-0.84z)| norm 0.2611 (+0.57z)| lr 1.92e-04 | 4168.13 ms | 32.4% bf16 MFU | 124779 tok/s step 13520/19560 | loss 3.343987 (+0.58z)| norm 0.2562 (+0.22z)| lr 1.92e-04 | 4192.26 ms | 32.2% bf16 MFU | 124793 tok/s step 13521/19560 | loss 3.297718 (-0.74z)| norm 0.2381 (-1.02z)| lr 1.91e-04 | 4176.96 ms | 32.3% bf16 MFU | 124829 tok/s step 13522/19560 | loss 3.319938 (-0.10z)| norm 0.2597 (+0.46z)| lr 1.91e-04 | 4174.53 ms | 32.3% bf16 MFU | 124867 tok/s step 13523/19560 | loss 3.298008 (-0.71z)| norm 0.2442 (-0.61z)| lr 1.91e-04 | 4160.80 ms | 32.4% bf16 MFU | 124924 tok/s step 13524/19560 | loss 3.276889 (-1.29z)| norm 0.2424 (-0.72z)| lr 1.91e-04 | 4227.40 ms | 31.9% bf16 MFU | 124879 tok/s step 13525/19560 | loss 3.360996 (+1.08z)| norm 0.2439 (-0.63z)| lr 1.91e-04 | 4177.24 ms | 32.3% bf16 MFU | 124911 tok/s step 13526/19560 | loss 3.298513 (-0.68z)| norm 0.2436 (-0.65z)| lr 1.91e-04 | 4171.45 ms | 32.4% bf16 MFU | 124950 tok/s step 13527/19560 | loss 3.376045 (+1.48z)| norm 0.2357 (-1.18z)| lr 1.91e-04 | 4167.18 ms | 32.4% bf16 MFU | 124993 tok/s step 13528/19560 | loss 3.327752 (+0.12z)| norm 0.2640 (+0.75z)| lr 1.91e-04 | 4165.97 ms | 32.4% bf16 MFU | 125036 tok/s step 13529/19560 | loss 3.339806 (+0.45z)| norm 0.2395 (-0.94z)| lr 1.91e-04 | 4175.67 ms | 32.3% bf16 MFU | 125062 tok/s step 13530/19560 | loss 3.335547 (+0.32z)| norm 0.2749 (+1.48z)| lr 1.91e-04 | 4167.09 ms | 32.4% bf16 MFU | 125099 tok/s step 13531/19560 | loss 3.325472 (+0.04z)| norm 0.2449 (-0.58z)| lr 1.91e-04 | 4186.54 ms | 32.3% bf16 MFU | 125106 tok/s step 13532/19560 | loss 3.356024 (+0.91z)| norm 0.2646 (+0.77z)| lr 1.91e-04 | 4201.78 ms | 32.1% bf16 MFU | 125090 tok/s step 13533/19560 | loss 3.372927 (+1.42z)| norm 0.2417 (-0.79z)| lr 1.91e-04 | 4200.85 ms | 32.1% bf16 MFU | 125075 tok/s step 13534/19560 | loss 3.341582 (+0.52z)| norm 
0.2818 (+1.93z)| lr 1.91e-04 | 4166.17 ms | 32.4% bf16 MFU | 125114 tok/s step 13535/19560 | loss 3.283574 (-1.16z)| norm 0.2401 (-0.91z)| lr 1.91e-04 | 4212.79 ms | 32.0% bf16 MFU | 125081 tok/s step 13536/19560 | loss 3.336851 (+0.37z)| norm 0.2578 (+0.29z)| lr 1.91e-04 | 4232.33 ms | 31.9% bf16 MFU | 125021 tok/s step 13537/19560 | loss 3.329989 (+0.16z)| norm 0.2608 (+0.49z)| lr 1.91e-04 | 4162.81 ms | 32.4% bf16 MFU | 125067 tok/s step 13538/19560 | loss 3.320359 (-0.11z)| norm 0.2392 (-0.98z)| lr 1.90e-04 | 4170.62 ms | 32.4% bf16 MFU | 125099 tok/s step 13539/19560 | loss 3.379641 (+1.60z)| norm 0.2496 (-0.27z)| lr 1.90e-04 | 4178.42 ms | 32.3% bf16 MFU | 125118 tok/s step 13540/19560 | loss 3.249065 (-2.15z)| norm 0.2427 (-0.73z)| lr 1.90e-04 | 4165.25 ms | 32.4% bf16 MFU | 125155 tok/s step 13541/19560 | loss 3.351935 (+0.81z)| norm 0.2477 (-0.40z)| lr 1.90e-04 | 4182.51 ms | 32.3% bf16 MFU | 125165 tok/s step 13542/19560 | loss 3.335226 (+0.32z)| norm 0.2595 (+0.40z)| lr 1.90e-04 | 4171.64 ms | 32.4% bf16 MFU | 125191 tok/s step 13543/19560 | loss 3.347017 (+0.65z)| norm 0.2503 (-0.22z)| lr 1.90e-04 | 4180.32 ms | 32.3% bf16 MFU | 125202 tok/s step 13544/19560 | loss 3.299185 (-0.73z)| norm 0.2420 (-0.78z)| lr 1.90e-04 | 4185.10 ms | 32.3% bf16 MFU | 125206 tok/s step 13545/19560 | loss 3.317358 (-0.18z)| norm 0.2525 (-0.07z)| lr 1.90e-04 | 4183.81 ms | 32.3% bf16 MFU | 125211 tok/s step 13546/19560 | loss 3.303371 (-0.59z)| norm 0.2547 (+0.08z)| lr 1.90e-04 | 4170.25 ms | 32.4% bf16 MFU | 125237 tok/s step 13547/19560 | loss 3.282820 (-1.19z)| norm 0.2418 (-0.79z)| lr 1.90e-04 | 4180.95 ms | 32.3% bf16 MFU | 125245 tok/s step 13548/19560 | loss 3.355112 (+0.95z)| norm 0.2445 (-0.61z)| lr 1.90e-04 | 4169.58 ms | 32.4% bf16 MFU | 125270 tok/s step 13549/19560 | loss 3.346712 (+0.69z)| norm 0.2514 (-0.13z)| lr 1.90e-04 | 4167.52 ms | 32.4% bf16 MFU | 125296 tok/s step 13550/19560 | loss 3.364333 (+1.20z)| norm 0.2258 (-1.87z)| lr 1.90e-04 | 4179.48 ms | 
32.3% bf16 MFU | 125304 tok/s step 13551/19560 | loss 3.342013 (+0.54z)| norm 0.2544 (+0.07z)| lr 1.90e-04 | 4173.09 ms | 32.4% bf16 MFU | 125320 tok/s step 13552/19560 | loss 3.286535 (-1.11z)| norm 0.2391 (-0.99z)| lr 1.90e-04 | 4183.67 ms | 32.3% bf16 MFU | 125320 tok/s step 13553/19560 | loss 3.292213 (-0.95z)| norm 0.2642 (+0.72z)| lr 1.90e-04 | 4184.99 ms | 32.3% bf16 MFU | 125318 tok/s step 13554/19560 | loss 3.373014 (+1.48z)| norm 0.2351 (-1.26z)| lr 1.90e-04 | 4174.84 ms | 32.3% bf16 MFU | 125331 tok/s step 13555/19560 | loss 3.298965 (-0.74z)| norm 0.2498 (-0.25z)| lr 1.90e-04 | 4186.25 ms | 32.3% bf16 MFU | 125327 tok/s step 13556/19560 | loss 3.316542 (-0.20z)| norm 0.2438 (-0.66z)| lr 1.89e-04 | 4208.40 ms | 32.1% bf16 MFU | 125290 tok/s step 13557/19560 | loss 3.449744 (+3.60z)| norm 0.2871 (+2.23z)| lr 1.89e-04 | 4173.03 ms | 32.4% bf16 MFU | 125307 tok/s step 13558/19560 | loss 3.314347 (-0.30z)| norm 0.2765 (+1.50z)| lr 1.89e-04 | 4177.32 ms | 32.3% bf16 MFU | 125317 tok/s step 13559/19560 | loss 3.311704 (-0.38z)| norm 0.2456 (-0.55z)| lr 1.89e-04 | 4172.53 ms | 32.4% bf16 MFU | 125334 tok/s step 13560/19560 | loss 3.316909 (-0.24z)| norm 0.2524 (-0.09z)| lr 1.89e-04 | 4169.60 ms | 32.4% bf16 MFU | 125354 tok/s step 13561/19560 | loss 3.329570 (+0.13z)| norm 0.2700 (+1.06z)| lr 1.89e-04 | 4162.85 ms | 32.4% bf16 MFU | 125384 tok/s step 13562/19560 | loss 3.356276 (+0.91z)| norm 0.2553 (+0.08z)| lr 1.89e-04 | 4166.60 ms | 32.4% bf16 MFU | 125406 tok/s step 13563/19560 | loss 3.292668 (-0.96z)| norm 0.2490 (-0.34z)| lr 1.89e-04 | 4176.77 ms | 32.3% bf16 MFU | 125412 tok/s step 13564/19560 | loss 3.276434 (-1.43z)| norm 0.2443 (-0.65z)| lr 1.89e-04 | 4172.58 ms | 32.4% bf16 MFU | 125424 tok/s step 13565/19560 | loss 3.340554 (+0.45z)| norm 0.2495 (-0.31z)| lr 1.89e-04 | 4170.85 ms | 32.4% bf16 MFU | 125438 tok/s step 13566/19560 | loss 3.301701 (-0.68z)| norm 0.2426 (-0.76z)| lr 1.89e-04 | 4182.39 ms | 32.3% bf16 MFU | 125434 tok/s step 13567/19560 
| loss 3.349468 (+0.71z)| norm 0.2664 (+0.82z)| lr 1.89e-04 | 4170.15 ms | 32.4% bf16 MFU | 125448 tok/s step 13568/19560 | loss 3.307886 (-0.51z)| norm 0.2421 (-0.80z)| lr 1.89e-04 | 4175.21 ms | 32.3% bf16 MFU | 125454 tok/s step 13569/19560 | loss 3.326987 (+0.05z)| norm 0.2619 (+0.53z)| lr 1.89e-04 | 4173.14 ms | 32.4% bf16 MFU | 125463 tok/s step 13570/19560 | loss 3.289364 (-1.07z)| norm 0.2370 (-1.15z)| lr 1.89e-04 | 4174.75 ms | 32.3% bf16 MFU | 125470 tok/s step 13571/19560 | loss 3.325681 (+0.00z)| norm 0.2677 (+0.93z)| lr 1.89e-04 | 4170.90 ms | 32.4% bf16 MFU | 125481 tok/s step 13572/19560 | loss 3.305041 (-0.61z)| norm 0.2330 (-1.41z)| lr 1.89e-04 | 4167.07 ms | 32.4% bf16 MFU | 125498 tok/s step 13573/19560 | loss 3.333859 (+0.24z)| norm 0.2409 (-0.86z)| lr 1.88e-04 | 4177.69 ms | 32.3% bf16 MFU | 125498 tok/s step 13574/19560 | loss 3.323860 (-0.06z)| norm 0.2470 (-0.45z)| lr 1.88e-04 | 4183.84 ms | 32.3% bf16 MFU | 125489 tok/s step 13575/19560 | loss 3.283271 (-1.24z)| norm 0.2248 (-1.90z)| lr 1.88e-04 | 4180.91 ms | 32.3% bf16 MFU | 125484 tok/s step 13576/19560 | loss 3.290731 (-1.01z)| norm 0.2487 (-0.30z)| lr 1.88e-04 | 4164.20 ms | 32.4% bf16 MFU | 125505 tok/s step 13577/19560 | loss 3.321361 (-0.11z)| norm 0.2366 (-1.10z)| lr 1.88e-04 | 4190.74 ms | 32.2% bf16 MFU | 125485 tok/s step 13578/19560 | loss 3.316821 (-0.26z)| norm 0.2453 (-0.52z)| lr 1.88e-04 | 4177.73 ms | 32.3% bf16 MFU | 125486 tok/s step 13579/19560 | loss 3.319624 (-0.19z)| norm 0.2539 (+0.06z)| lr 1.88e-04 | 4166.32 ms | 32.4% bf16 MFU | 125503 tok/s step 13580/19560 | loss 3.316095 (-0.29z)| norm 0.2438 (-0.60z)| lr 1.88e-04 | 4169.35 ms | 32.4% bf16 MFU | 125516 tok/s step 13581/19560 | loss 3.252202 (-2.16z)| norm 0.2556 (+0.18z)| lr 1.88e-04 | 4175.33 ms | 32.3% bf16 MFU | 125518 tok/s step 13582/19560 | loss 3.330823 (+0.16z)| norm 0.2578 (+0.32z)| lr 1.88e-04 | 4174.14 ms | 32.3% bf16 MFU | 125523 tok/s step 13583/19560 | loss 3.338398 (+0.37z)| norm 0.2636 (+0.70z)| 
lr 1.88e-04 | 4186.69 ms | 32.2% bf16 MFU | 125508 tok/s step 13584/19560 | loss 3.286932 (-1.14z)| norm 0.2288 (-1.61z)| lr 1.88e-04 | 4173.19 ms | 32.4% bf16 MFU | 125514 tok/s step 13585/19560 | loss 3.391997 (+1.93z)| norm 0.2720 (+1.25z)| lr 1.88e-04 | 4171.92 ms | 32.4% bf16 MFU | 125522 tok/s step 13586/19560 | loss 3.370238 (+1.27z)| norm 0.2485 (-0.31z)| lr 1.88e-04 | 4168.89 ms | 32.4% bf16 MFU | 125534 tok/s step 13587/19560 | loss 3.332019 (+0.16z)| norm 0.2455 (-0.52z)| lr 1.88e-04 | 4196.90 ms | 32.2% bf16 MFU | 125503 tok/s step 13588/19560 | loss 3.327575 (+0.04z)| norm 0.2694 (+1.06z)| lr 1.88e-04 | 4176.14 ms | 32.3% bf16 MFU | 125505 tok/s step 13589/19560 | loss 3.309514 (-0.50z)| norm 0.2415 (-0.78z)| lr 1.88e-04 | 4182.72 ms | 32.3% bf16 MFU | 125497 tok/s step 13590/19560 | loss 3.445965 (+3.34z)| norm 0.2823 (+1.91z)| lr 1.87e-04 | 4182.62 ms | 32.3% bf16 MFU | 125490 tok/s step 13591/19560 | loss 3.323191 (-0.12z)| norm 0.2622 (+0.59z)| lr 1.87e-04 | 4181.85 ms | 32.3% bf16 MFU | 125484 tok/s step 13592/19560 | loss 3.321079 (-0.17z)| norm 0.2477 (-0.40z)| lr 1.87e-04 | 4254.20 ms | 31.7% bf16 MFU | 125372 tok/s step 13593/19560 | loss 3.277600 (-1.39z)| norm 0.2509 (-0.17z)| lr 1.87e-04 | 4178.19 ms | 32.3% bf16 MFU | 125377 tok/s step 13594/19560 | loss 3.302402 (-0.69z)| norm 0.2464 (-0.49z)| lr 1.87e-04 | 4163.70 ms | 32.4% bf16 MFU | 125404 tok/s step 13595/19560 | loss 3.395938 (+1.89z)| norm 0.2625 (+0.60z)| lr 1.87e-04 | 4178.24 ms | 32.3% bf16 MFU | 125408 tok/s step 13596/19560 | loss 3.336567 (+0.26z)| norm 0.2496 (-0.28z)| lr 1.87e-04 | 4180.10 ms | 32.3% bf16 MFU | 125409 tok/s step 13597/19560 | loss 3.405950 (+2.17z)| norm 0.2700 (+1.13z)| lr 1.87e-04 | 4175.25 ms | 32.3% bf16 MFU | 125417 tok/s step 13598/19560 | loss 3.326744 (-0.04z)| norm 0.2539 (+0.02z)| lr 1.87e-04 | 4207.53 ms | 32.1% bf16 MFU | 125377 tok/s step 13599/19560 | loss 3.265710 (-1.76z)| norm 0.2510 (-0.18z)| lr 1.87e-04 | 4164.79 ms | 32.4% bf16 MFU | 
125402 tok/s step 13600/19560 | loss 3.253801 (-2.05z)| norm 0.2365 (-1.15z)| lr 1.87e-04 | 4176.99 ms | 32.3% bf16 MFU | 125408 tok/s step 13601/19560 | loss 3.251710 (-2.05z)| norm 0.2426 (-0.75z)| lr 1.87e-04 | 4178.60 ms | 32.3% bf16 MFU | 125411 tok/s step 13602/19560 | loss 3.334665 (+0.20z)| norm 0.2433 (-0.69z)| lr 1.87e-04 | 4168.65 ms | 32.4% bf16 MFU | 125429 tok/s step 13603/19560 | loss 3.306195 (-0.58z)| norm 0.2247 (-1.94z)| lr 1.87e-04 | 4184.68 ms | 32.3% bf16 MFU | 125422 tok/s step 13604/19560 | loss 3.330698 (+0.08z)| norm 0.2561 (+0.21z)| lr 1.87e-04 | 4171.22 ms | 32.4% bf16 MFU | 125435 tok/s step 13605/19560 | loss 3.359523 (+0.87z)| norm 0.2445 (-0.58z)| lr 1.87e-04 | 4204.93 ms | 32.1% bf16 MFU | 125398 tok/s step 13606/19560 | loss 3.318114 (-0.26z)| norm 0.2384 (-0.98z)| lr 1.87e-04 | 4183.35 ms | 32.3% bf16 MFU | 125394 tok/s step 13607/19560 | loss 3.318190 (-0.26z)| norm 0.2748 (+1.49z)| lr 1.87e-04 | 4180.15 ms | 32.3% bf16 MFU | 125396 tok/s step 13608/19560 | loss 3.394466 (+1.80z)| norm 0.2470 (-0.40z)| lr 1.86e-04 | 4175.68 ms | 32.3% bf16 MFU | 125404 tok/s step 13609/19560 | loss 3.427894 (+2.62z)| norm 0.2560 (+0.21z)| lr 1.86e-04 | 4169.31 ms | 32.4% bf16 MFU | 125421 tok/s step 13610/19560 | loss 3.394470 (+1.70z)| norm 0.2552 (+0.15z)| lr 1.86e-04 | 4174.30 ms | 32.3% bf16 MFU | 125430 tok/s step 13611/19560 | loss 3.318246 (-0.29z)| norm 0.2468 (-0.42z)| lr 1.86e-04 | 4177.21 ms | 32.3% bf16 MFU | 125434 tok/s step 13612/19560 | loss 3.379366 (+1.33z)| norm 0.2611 (+0.56z)| lr 1.86e-04 | 4174.13 ms | 32.3% bf16 MFU | 125442 tok/s step 13613/19560 | loss 3.330070 (+0.03z)| norm 0.2429 (-0.69z)| lr 1.86e-04 | 4186.96 ms | 32.2% bf16 MFU | 125431 tok/s step 13614/19560 | loss 3.452073 (+3.14z)| norm 0.2855 (+2.18z)| lr 1.86e-04 | 4202.56 ms | 32.1% bf16 MFU | 125397 tok/s step 13615/19560 | loss 3.320165 (-0.24z)| norm 0.2592 (+0.41z)| lr 1.86e-04 | 4171.88 ms | 32.4% bf16 MFU | 125411 tok/s step 13616/19560 | loss 3.335409 
(+0.15z)| norm 0.2613 (+0.53z)| lr 1.86e-04 | 4222.10 ms | 32.0% bf16 MFU | 125349 tok/s step 13617/19560 | loss 3.343770 (+0.38z)| norm 0.2608 (+0.51z)| lr 1.86e-04 | 4184.91 ms | 32.3% bf16 MFU | 125346 tok/s step 13618/19560 | loss 3.405068 (+1.92z)| norm 0.2479 (-0.37z)| lr 1.86e-04 | 4179.81 ms | 32.3% bf16 MFU | 125350 tok/s step 13619/19560 | loss 3.301967 (-0.70z)| norm 0.2444 (-0.60z)| lr 1.86e-04 | 4166.05 ms | 32.4% bf16 MFU | 125375 tok/s step 13620/19560 | loss 3.379264 (+1.26z)| norm 0.2392 (-0.96z)| lr 1.86e-04 | 4166.00 ms | 32.4% bf16 MFU | 125399 tok/s step 13621/19560 | loss 3.348537 (+0.47z)| norm 0.2474 (-0.39z)| lr 1.86e-04 | 4180.36 ms | 32.3% bf16 MFU | 125400 tok/s step 13622/19560 | loss 3.273018 (-1.42z)| norm 0.2635 (+0.71z)| lr 1.86e-04 | 4227.56 ms | 31.9% bf16 MFU | 125331 tok/s step 13623/19560 | loss 3.323040 (-0.16z)| norm 0.2478 (-0.36z)| lr 1.86e-04 | 4173.90 ms | 32.3% bf16 MFU | 125345 tok/s step 13624/19560 | loss 3.340178 (+0.27z)| norm 0.2708 (+1.27z)| lr 1.86e-04 | 4180.06 ms | 32.3% bf16 MFU | 125349 tok/s step 13625/19560 | loss 3.379845 (+1.25z)| norm 0.2518 (-0.06z)| lr 1.85e-04 | 4200.81 ms | 32.1% bf16 MFU | 125322 tok/s step 13626/19560 | loss 3.294168 (-0.89z)| norm 0.2722 (+1.38z)| lr 1.85e-04 | 4175.28 ms | 32.3% bf16 MFU | 125334 tok/s step 13627/19560 | loss 3.328372 (-0.02z)| norm 0.2261 (-1.85z)| lr 1.85e-04 | 4172.21 ms | 32.4% bf16 MFU | 125350 tok/s step 13628/19560 | loss 3.342539 (+0.34z)| norm 0.2613 (+0.68z)| lr 1.85e-04 | 4172.86 ms | 32.4% bf16 MFU | 125365 tok/s step 13629/19560 | loss 3.382273 (+1.32z)| norm 0.2619 (+0.71z)| lr 1.85e-04 | 4210.56 ms | 32.1% bf16 MFU | 125323 tok/s step 13630/19560 | loss 3.335425 (+0.14z)| norm 0.2550 (+0.22z)| lr 1.85e-04 | 4167.68 ms | 32.4% bf16 MFU | 125346 tok/s step 13631/19560 | loss 3.388119 (+1.48z)| norm 0.2473 (-0.34z)| lr 1.85e-04 | 4209.41 ms | 32.1% bf16 MFU | 125307 tok/s step 13632/19560 | loss 3.404328 (+1.84z)| norm 0.2421 (-0.72z)| lr 1.85e-04 | 
4174.83 ms | 32.3% bf16 MFU | 125321 tok/s step 13633/19560 | loss 3.371572 (+1.01z)| norm 0.2667 (+1.18z)| lr 1.85e-04 | 4162.68 ms | 32.4% bf16 MFU | 125352 tok/s step 13634/19560 | loss 3.263569 (-1.66z)| norm 0.2484 (-0.24z)| lr 1.85e-04 | 4176.24 ms | 32.3% bf16 MFU | 125361 tok/s step 13635/19560 | loss 3.335503 (+0.12z)| norm 0.2521 (+0.06z)| lr 1.85e-04 | 4177.57 ms | 32.3% bf16 MFU | 125368 tok/s step 13636/19560 | loss 3.302314 (-0.69z)| norm 0.2442 (-0.55z)| lr 1.85e-04 | 4178.24 ms | 32.3% bf16 MFU | 125374 tok/s step 13637/19560 | loss 3.279532 (-1.27z)| norm 0.2367 (-1.12z)| lr 1.85e-04 | 4173.48 ms | 32.4% bf16 MFU | 125386 tok/s step 13638/19560 | loss 3.323305 (-0.19z)| norm 0.2572 (+0.48z)| lr 1.85e-04 | 4163.99 ms | 32.4% bf16 MFU | 125413 tok/s step 13639/19560 | loss 3.298282 (-0.80z)| norm 0.2259 (-1.94z)| lr 1.85e-04 | 4164.94 ms | 32.4% bf16 MFU | 125436 tok/s step 13640/19560 | loss 3.363221 (+0.81z)| norm 0.2533 (+0.18z)| lr 1.85e-04 | 4187.15 ms | 32.2% bf16 MFU | 125425 tok/s step 13641/19560 | loss 3.288300 (-1.04z)| norm 0.2175 (-2.54z)| lr 1.85e-04 | 4177.65 ms | 32.3% bf16 MFU | 125429 tok/s step 13642/19560 | loss 3.394310 (+1.57z)| norm 0.2761 (+1.88z)| lr 1.85e-04 | 4182.97 ms | 32.3% bf16 MFU | 125424 tok/s step 13643/19560 | loss 3.370131 (+0.95z)| norm 0.2430 (-0.63z)| lr 1.84e-04 | 4168.04 ms | 32.4% bf16 MFU | 125442 tok/s step 13644/19560 | loss 3.295591 (-0.88z)| norm 0.2576 (+0.48z)| lr 1.84e-04 | 4193.63 ms | 32.2% bf16 MFU | 125421 tok/s step 13645/19560 | loss 3.372101 (+1.00z)| norm 0.2490 (-0.17z)| lr 1.84e-04 | 4767.87 ms | 28.3% bf16 MFU | 124648 tok/s step 13646/19560 | loss 3.345379 (+0.34z)| norm 0.2445 (-0.51z)| lr 1.84e-04 | 4723.70 ms | 28.6% bf16 MFU | 123965 tok/s step 13647/19560 | loss 3.349136 (+0.43z)| norm 0.2588 (+0.58z)| lr 1.84e-04 | 4506.12 ms | 30.0% bf16 MFU | 123585 tok/s step 13648/19560 | loss 3.411118 (+1.92z)| norm 0.2499 (-0.09z)| lr 1.84e-04 | 4598.75 ms | 29.4% bf16 MFU | 123106 tok/s step 
13649/19560 | loss 3.321444 (-0.27z)| norm 0.2601 (+0.67z)| lr 1.84e-04 | 4298.04 ms | 31.4% bf16 MFU | 123050 tok/s step 13650/19560 | loss 3.410586 (+1.86z)| norm 0.2486 (-0.20z)| lr 1.84e-04 | 4226.07 ms | 31.9% bf16 MFU | 123100 tok/s step 13651/19560 | loss 3.373525 (+0.95z)| norm 0.2430 (-0.63z)| lr 1.84e-04 | 4379.49 ms | 30.8% bf16 MFU | 122931 tok/s step 13652/19560 | loss 3.308743 (-0.61z)| norm 0.2542 (+0.22z)| lr 1.84e-04 | 4491.37 ms | 30.1% bf16 MFU | 122621 tok/s step 13653/19560 | loss 3.345896 (+0.29z)| norm 0.3057 (+3.90z)| lr 1.84e-04 | 4380.83 ms | 30.8% bf16 MFU | 122474 tok/s step 13654/19560 | loss 3.318068 (-0.39z)| norm 0.2405 (-0.81z)| lr 1.84e-04 | 4286.00 ms | 31.5% bf16 MFU | 122466 tok/s step 13655/19560 | loss 3.419523 (+2.04z)| norm 0.2498 (-0.15z)| lr 1.84e-04 | 4565.11 ms | 29.6% bf16 MFU | 122085 tok/s step 13656/19560 | loss 3.380688 (+1.09z)| norm 0.2692 (+1.25z)| lr 1.84e-04 | 4339.41 ms | 31.1% bf16 MFU | 122022 tok/s step 13657/19560 | loss 3.372418 (+0.89z)| norm 0.2466 (-0.39z)| lr 1.84e-04 | 4173.06 ms | 32.4% bf16 MFU | 122203 tok/s step 13658/19560 | loss 3.367543 (+0.76z)| norm 0.2510 (-0.05z)| lr 1.84e-04 | 4257.51 ms | 31.7% bf16 MFU | 122250 tok/s step 13659/19560 | loss 3.324826 (-0.25z)| norm 0.2611 (+0.67z)| lr 1.84e-04 | 4278.92 ms | 31.6% bf16 MFU | 122264 tok/s step 13660/19560 | loss 3.358236 (+0.54z)| norm 0.2414 (-0.76z)| lr 1.83e-04 | 4180.40 ms | 32.3% bf16 MFU | 122421 tok/s step 13661/19560 | loss 3.496510 (+3.61z)| norm 0.2518 (+0.00z)| lr 1.83e-04 | 4289.39 ms | 31.5% bf16 MFU | 122412 tok/s step 13662/19560 | loss 3.343813 (+0.17z)| norm 0.2450 (-0.49z)| lr 1.83e-04 | 4212.79 ms | 32.0% bf16 MFU | 122514 tok/s step 13663/19560 | loss 3.432163 (+2.11z)| norm 0.2436 (-0.59z)| lr 1.83e-04 | 4201.08 ms | 32.1% bf16 MFU | 122628 tok/s step 13664/19560 | loss 3.358215 (+0.46z)| norm 0.2464 (-0.38z)| lr 1.83e-04 | 4371.62 ms | 30.9% bf16 MFU | 122493 tok/s step 13665/19560 | loss 3.324793 (-0.29z)| norm 
0.2488 (-0.19z)| lr 1.83e-04 | 4167.48 ms | 32.4% bf16 MFU | 122659 tok/s step 13666/19560 | loss 3.313554 (-0.53z)| norm 0.2380 (-1.00z)| lr 1.83e-04 | 4175.89 ms | 32.3% bf16 MFU | 122803 tok/s step 13667/19560 | loss 3.343626 (+0.14z)| norm 0.2594 (+0.60z)| lr 1.83e-04 | 4202.72 ms | 32.1% bf16 MFU | 122901 tok/s step 13668/19560 | loss 3.353679 (+0.35z)| norm 0.2465 (-0.37z)| lr 1.83e-04 | 4184.18 ms | 32.3% bf16 MFU | 123021 tok/s step 13669/19560 | loss 3.278879 (-1.32z)| norm 0.2657 (+1.06z)| lr 1.83e-04 | 4174.13 ms | 32.3% bf16 MFU | 123150 tok/s step 13670/19560 | loss 3.314403 (-0.52z)| norm 0.2481 (-0.26z)| lr 1.83e-04 | 4182.71 ms | 32.3% bf16 MFU | 123260 tok/s step 13671/19560 | loss 3.311018 (-0.59z)| norm 0.2448 (-0.50z)| lr 1.83e-04 | 4306.17 ms | 31.4% bf16 MFU | 123184 tok/s step 13672/19560 | loss 3.319721 (-0.39z)| norm 0.2535 (+0.15z)| lr 1.83e-04 | 4175.05 ms | 32.3% bf16 MFU | 123304 tok/s step 13673/19560 | loss 3.331235 (-0.14z)| norm 0.2348 (-1.24z)| lr 1.83e-04 | 4204.79 ms | 32.1% bf16 MFU | 123373 tok/s step 13674/19560 | loss 3.321125 (-0.37z)| norm 0.2424 (-0.66z)| lr 1.83e-04 | 4181.84 ms | 32.3% bf16 MFU | 123473 tok/s step 13675/19560 | loss 3.300575 (-0.84z)| norm 0.2463 (-0.38z)| lr 1.83e-04 | 4187.53 ms | 32.2% bf16 MFU | 123560 tok/s step 13676/19560 | loss 3.360677 (+0.52z)| norm 0.2376 (-1.01z)| lr 1.83e-04 | 4214.22 ms | 32.0% bf16 MFU | 123602 tok/s step 13677/19560 | loss 3.307605 (-0.67z)| norm 0.2360 (-1.12z)| lr 1.83e-04 | 4174.97 ms | 32.3% bf16 MFU | 123701 tok/s step 13678/19560 | loss 3.361646 (+0.55z)| norm 0.2507 (-0.05z)| lr 1.82e-04 | 4268.35 ms | 31.6% bf16 MFU | 123657 tok/s step 13679/19560 | loss 3.306205 (-0.70z)| norm 0.2367 (-1.08z)| lr 1.82e-04 | 4169.93 ms | 32.4% bf16 MFU | 123761 tok/s step 13680/19560 | loss 3.340031 (+0.06z)| norm 0.2477 (-0.27z)| lr 1.82e-04 | 4295.41 ms | 31.4% bf16 MFU | 123676 tok/s step 13681/19560 | loss 3.317418 (-0.46z)| norm 0.2474 (-0.28z)| lr 1.82e-04 | 4178.19 ms | 
32.3% bf16 MFU | 123766 tok/s step 13682/19560 | loss 3.315484 (-0.50z)| norm 0.2394 (-0.89z)| lr 1.82e-04 | 4186.99 ms | 32.2% bf16 MFU | 123839 tok/s step 13683/19560 | loss 3.426846 (+1.99z)| norm 0.2555 (+0.32z)| lr 1.82e-04 | 4186.42 ms | 32.3% bf16 MFU | 123909 tok/s step 13684/19560 | loss 3.363400 (+0.56z)| norm 0.2417 (-0.71z)| lr 1.82e-04 | 4339.34 ms | 31.1% bf16 MFU | 123754 tok/s step 13685/19560 | loss 3.317552 (-0.46z)| norm 0.2475 (-0.27z)| lr 1.82e-04 | 4286.81 ms | 31.5% bf16 MFU | 123682 tok/s step 13686/19560 | loss 3.366025 (+0.65z)| norm 0.2389 (-0.92z)| lr 1.82e-04 | 4177.84 ms | 32.3% bf16 MFU | 123772 tok/s step 13687/19560 | loss 3.320910 (-0.40z)| norm 0.2574 (+0.52z)| lr 1.82e-04 | 4307.77 ms | 31.3% bf16 MFU | 123669 tok/s step 13688/19560 | loss 3.363228 (+0.57z)| norm 0.2451 (-0.43z)| lr 1.82e-04 | 4210.22 ms | 32.1% bf16 MFU | 123712 tok/s step 13689/19560 | loss 3.302101 (-0.83z)| norm 0.2440 (-0.51z)| lr 1.82e-04 | 4254.15 ms | 31.7% bf16 MFU | 123688 tok/s step 13690/19560 | loss 3.340287 (+0.05z)| norm 0.2391 (-0.88z)| lr 1.82e-04 | 4172.97 ms | 32.4% bf16 MFU | 123786 tok/s step 13691/19560 | loss 3.319667 (-0.43z)| norm 0.2583 (+0.62z)| lr 1.82e-04 | 4183.17 ms | 32.3% bf16 MFU | 123863 tok/s step 13692/19560 | loss 3.326239 (-0.29z)| norm 0.2334 (-1.32z)| lr 1.82e-04 | 4304.18 ms | 31.4% bf16 MFU | 123761 tok/s step 13693/19560 | loss 3.320301 (-0.42z)| norm 0.2508 (+0.03z)| lr 1.82e-04 | 4181.28 ms | 32.3% bf16 MFU | 123842 tok/s step 13694/19560 | loss 3.392567 (+1.24z)| norm 0.2444 (-0.47z)| lr 1.82e-04 | 4219.52 ms | 32.0% bf16 MFU | 123863 tok/s step 13695/19560 | loss 3.294039 (-1.03z)| norm 0.2547 (+0.35z)| lr 1.82e-04 | 4185.04 ms | 32.3% bf16 MFU | 123933 tok/s step 13696/19560 | loss 3.393078 (+1.23z)| norm 0.2519 (+0.12z)| lr 1.81e-04 | 4277.59 ms | 31.6% bf16 MFU | 123865 tok/s step 13697/19560 | loss 3.318738 (-0.47z)| norm 0.2445 (-0.45z)| lr 1.81e-04 | 4186.78 ms | 32.2% bf16 MFU | 123933 tok/s step 13698/19560 
| loss 3.362667 (+0.52z)| norm 0.2705 (+1.58z)| lr 1.81e-04 | 4185.75 ms | 32.3% bf16 MFU | 123999 tok/s step 13699/19560 | loss 3.376325 (+0.83z)| norm 0.2603 (+0.78z)| lr 1.81e-04 | 4176.18 ms | 32.3% bf16 MFU | 124076 tok/s step 13700/19560 | loss 3.299895 (-0.93z)| norm 0.2396 (-0.86z)| lr 1.81e-04 | 4179.24 ms | 32.3% bf16 MFU | 124145 tok/s step 13701/19560 | loss 3.342622 (+0.05z)| norm 0.2704 (+1.56z)| lr 1.81e-04 | 4190.90 ms | 32.2% bf16 MFU | 124193 tok/s step 13702/19560 | loss 3.344418 (+0.09z)| norm 0.2326 (-1.40z)| lr 1.81e-04 | 4356.14 ms | 31.0% bf16 MFU | 124001 tok/s step 13703/19560 | loss 3.335845 (-0.12z)| norm 0.2485 (-0.18z)| lr 1.81e-04 | 4175.79 ms | 32.3% bf16 MFU | 124079 tok/s step 13704/19560 | loss 3.346574 (+0.12z)| norm 0.2437 (-0.55z)| lr 1.81e-04 | 4179.50 ms | 32.3% bf16 MFU | 124147 tok/s step 13705/19560 | loss 3.421074 (+1.81z)| norm 0.2596 (+0.70z)| lr 1.81e-04 | 4198.48 ms | 32.2% bf16 MFU | 124183 tok/s step 13706/19560 | loss 3.352207 (+0.22z)| norm 0.2568 (+0.46z)| lr 1.81e-04 | 4252.53 ms | 31.7% bf16 MFU | 124138 tok/s step 13707/19560 | loss 3.315618 (-0.62z)| norm 0.2461 (-0.38z)| lr 1.81e-04 | 4188.46 ms | 32.2% bf16 MFU | 124190 tok/s step 13708/19560 | loss 3.311339 (-0.71z)| norm 0.2688 (+1.40z)| lr 1.81e-04 | 4268.89 ms | 31.6% bf16 MFU | 124122 tok/s step 13709/19560 | loss 3.309287 (-0.78z)| norm 0.2650 (+1.09z)| lr 1.81e-04 | 4276.74 ms | 31.6% bf16 MFU | 124045 tok/s step 13710/19560 | loss 3.347105 (+0.10z)| norm 0.2624 (+0.88z)| lr 1.81e-04 | 4357.64 ms | 31.0% bf16 MFU | 123858 tok/s step 13711/19560 | loss 3.341000 (-0.05z)| norm 0.2434 (-0.60z)| lr 1.81e-04 | 4176.64 ms | 32.3% bf16 MFU | 123942 tok/s step 13712/19560 | loss 3.376667 (+0.77z)| norm 0.2359 (-1.20z)| lr 1.81e-04 | 4163.93 ms | 32.4% bf16 MFU | 124040 tok/s step 13713/19560 | loss 3.345332 (+0.05z)| norm 0.2490 (-0.15z)| lr 1.80e-04 | 4337.50 ms | 31.1% bf16 MFU | 123882 tok/s step 13714/19560 | loss 3.241564 (-2.33z)| norm 0.2383 (-1.00z)| 
lr 1.80e-04 | 4176.49 ms | 32.3% bf16 MFU | 123965 tok/s step 13715/19560 | loss 3.371437 (+0.66z)| norm 0.2662 (+1.21z)| lr 1.80e-04 | 4174.46 ms | 32.3% bf16 MFU | 124046 tok/s step 13716/19560 | loss 3.295439 (-1.08z)| norm 0.2319 (-1.49z)| lr 1.80e-04 | 4459.02 ms | 30.3% bf16 MFU | 123723 tok/s step 13717/19560 | loss 3.300664 (-0.96z)| norm 0.2509 (+0.01z)| lr 1.80e-04 | 4218.12 ms | 32.0% bf16 MFU | 123751 tok/s step 13718/19560 | loss 3.315566 (-0.60z)| norm 0.2458 (-0.38z)| lr 1.80e-04 | 4169.20 ms | 32.4% bf16 MFU | 123851 tok/s step 13719/19560 | loss 3.457793 (+2.63z)| norm 0.2649 (+1.17z)| lr 1.80e-04 | 4167.78 ms | 32.4% bf16 MFU | 123949 tok/s step 13720/19560 | loss 3.446976 (+2.32z)| norm 0.2514 (+0.07z)| lr 1.80e-04 | 4183.49 ms | 32.3% bf16 MFU | 124017 tok/s step 13721/19560 | loss 3.306312 (-0.84z)| norm 0.2457 (-0.40z)| lr 1.80e-04 | 4220.02 ms | 32.0% bf16 MFU | 124028 tok/s step 13722/19560 | loss 3.387392 (+0.97z)| norm 0.2636 (+1.05z)| lr 1.80e-04 | 4172.08 ms | 32.4% bf16 MFU | 124110 tok/s step 13723/19560 | loss 3.270079 (-1.63z)| norm 0.2591 (+0.69z)| lr 1.80e-04 | 4169.47 ms | 32.4% bf16 MFU | 124192 tok/s step 13724/19560 | loss 3.375168 (+0.71z)| norm 0.2517 (+0.09z)| lr 1.80e-04 | 4259.74 ms | 31.7% bf16 MFU | 124136 tok/s step 13725/19560 | loss 3.440327 (+2.13z)| norm 0.2793 (+2.29z)| lr 1.80e-04 | 4176.88 ms | 32.3% bf16 MFU | 124206 tok/s step 13726/19560 | loss 3.304437 (-0.86z)| norm 0.2467 (-0.31z)| lr 1.80e-04 | 4221.37 ms | 32.0% bf16 MFU | 124205 tok/s step 13727/19560 | loss 3.335450 (-0.19z)| norm 0.2576 (+0.55z)| lr 1.80e-04 | 4181.37 ms | 32.3% bf16 MFU | 124264 tok/s step 13728/19560 | loss 3.355889 (+0.25z)| norm 0.2611 (+0.82z)| lr 1.80e-04 | 4176.02 ms | 32.3% bf16 MFU | 124328 tok/s step 13729/19560 | loss 3.378571 (+0.75z)| norm 0.2329 (-1.43z)| lr 1.80e-04 | 4187.24 ms | 32.2% bf16 MFU | 124373 tok/s step 13730/19560 | loss 3.400525 (+1.24z)| norm 0.2706 (+1.55z)| lr 1.80e-04 | 4181.39 ms | 32.3% bf16 MFU | 
124423 tok/s step 13731/19560 | loss 3.345876 (-0.02z)| norm 0.2609 (+0.77z)| lr 1.79e-04 | 4169.41 ms | 32.4% bf16 MFU | 124489 tok/s step 13732/19560 | loss 3.344152 (-0.06z)| norm 0.2615 (+0.81z)| lr 1.79e-04 | 4203.99 ms | 32.1% bf16 MFU | 124501 tok/s step 13733/19560 | loss 3.327287 (-0.44z)| norm 0.2712 (+1.56z)| lr 1.79e-04 | 4218.06 ms | 32.0% bf16 MFU | 124490 tok/s step 13734/19560 | loss 3.418375 (+1.61z)| norm 0.2449 (-0.54z)| lr 1.79e-04 | 4218.94 ms | 32.0% bf16 MFU | 124479 tok/s step 13735/19560 | loss 3.378969 (+0.71z)| norm 0.2520 (+0.04z)| lr 1.79e-04 | 4181.01 ms | 32.3% bf16 MFU | 124525 tok/s step 13736/19560 | loss 3.264508 (-1.85z)| norm 0.2370 (-1.16z)| lr 1.79e-04 | 4182.90 ms | 32.3% bf16 MFU | 124566 tok/s step 13737/19560 | loss 3.336277 (-0.23z)| norm 0.2720 (+1.63z)| lr 1.79e-04 | 4178.91 ms | 32.3% bf16 MFU | 124611 tok/s step 13738/19560 | loss 3.368347 (+0.51z)| norm 0.2404 (-0.87z)| lr 1.79e-04 | 4205.31 ms | 32.1% bf16 MFU | 124614 tok/s step 13739/19560 | loss 3.327189 (-0.43z)| norm 0.2676 (+1.27z)| lr 1.79e-04 | 4180.68 ms | 32.3% bf16 MFU | 124653 tok/s step 13740/19560 | loss 3.323298 (-0.51z)| norm 0.2618 (+0.81z)| lr 1.79e-04 | 4182.32 ms | 32.3% bf16 MFU | 124689 tok/s step 13741/19560 | loss 3.301258 (-1.01z)| norm 0.2723 (+1.60z)| lr 1.79e-04 | 4177.46 ms | 32.3% bf16 MFU | 124729 tok/s step 13742/19560 | loss 3.351447 (+0.16z)| norm 0.2696 (+1.43z)| lr 1.79e-04 | 4188.33 ms | 32.2% bf16 MFU | 124752 tok/s step 13743/19560 | loss 3.341786 (-0.07z)| norm 0.3139 (+4.53z)| lr 1.79e-04 | 4269.08 ms | 31.6% bf16 MFU | 124655 tok/s step 13744/19560 | loss 3.341968 (-0.07z)| norm 0.2784 (+1.90z)| lr 1.79e-04 | 4183.33 ms | 32.3% bf16 MFU | 124688 tok/s step 13745/19560 | loss 3.316026 (-0.67z)| norm 0.2593 (+0.51z)| lr 1.79e-04 | 4166.80 ms | 32.4% bf16 MFU | 124745 tok/s step 13746/19560 | loss 3.370017 (+0.61z)| norm 0.2588 (+0.47z)| lr 1.79e-04 | 4195.14 ms | 32.2% bf16 MFU | 124757 tok/s step 13747/19560 | loss 3.335410 
(-0.22z)| norm 0.2970 (+3.09z)| lr 1.79e-04 | 4292.35 ms | 31.5% bf16 MFU | 124626 tok/s step 13748/19560 | loss 3.335351 (-0.21z)| norm 0.2365 (-1.12z)| lr 1.79e-04 | 4186.44 ms | 32.3% bf16 MFU | 124657 tok/s step 13749/19560 | loss 3.347167 (+0.07z)| norm 0.2608 (+0.56z)| lr 1.78e-04 | 4195.11 ms | 32.2% bf16 MFU | 124673 tok/s step 13750/19560 | loss 3.457208 (+2.60z)| norm 0.2611 (+0.58z)| lr 1.78e-04 | 4170.97 ms | 32.4% bf16 MFU | 124724 tok/s val loss 3.303965 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating 
HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating 
HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2998/10042 = 0.298546 step 13751/19560 | loss 3.286899 (-1.36z)| norm 0.2529 (+0.00z)| lr 1.78e-04 | 4172.01 ms | 32.4% bf16 MFU | 124771 tok/s step 13752/19560 | loss 3.367748 (+0.51z)| norm 0.2575 (+0.34z)| lr 1.78e-04 | 4254.20 ms | 31.7% bf16 MFU | 124695 tok/s step 13753/19560 | loss 3.342647 (-0.06z)| norm 0.2619 (+0.64z)| lr 1.78e-04 | 4186.91 ms | 32.2% bf16 MFU | 124721 tok/s step 13754/19560 | loss 3.374805 (+0.67z)| norm 0.2490 (-0.25z)| lr 1.78e-04 | 4182.04 ms | 32.3% bf16 MFU | 124753 tok/s step 13755/19560 | loss 3.339988 (-0.14z)| norm 0.2573 (+0.32z)| lr 1.78e-04 | 4221.70 ms | 32.0% bf16 MFU | 124725 tok/s step 13756/19560 | loss 3.366859 (+0.48z)| norm 0.2474 (-0.38z)| lr 1.78e-04 | 4211.24 ms | 32.1% bf16 MFU | 124714 tok/s step 13757/19560 | loss 3.368671 (+0.53z)| norm 0.2871 (+2.39z)| lr 1.78e-04 | 4260.01 ms | 31.7% bf16 MFU | 124631 tok/s step 13758/19560 | loss 3.351418 (+0.12z)| norm 0.2459 (-0.49z)| lr 1.78e-04 | 4185.66 ms | 32.3% bf16 MFU | 124663 tok/s step 13759/19560 | loss 3.379561 (+0.78z)| norm 0.2901 (+2.52z)| lr 1.78e-04 | 4268.73 ms | 31.6% bf16 MFU | 124571 tok/s step 13760/19560 | loss 3.396357 (+1.18z)| norm 0.2768 (+1.58z)| lr 1.78e-04 | 4192.94 ms | 32.2% bf16 MFU | 124594 tok/s step 13761/19560 | loss 3.322924 (-0.53z)| norm 0.2540 (+0.04z)| lr 1.78e-04 | 4259.46 ms | 31.7% bf16 MFU | 124519 tok/s step 13762/19560 | loss 3.297450 (-1.15z)| norm 0.2669 (+0.91z)| lr 1.78e-04 | 4179.34 ms | 32.3% bf16 MFU | 124565 tok/s step 13763/19560 | loss 3.335917 (-0.24z)| norm 0.2470 (-0.43z)| lr 1.78e-04 | 4204.45 ms | 32.1% 
bf16 MFU | 124572 tok/s step 13764/19560 | loss 3.334530 (-0.28z)| norm 0.2738 (+1.35z)| lr 1.78e-04 | 4218.58 ms | 32.0% bf16 MFU | 124557 tok/s step 13765/19560 | loss 3.303758 (-1.02z)| norm 0.2577 (+0.26z)| lr 1.78e-04 | 4256.59 ms | 31.7% bf16 MFU | 124488 tok/s step 13766/19560 | loss 3.305861 (-0.96z)| norm 0.2633 (+0.63z)| lr 1.77e-04 | 4256.93 ms | 31.7% bf16 MFU | 124422 tok/s step 13767/19560 | loss 3.354159 (+0.18z)| norm 0.2438 (-0.70z)| lr 1.77e-04 | 4175.85 ms | 32.3% bf16 MFU | 124478 tok/s step 13768/19560 | loss 3.306561 (-0.95z)| norm 0.2608 (+0.46z)| lr 1.77e-04 | 4283.09 ms | 31.5% bf16 MFU | 124375 tok/s step 13769/19560 | loss 3.320067 (-0.64z)| norm 0.2497 (-0.33z)| lr 1.77e-04 | 4176.02 ms | 32.3% bf16 MFU | 124433 tok/s step 13770/19560 | loss 3.303034 (-1.03z)| norm 0.2737 (+1.35z)| lr 1.77e-04 | 4174.20 ms | 32.3% bf16 MFU | 124492 tok/s step 13771/19560 | loss 3.344491 (-0.03z)| norm 0.2521 (-0.16z)| lr 1.77e-04 | 4230.84 ms | 31.9% bf16 MFU | 124463 tok/s step 13772/19560 | loss 3.239414 (-2.50z)| norm 0.2606 (+0.44z)| lr 1.77e-04 | 4177.27 ms | 32.3% bf16 MFU | 124516 tok/s step 13773/19560 | loss 3.292072 (-1.23z)| norm 0.2651 (+0.74z)| lr 1.77e-04 | 4171.31 ms | 32.4% bf16 MFU | 124574 tok/s step 13774/19560 | loss 3.318075 (-0.62z)| norm 0.2453 (-0.65z)| lr 1.77e-04 | 4192.50 ms | 32.2% bf16 MFU | 124598 tok/s step 13775/19560 | loss 3.466544 (+2.76z)| norm 0.2797 (+1.73z)| lr 1.77e-04 | 4187.03 ms | 32.2% bf16 MFU | 124629 tok/s step 13776/19560 | loss 3.364685 (+0.45z)| norm 0.2593 (+0.31z)| lr 1.77e-04 | 4206.79 ms | 32.1% bf16 MFU | 124629 tok/s step 13777/19560 | loss 3.387500 (+0.96z)| norm 0.2689 (+0.97z)| lr 1.77e-04 | 4233.59 ms | 31.9% bf16 MFU | 124590 tok/s step 13778/19560 | loss 3.314319 (-0.70z)| norm 0.2593 (+0.30z)| lr 1.77e-04 | 4241.44 ms | 31.8% bf16 MFU | 124541 tok/s step 13779/19560 | loss 3.324737 (-0.45z)| norm 0.2407 (-0.98z)| lr 1.77e-04 | 4225.99 ms | 31.9% bf16 MFU | 124517 tok/s step 13780/19560 | loss 
3.348160 (+0.08z)| norm 0.2445 (-0.71z)| lr 1.77e-04 | 4580.95 ms | 29.5% bf16 MFU | 124013 tok/s step 13781/19560 | loss 3.370162 (+0.59z)| norm 0.2592 (+0.34z)| lr 1.77e-04 | 4262.20 ms | 31.7% bf16 MFU | 123963 tok/s step 13782/19560 | loss 3.348105 (+0.07z)| norm 0.2515 (-0.22z)| lr 1.77e-04 | 4195.49 ms | 32.2% bf16 MFU | 124013 tok/s step 13783/19560 | loss 3.405977 (+1.42z)| norm 0.2580 (+0.24z)| lr 1.77e-04 | 4176.48 ms | 32.3% bf16 MFU | 124089 tok/s step 13784/19560 | loss 3.328315 (-0.38z)| norm 0.2519 (-0.19z)| lr 1.76e-04 | 4180.83 ms | 32.3% bf16 MFU | 124155 tok/s step 13785/19560 | loss 3.340815 (-0.08z)| norm 0.2590 (+0.32z)| lr 1.76e-04 | 4178.59 ms | 32.3% bf16 MFU | 124221 tok/s step 13786/19560 | loss 3.378160 (+0.79z)| norm 0.2309 (-1.71z)| lr 1.76e-04 | 4173.76 ms | 32.3% bf16 MFU | 124290 tok/s step 13787/19560 | loss 3.348719 (+0.10z)| norm 0.2540 (-0.03z)| lr 1.76e-04 | 4198.35 ms | 32.2% bf16 MFU | 124320 tok/s step 13788/19560 | loss 3.294279 (-1.15z)| norm 0.2399 (-1.05z)| lr 1.76e-04 | 4192.59 ms | 32.2% bf16 MFU | 124356 tok/s step 13789/19560 | loss 3.303858 (-0.94z)| norm 0.2538 (-0.04z)| lr 1.76e-04 | 4172.84 ms | 32.4% bf16 MFU | 124421 tok/s step 13790/19560 | loss 3.324507 (-0.43z)| norm 0.2435 (-0.79z)| lr 1.76e-04 | 4195.20 ms | 32.2% bf16 MFU | 124448 tok/s step 13791/19560 | loss 3.338654 (-0.07z)| norm 0.2422 (-0.88z)| lr 1.76e-04 | 4458.97 ms | 30.3% bf16 MFU | 124105 tok/s step 13792/19560 | loss 3.326269 (-0.37z)| norm 0.2373 (-1.23z)| lr 1.76e-04 | 4274.16 ms | 31.6% bf16 MFU | 124033 tok/s step 13793/19560 | loss 3.282401 (-1.44z)| norm 0.2379 (-1.17z)| lr 1.76e-04 | 4174.59 ms | 32.3% bf16 MFU | 124111 tok/s step 13794/19560 | loss 3.389851 (+1.18z)| norm 0.2391 (-1.08z)| lr 1.76e-04 | 4186.52 ms | 32.3% bf16 MFU | 124167 tok/s step 13795/19560 | loss 3.353655 (+0.29z)| norm 0.2343 (-1.40z)| lr 1.76e-04 | 4179.23 ms | 32.3% bf16 MFU | 124231 tok/s step 13796/19560 | loss 3.383543 (+1.02z)| norm 0.2418 (-0.87z)| lr 
1.76e-04 | 4200.08 ms | 32.1% bf16 MFU | 124261 tok/s step 13797/19560 | loss 3.312721 (-0.73z)| norm 0.2382 (-1.10z)| lr 1.76e-04 | 4189.69 ms | 32.2% bf16 MFU | 124305 tok/s step 13798/19560 | loss 3.313590 (-0.70z)| norm 0.2374 (-1.15z)| lr 1.76e-04 | 4172.03 ms | 32.4% bf16 MFU | 124373 tok/s step 13799/19560 | loss 3.314284 (-0.69z)| norm 0.2419 (-0.83z)| lr 1.76e-04 | 4261.99 ms | 31.7% bf16 MFU | 124305 tok/s step 13800/19560 | loss 3.334180 (-0.20z)| norm 0.2330 (-1.43z)| lr 1.76e-04 | 4183.36 ms | 32.3% bf16 MFU | 124356 tok/s step 13801/19560 | loss 3.299111 (-1.05z)| norm 0.2553 (+0.12z)| lr 1.76e-04 | 4242.51 ms | 31.8% bf16 MFU | 124317 tok/s step 13802/19560 | loss 3.323684 (-0.45z)| norm 0.2295 (-1.68z)| lr 1.75e-04 | 4183.03 ms | 32.3% bf16 MFU | 124368 tok/s step 13803/19560 | loss 3.345446 (+0.07z)| norm 0.2779 (+1.66z)| lr 1.75e-04 | 4208.54 ms | 32.1% bf16 MFU | 124379 tok/s step 13804/19560 | loss 3.266565 (-1.83z)| norm 0.2422 (-0.81z)| lr 1.75e-04 | 4202.19 ms | 32.1% bf16 MFU | 124398 tok/s step 13805/19560 | loss 3.405849 (+1.53z)| norm 0.2808 (+1.83z)| lr 1.75e-04 | 4181.88 ms | 32.3% bf16 MFU | 124447 tok/s step 13806/19560 | loss 3.356345 (+0.34z)| norm 0.2665 (+0.84z)| lr 1.75e-04 | 4195.88 ms | 32.2% bf16 MFU | 124472 tok/s step 13807/19560 | loss 3.563650 (+4.81z)| norm 0.2726 (+1.23z)| lr 1.75e-04 | 4188.68 ms | 32.2% bf16 MFU | 124507 tok/s step 13808/19560 | loss 3.269993 (-1.61z)| norm 0.2735 (+1.28z)| lr 1.75e-04 | 4199.58 ms | 32.2% bf16 MFU | 124524 tok/s step 13809/19560 | loss 3.374994 (+0.66z)| norm 0.2630 (+0.55z)| lr 1.75e-04 | 4177.37 ms | 32.3% bf16 MFU | 124573 tok/s step 13810/19560 | loss 3.348232 (+0.08z)| norm 0.2687 (+0.92z)| lr 1.75e-04 | 4259.57 ms | 31.7% bf16 MFU | 124498 tok/s step 13811/19560 | loss 3.310526 (-0.73z)| norm 0.2454 (-0.66z)| lr 1.75e-04 | 4243.00 ms | 31.8% bf16 MFU | 124452 tok/s step 13812/19560 | loss 3.399960 (+1.22z)| norm 0.2963 (+2.70z)| lr 1.75e-04 | 4181.77 ms | 32.3% bf16 MFU | 124498 
tok/s step 13813/19560 | loss 3.368557 (+0.53z)| norm 0.2742 (+1.22z)| lr 1.75e-04 | 4339.64 ms | 31.1% bf16 MFU | 124314 tok/s step 13814/19560 | loss 3.346357 (+0.05z)| norm 0.2640 (+0.53z)| lr 1.75e-04 | 4213.63 ms | 32.0% bf16 MFU | 124319 tok/s step 13815/19560 | loss 3.373632 (+0.63z)| norm 0.2758 (+1.30z)| lr 1.75e-04 | 4180.61 ms | 32.3% bf16 MFU | 124374 tok/s step 13816/19560 | loss 3.415284 (+1.52z)| norm 0.2659 (+0.64z)| lr 1.75e-04 | 4228.04 ms | 31.9% bf16 MFU | 124355 tok/s step 13817/19560 | loss 3.438993 (+1.99z)| norm 0.2841 (+1.80z)| lr 1.75e-04 | 4266.82 ms | 31.6% bf16 MFU | 124281 tok/s step 13818/19560 | loss 3.328210 (-0.38z)| norm 0.2725 (+1.02z)| lr 1.75e-04 | 4256.08 ms | 31.7% bf16 MFU | 124226 tok/s step 13819/19560 | loss 3.366240 (+0.42z)| norm 0.2607 (+0.25z)| lr 1.75e-04 | 4185.84 ms | 32.3% bf16 MFU | 124278 tok/s step 13820/19560 | loss 3.350944 (+0.09z)| norm 0.2494 (-0.50z)| lr 1.74e-04 | 4178.51 ms | 32.3% bf16 MFU | 124338 tok/s step 13821/19560 | loss 3.468539 (+2.53z)| norm 0.2753 (+1.19z)| lr 1.74e-04 | 4184.01 ms | 32.3% bf16 MFU | 124386 tok/s step 13822/19560 | loss 3.297937 (-1.03z)| norm 0.2759 (+1.21z)| lr 1.74e-04 | 4203.11 ms | 32.1% bf16 MFU | 124404 tok/s step 13823/19560 | loss 3.369188 (+0.45z)| norm 0.2484 (-0.58z)| lr 1.74e-04 | 4190.49 ms | 32.2% bf16 MFU | 124439 tok/s step 13824/19560 | loss 3.325063 (-0.46z)| norm 0.2569 (-0.03z)| lr 1.74e-04 | 4236.34 ms | 31.9% bf16 MFU | 124405 tok/s step 13825/19560 | loss 3.336945 (-0.22z)| norm 0.2615 (+0.26z)| lr 1.74e-04 | 4185.85 ms | 32.3% bf16 MFU | 124448 tok/s step 13826/19560 | loss 3.362317 (+0.32z)| norm 0.2470 (-0.67z)| lr 1.74e-04 | 4194.07 ms | 32.2% bf16 MFU | 124475 tok/s step 13827/19560 | loss 3.302608 (-0.93z)| norm 0.2565 (-0.05z)| lr 1.74e-04 | 4195.18 ms | 32.2% bf16 MFU | 124500 tok/s step 13828/19560 | loss 3.357223 (+0.21z)| norm 0.2615 (+0.27z)| lr 1.74e-04 | 4181.06 ms | 32.3% bf16 MFU | 124545 tok/s step 13829/19560 | loss 3.336619 
(-0.22z)| norm 0.2418 (-1.01z)| lr 1.74e-04 | 4190.90 ms | 32.2% bf16 MFU | 124573 tok/s step 13830/19560 | loss 3.359406 (+0.26z)| norm 0.2678 (+0.68z)| lr 1.74e-04 | 4163.81 ms | 32.4% bf16 MFU | 124640 tok/s step 13831/19560 | loss 3.328093 (-0.40z)| norm 0.2588 (+0.08z)| lr 1.74e-04 | 4176.49 ms | 32.3% bf16 MFU | 124685 tok/s step 13832/19560 | loss 3.317278 (-0.62z)| norm 0.2496 (-0.54z)| lr 1.74e-04 | 4178.82 ms | 32.3% bf16 MFU | 124724 tok/s step 13833/19560 | loss 3.322798 (-0.50z)| norm 0.2663 (+0.57z)| lr 1.74e-04 | 4426.34 ms | 30.5% bf16 MFU | 124410 tok/s step 13834/19560 | loss 3.333723 (-0.26z)| norm 0.2517 (-0.39z)| lr 1.74e-04 | 4164.92 ms | 32.4% bf16 MFU | 124483 tok/s step 13835/19560 | loss 3.383298 (+0.78z)| norm 0.2662 (+0.56z)| lr 1.74e-04 | 4193.21 ms | 32.2% bf16 MFU | 124511 tok/s step 13836/19560 | loss 3.245624 (-2.10z)| norm 0.2543 (-0.22z)| lr 1.74e-04 | 4714.22 ms | 28.6% bf16 MFU | 123846 tok/s step 13837/19560 | loss 3.338483 (-0.16z)| norm 0.2413 (-1.07z)| lr 1.74e-04 | 4742.91 ms | 28.5% bf16 MFU | 123181 tok/s step 13838/19560 | loss 3.241357 (-2.14z)| norm 0.2486 (-0.58z)| lr 1.73e-04 | 4294.90 ms | 31.4% bf16 MFU | 123125 tok/s step 13839/19560 | loss 3.317349 (-0.57z)| norm 0.2536 (-0.26z)| lr 1.73e-04 | 4326.99 ms | 31.2% bf16 MFU | 123027 tok/s step 13840/19560 | loss 3.367337 (+0.46z)| norm 0.2582 (+0.04z)| lr 1.73e-04 | 4504.90 ms | 30.0% bf16 MFU | 122695 tok/s step 13841/19560 | loss 3.262672 (-1.66z)| norm 0.2865 (+1.89z)| lr 1.73e-04 | 4182.93 ms | 32.3% bf16 MFU | 122827 tok/s step 13842/19560 | loss 3.287124 (-1.19z)| norm 0.2441 (-0.92z)| lr 1.73e-04 | 4343.42 ms | 31.1% bf16 MFU | 122721 tok/s step 13843/19560 | loss 3.294055 (-1.03z)| norm 0.2684 (+0.69z)| lr 1.73e-04 | 4275.94 ms | 31.6% bf16 MFU | 122716 tok/s step 13844/19560 | loss 3.316096 (-0.58z)| norm 0.2757 (+1.16z)| lr 1.73e-04 | 4218.82 ms | 32.0% bf16 MFU | 122794 tok/s step 13845/19560 | loss 3.292037 (-1.07z)| norm 0.2727 (+0.94z)| lr 1.73e-04 | 
4233.09 ms | 31.9% bf16 MFU | 122847 tok/s step 13846/19560 | loss 3.272502 (-1.46z)| norm 0.2630 (+0.29z)| lr 1.73e-04 | 4173.92 ms | 32.3% bf16 MFU | 122985 tok/s step 13847/19560 | loss 3.357782 (+0.30z)| norm 0.2514 (-0.48z)| lr 1.73e-04 | 4171.62 ms | 32.4% bf16 MFU | 123120 tok/s step 13848/19560 | loss 3.319078 (-0.49z)| norm 0.2537 (-0.32z)| lr 1.73e-04 | 4301.76 ms | 31.4% bf16 MFU | 123058 tok/s step 13849/19560 | loss 3.317980 (-0.52z)| norm 0.2617 (+0.20z)| lr 1.73e-04 | 4181.80 ms | 32.3% bf16 MFU | 123174 tok/s step 13850/19560 | loss 3.297257 (-0.94z)| norm 0.2574 (-0.09z)| lr 1.73e-04 | 4192.73 ms | 32.2% bf16 MFU | 123267 tok/s step 13851/19560 | loss 3.260950 (-1.71z)| norm 0.2572 (-0.09z)| lr 1.73e-04 | 4203.84 ms | 32.1% bf16 MFU | 123340 tok/s step 13852/19560 | loss 3.298402 (-0.90z)| norm 0.2636 (+0.32z)| lr 1.73e-04 | 4234.02 ms | 31.9% bf16 MFU | 123364 tok/s step 13853/19560 | loss 3.341623 (+0.03z)| norm 0.2409 (-1.18z)| lr 1.73e-04 | 4202.34 ms | 32.1% bf16 MFU | 123434 tok/s step 13854/19560 | loss 3.287001 (-1.14z)| norm 0.2400 (-1.23z)| lr 1.73e-04 | 4221.61 ms | 32.0% bf16 MFU | 123472 tok/s step 13855/19560 | loss 3.246449 (-1.96z)| norm 0.2508 (-0.50z)| lr 1.73e-04 | 4264.02 ms | 31.7% bf16 MFU | 123446 tok/s step 13856/19560 | loss 3.391265 (+1.09z)| norm 0.2390 (-1.27z)| lr 1.72e-04 | 4177.29 ms | 32.3% bf16 MFU | 123549 tok/s step 13857/19560 | loss 3.299079 (-0.84z)| norm 0.2442 (-0.94z)| lr 1.72e-04 | 4456.62 ms | 30.3% bf16 MFU | 123254 tok/s step 13858/19560 | loss 3.255131 (-1.73z)| norm 0.2517 (-0.43z)| lr 1.72e-04 | 4175.60 ms | 32.3% bf16 MFU | 123369 tok/s step 13859/19560 | loss 3.291575 (-0.96z)| norm 0.2421 (-1.06z)| lr 1.72e-04 | 4188.26 ms | 32.2% bf16 MFU | 123460 tok/s step 13860/19560 | loss 3.241894 (-1.95z)| norm 0.2544 (-0.23z)| lr 1.72e-04 | 4168.57 ms | 32.4% bf16 MFU | 123575 tok/s step 13861/19560 | loss 3.249218 (-1.76z)| norm 0.2520 (-0.39z)| lr 1.72e-04 | 4187.91 ms | 32.2% bf16 MFU | 123656 tok/s step 
13862/19560 | loss 3.271611 (-1.29z)| norm 0.2456 (-0.81z)| lr 1.72e-04 | 4178.13 ms | 32.3% bf16 MFU | 123747 tok/s step 13863/19560 | loss 3.339515 (+0.10z)| norm 0.2454 (-0.82z)| lr 1.72e-04 | 4180.50 ms | 32.3% bf16 MFU | 123831 tok/s step 13864/19560 | loss 3.349828 (+0.30z)| norm 0.2530 (-0.33z)| lr 1.72e-04 | 4203.71 ms | 32.1% bf16 MFU | 123875 tok/s step 13865/19560 | loss 3.257940 (-1.57z)| norm 0.2501 (-0.51z)| lr 1.72e-04 | 4172.42 ms | 32.4% bf16 MFU | 123964 tok/s step 13866/19560 | loss 3.349719 (+0.31z)| norm 0.2616 (+0.25z)| lr 1.72e-04 | 4201.87 ms | 32.1% bf16 MFU | 124005 tok/s step 13867/19560 | loss 3.374387 (+0.81z)| norm 0.2456 (-0.81z)| lr 1.72e-04 | 4254.40 ms | 31.7% bf16 MFU | 123966 tok/s step 13868/19560 | loss 3.336754 (+0.04z)| norm 0.2655 (+0.53z)| lr 1.72e-04 | 4390.84 ms | 30.7% bf16 MFU | 123738 tok/s step 13869/19560 | loss 3.353177 (+0.36z)| norm 0.2606 (+0.21z)| lr 1.72e-04 | 4196.79 ms | 32.2% bf16 MFU | 123798 tok/s step 13870/19560 | loss 3.271317 (-1.29z)| norm 0.2447 (-0.86z)| lr 1.72e-04 | 4182.70 ms | 32.3% bf16 MFU | 123875 tok/s step 13871/19560 | loss 3.299987 (-0.70z)| norm 0.2557 (-0.09z)| lr 1.72e-04 | 4305.54 ms | 31.4% bf16 MFU | 123770 tok/s step 13872/19560 | loss 3.250779 (-1.66z)| norm 0.2373 (-1.40z)| lr 1.72e-04 | 4346.50 ms | 31.1% bf16 MFU | 123612 tok/s step 13873/19560 | loss 3.289365 (-0.88z)| norm 0.2570 (+0.03z)| lr 1.72e-04 | 4200.42 ms | 32.1% bf16 MFU | 123673 tok/s step 13874/19560 | loss 3.328136 (-0.10z)| norm 0.2625 (+0.42z)| lr 1.71e-04 | 4197.19 ms | 32.2% bf16 MFU | 123735 tok/s step 13875/19560 | loss 3.270271 (-1.24z)| norm 0.2606 (+0.32z)| lr 1.71e-04 | 4176.39 ms | 32.3% bf16 MFU | 123825 tok/s step 13876/19560 | loss 3.298483 (-0.67z)| norm 0.2519 (-0.35z)| lr 1.71e-04 | 4175.96 ms | 32.3% bf16 MFU | 123911 tok/s step 13877/19560 | loss 3.260796 (-1.40z)| norm 0.2490 (-0.55z)| lr 1.71e-04 | 4181.49 ms | 32.3% bf16 MFU | 123985 tok/s step 13878/19560 | loss 3.237896 (-1.84z)| norm 
0.2482 (-0.61z)| lr 1.71e-04 | 4177.19 ms | 32.3% bf16 MFU | 124061 tok/s step 13879/19560 | loss 3.331857 (+0.03z)| norm 0.2604 (+0.31z)| lr 1.71e-04 | 4189.37 ms | 32.2% bf16 MFU | 124115 tok/s step 13880/19560 | loss 3.318815 (-0.22z)| norm 0.2648 (+0.63z)| lr 1.71e-04 | 4192.01 ms | 32.2% bf16 MFU | 124163 tok/s step 13881/19560 | loss 3.298369 (-0.63z)| norm 0.2524 (-0.29z)| lr 1.71e-04 | 4170.00 ms | 32.4% bf16 MFU | 124241 tok/s step 13882/19560 | loss 3.298708 (-0.61z)| norm 0.2663 (+0.74z)| lr 1.71e-04 | 4168.05 ms | 32.4% bf16 MFU | 124319 tok/s step 13883/19560 | loss 3.314726 (-0.28z)| norm 0.2570 (+0.04z)| lr 1.71e-04 | 4242.52 ms | 31.8% bf16 MFU | 124282 tok/s step 13884/19560 | loss 3.270077 (-1.16z)| norm 0.2558 (-0.05z)| lr 1.71e-04 | 4177.06 ms | 32.3% bf16 MFU | 124343 tok/s step 13885/19560 | loss 3.298822 (-0.57z)| norm 0.2622 (+0.45z)| lr 1.71e-04 | 4180.39 ms | 32.3% bf16 MFU | 124397 tok/s step 13886/19560 | loss 3.321020 (-0.12z)| norm 0.2311 (-1.91z)| lr 1.71e-04 | 4255.37 ms | 31.7% bf16 MFU | 124337 tok/s step 13887/19560 | loss 3.286050 (-0.81z)| norm 0.2683 (+0.95z)| lr 1.71e-04 | 4191.80 ms | 32.2% bf16 MFU | 124374 tok/s step 13888/19560 | loss 3.283887 (-0.84z)| norm 0.2465 (-0.73z)| lr 1.71e-04 | 4189.95 ms | 32.2% bf16 MFU | 124412 tok/s step 13889/19560 | loss 3.325790 (+0.00z)| norm 0.2563 (+0.04z)| lr 1.71e-04 | 4175.09 ms | 32.3% bf16 MFU | 124470 tok/s step 13890/19560 | loss 3.378217 (+1.05z)| norm 0.2598 (+0.32z)| lr 1.71e-04 | 4172.07 ms | 32.4% bf16 MFU | 124530 tok/s step 13891/19560 | loss 3.237270 (-1.75z)| norm 0.2524 (-0.27z)| lr 1.71e-04 | 4177.69 ms | 32.3% bf16 MFU | 124578 tok/s step 13892/19560 | loss 3.261139 (-1.26z)| norm 0.2497 (-0.47z)| lr 1.70e-04 | 4190.25 ms | 32.2% bf16 MFU | 124605 tok/s step 13893/19560 | loss 3.297767 (-0.54z)| norm 0.2523 (-0.25z)| lr 1.70e-04 | 4165.68 ms | 32.4% bf16 MFU | 124668 tok/s step 13894/19560 | loss 3.267334 (-1.13z)| norm 0.2295 (-2.01z)| lr 1.70e-04 | 4183.93 ms | 
32.3% bf16 MFU | 124700 tok/s step 13895/19560 | loss 3.267148 (-1.11z)| norm 0.2338 (-1.66z)| lr 1.70e-04 | 4254.28 ms | 31.7% bf16 MFU | 124627 tok/s step 13896/19560 | loss 3.307740 (-0.32z)| norm 0.2286 (-2.02z)| lr 1.70e-04 | 4186.13 ms | 32.3% bf16 MFU | 124658 tok/s step 13897/19560 | loss 3.313375 (-0.21z)| norm 0.2312 (-1.78z)| lr 1.70e-04 | 4167.65 ms | 32.4% bf16 MFU | 124715 tok/s step 13898/19560 | loss 3.283656 (-0.78z)| norm 0.2390 (-1.17z)| lr 1.70e-04 | 4194.04 ms | 32.2% bf16 MFU | 124730 tok/s step 13899/19560 | loss 3.261240 (-1.20z)| norm 0.2253 (-2.15z)| lr 1.70e-04 | 4181.42 ms | 32.3% bf16 MFU | 124762 tok/s step 13900/19560 | loss 3.192189 (-2.50z)| norm 0.2606 (+0.47z)| lr 1.70e-04 | 4193.35 ms | 32.2% bf16 MFU | 124776 tok/s step 13901/19560 | loss 3.282839 (-0.76z)| norm 0.2348 (-1.42z)| lr 1.70e-04 | 4183.96 ms | 32.3% bf16 MFU | 124802 tok/s step 13902/19560 | loss 3.357917 (+0.67z)| norm 0.2603 (+0.45z)| lr 1.70e-04 | 4180.45 ms | 32.3% bf16 MFU | 124833 tok/s step 13903/19560 | loss 3.347230 (+0.50z)| norm 0.2368 (-1.27z)| lr 1.70e-04 | 4184.07 ms | 32.3% bf16 MFU | 124857 tok/s step 13904/19560 | loss 3.310885 (-0.21z)| norm 0.2541 (+0.02z)| lr 1.70e-04 | 4221.54 ms | 32.0% bf16 MFU | 124823 tok/s step 13905/19560 | loss 3.234412 (-1.69z)| norm 0.2499 (-0.29z)| lr 1.70e-04 | 4174.69 ms | 32.3% bf16 MFU | 124862 tok/s step 13906/19560 | loss 3.257145 (-1.23z)| norm 0.2469 (-0.50z)| lr 1.70e-04 | 4185.00 ms | 32.3% bf16 MFU | 124882 tok/s step 13907/19560 | loss 3.286492 (-0.65z)| norm 0.2420 (-0.87z)| lr 1.70e-04 | 4178.25 ms | 32.3% bf16 MFU | 124912 tok/s step 13908/19560 | loss 3.270697 (-0.94z)| norm 0.2422 (-0.85z)| lr 1.70e-04 | 4172.67 ms | 32.4% bf16 MFU | 124949 tok/s step 13909/19560 | loss 3.321326 (+0.05z)| norm 0.2473 (-0.46z)| lr 1.70e-04 | 4196.33 ms | 32.2% bf16 MFU | 124949 tok/s step 13910/19560 | loss 3.332091 (+0.27z)| norm 0.2703 (+1.24z)| lr 1.69e-04 | 4183.85 ms | 32.3% bf16 MFU | 124967 tok/s step 13911/19560 
| loss 3.266202 (-1.01z)| norm 0.2487 (-0.36z)| lr 1.69e-04 | 4181.46 ms | 32.3% bf16 MFU | 124988 tok/s step 13912/19560 | loss 3.251486 (-1.28z)| norm 0.2469 (-0.49z)| lr 1.69e-04 | 4228.17 ms | 31.9% bf16 MFU | 124938 tok/s step 13913/19560 | loss 3.289083 (-0.54z)| norm 0.2636 (+0.74z)| lr 1.69e-04 | 4181.70 ms | 32.3% bf16 MFU | 124960 tok/s step 13914/19560 | loss 3.322293 (+0.12z)| norm 0.2642 (+0.77z)| lr 1.69e-04 | 4177.48 ms | 32.3% bf16 MFU | 124987 tok/s step 13915/19560 | loss 3.263718 (-1.01z)| norm 0.2687 (+1.10z)| lr 1.69e-04 | 4193.15 ms | 32.2% bf16 MFU | 124990 tok/s step 13916/19560 | loss 3.348917 (+0.65z)| norm 0.2781 (+1.76z)| lr 1.69e-04 | 4183.16 ms | 32.3% bf16 MFU | 125007 tok/s step 13917/19560 | loss 3.330314 (+0.28z)| norm 0.2844 (+2.17z)| lr 1.69e-04 | 4191.60 ms | 32.2% bf16 MFU | 125011 tok/s step 13918/19560 | loss 3.273079 (-0.83z)| norm 0.2463 (-0.59z)| lr 1.69e-04 | 4164.25 ms | 32.4% bf16 MFU | 125055 tok/s step 13919/19560 | loss 3.287997 (-0.53z)| norm 0.2667 (+0.87z)| lr 1.69e-04 | 4167.30 ms | 32.4% bf16 MFU | 125093 tok/s step 13920/19560 | loss 3.365705 (+0.98z)| norm 0.2549 (+0.00z)| lr 1.69e-04 | 4163.07 ms | 32.4% bf16 MFU | 125135 tok/s step 13921/19560 | loss 3.273726 (-0.81z)| norm 0.2623 (+0.53z)| lr 1.69e-04 | 4194.32 ms | 32.2% bf16 MFU | 125128 tok/s step 13922/19560 | loss 3.226678 (-1.70z)| norm 0.2539 (-0.09z)| lr 1.69e-04 | 4177.47 ms | 32.3% bf16 MFU | 125147 tok/s step 13923/19560 | loss 3.290147 (-0.46z)| norm 0.2731 (+1.31z)| lr 1.69e-04 | 4175.25 ms | 32.3% bf16 MFU | 125168 tok/s step 13924/19560 | loss 3.285853 (-0.53z)| norm 0.2893 (+2.43z)| lr 1.69e-04 | 4168.03 ms | 32.4% bf16 MFU | 125199 tok/s step 13925/19560 | loss 3.376368 (+1.23z)| norm 0.2677 (+0.85z)| lr 1.69e-04 | 4182.89 ms | 32.3% bf16 MFU | 125206 tok/s step 13926/19560 | loss 3.309801 (-0.07z)| norm 0.2781 (+1.58z)| lr 1.69e-04 | 4171.93 ms | 32.4% bf16 MFU | 125230 tok/s step 13927/19560 | loss 3.240068 (-1.40z)| norm 0.2498 (-0.48z)| 
lr 1.69e-04 | 4173.71 ms | 32.3% bf16 MFU | 125249 tok/s step 13928/19560 | loss 3.361797 (+0.94z)| norm 0.2923 (+2.54z)| lr 1.68e-04 | 4170.86 ms | 32.4% bf16 MFU | 125272 tok/s step 13929/19560 | loss 3.299398 (-0.26z)| norm 0.2541 (-0.20z)| lr 1.68e-04 | 4184.49 ms | 32.3% bf16 MFU | 125273 tok/s step 13930/19560 | loss 3.272089 (-0.78z)| norm 0.2864 (+2.09z)| lr 1.68e-04 | 4631.37 ms | 29.2% bf16 MFU | 124669 tok/s step 13931/19560 | loss 3.301827 (-0.20z)| norm 0.2307 (-1.87z)| lr 1.68e-04 | 4172.92 ms | 32.4% bf16 MFU | 124718 tok/s step 13932/19560 | loss 3.266748 (-0.87z)| norm 0.2594 (+0.16z)| lr 1.68e-04 | 4179.19 ms | 32.3% bf16 MFU | 124754 tok/s step 13933/19560 | loss 3.308974 (-0.05z)| norm 0.2480 (-0.64z)| lr 1.68e-04 | 4183.27 ms | 32.3% bf16 MFU | 124783 tok/s step 13934/19560 | loss 3.386036 (+1.44z)| norm 0.2466 (-0.73z)| lr 1.68e-04 | 4177.63 ms | 32.3% bf16 MFU | 124819 tok/s step 13935/19560 | loss 3.241225 (-1.45z)| norm 0.2613 (+0.34z)| lr 1.68e-04 | 4183.14 ms | 32.3% bf16 MFU | 124845 tok/s step 13936/19560 | loss 3.281061 (-0.60z)| norm 0.2286 (-1.98z)| lr 1.68e-04 | 4180.76 ms | 32.3% bf16 MFU | 124873 tok/s step 13937/19560 | loss 3.253071 (-1.18z)| norm 0.2452 (-0.78z)| lr 1.68e-04 | 4179.67 ms | 32.3% bf16 MFU | 124901 tok/s step 13938/19560 | loss 3.249659 (-1.24z)| norm 0.2497 (-0.45z)| lr 1.68e-04 | 4612.75 ms | 29.3% bf16 MFU | 124339 tok/s step 13939/19560 | loss 3.264937 (-0.90z)| norm 0.2268 (-2.06z)| lr 1.68e-04 | 4158.14 ms | 32.5% bf16 MFU | 124426 tok/s step 13940/19560 | loss 3.253505 (-1.13z)| norm 0.2440 (-0.83z)| lr 1.68e-04 | 4180.58 ms | 32.3% bf16 MFU | 124476 tok/s step 13941/19560 | loss 3.255906 (-1.06z)| norm 0.2388 (-1.20z)| lr 1.68e-04 | 4180.54 ms | 32.3% bf16 MFU | 124522 tok/s step 13942/19560 | loss 3.309035 (+0.09z)| norm 0.2309 (-1.74z)| lr 1.68e-04 | 4184.23 ms | 32.3% bf16 MFU | 124561 tok/s step 13943/19560 | loss 3.319853 (+0.34z)| norm 0.2311 (-1.69z)| lr 1.68e-04 | 4183.68 ms | 32.3% bf16 MFU | 
124599 tok/s step 13944/19560 | loss 3.234626 (-1.51z)| norm 0.2430 (-0.82z)| lr 1.68e-04 | 4179.91 ms | 32.3% bf16 MFU | 124641 tok/s step 13945/19560 | loss 3.344431 (+0.96z)| norm 0.2269 (-1.96z)| lr 1.68e-04 | 4184.01 ms | 32.3% bf16 MFU | 124674 tok/s step 13946/19560 | loss 3.317872 (+0.36z)| norm 0.2333 (-1.47z)| lr 1.67e-04 | 4191.56 ms | 32.2% bf16 MFU | 124694 tok/s step 13947/19560 | loss 3.386618 (+1.91z)| norm 0.2539 (+0.03z)| lr 1.67e-04 | 4185.07 ms | 32.3% bf16 MFU | 124723 tok/s step 13948/19560 | loss 3.368249 (+1.49z)| norm 0.2396 (-1.00z)| lr 1.67e-04 | 4172.22 ms | 32.4% bf16 MFU | 124770 tok/s step 13949/19560 | loss 3.403242 (+2.38z)| norm 0.2462 (-0.51z)| lr 1.67e-04 | 4167.12 ms | 32.4% bf16 MFU | 124823 tok/s step 13950/19560 | loss 3.261827 (-0.93z)| norm 0.2376 (-1.12z)| lr 1.67e-04 | 4175.93 ms | 32.3% bf16 MFU | 124859 tok/s step 13951/19560 | loss 3.290368 (-0.25z)| norm 0.2472 (-0.42z)| lr 1.67e-04 | 4171.87 ms | 32.4% bf16 MFU | 124900 tok/s step 13952/19560 | loss 3.295433 (-0.13z)| norm 0.2236 (-2.10z)| lr 1.67e-04 | 4178.50 ms | 32.3% bf16 MFU | 124928 tok/s step 13953/19560 | loss 3.262408 (-0.90z)| norm 0.2398 (-0.91z)| lr 1.67e-04 | 4187.25 ms | 32.2% bf16 MFU | 124942 tok/s step 13954/19560 | loss 3.313603 (+0.33z)| norm 0.2341 (-1.31z)| lr 1.67e-04 | 4286.39 ms | 31.5% bf16 MFU | 124811 tok/s step 13955/19560 | loss 3.266435 (-0.79z)| norm 0.2353 (-1.21z)| lr 1.67e-04 | 4294.39 ms | 31.4% bf16 MFU | 124675 tok/s step 13956/19560 | loss 3.321394 (+0.53z)| norm 0.2335 (-1.31z)| lr 1.67e-04 | 4267.19 ms | 31.6% bf16 MFU | 124584 tok/s step 13957/19560 | loss 3.282456 (-0.40z)| norm 0.2344 (-1.24z)| lr 1.67e-04 | 4175.77 ms | 32.3% bf16 MFU | 124633 tok/s step 13958/19560 | loss 3.299886 (+0.03z)| norm 0.2571 (+0.37z)| lr 1.67e-04 | 4302.33 ms | 31.4% bf16 MFU | 124494 tok/s step 13959/19560 | loss 3.360110 (+1.48z)| norm 0.2333 (-1.30z)| lr 1.67e-04 | 4195.88 ms | 32.2% bf16 MFU | 124517 tok/s step 13960/19560 | loss 3.334443 
(+0.85z)| norm 0.2772 (+1.76z)| lr 1.67e-04 | 4215.46 ms | 32.0% bf16 MFU | 124510 tok/s step 13961/19560 | loss 3.336313 (+0.89z)| norm 0.2661 (+0.99z)| lr 1.67e-04 | 4273.00 ms | 31.6% bf16 MFU | 124419 tok/s step 13962/19560 | loss 3.299872 (+0.03z)| norm 0.2791 (+1.86z)| lr 1.67e-04 | 4182.60 ms | 32.3% bf16 MFU | 124466 tok/s step 13963/19560 | loss 3.256875 (-0.99z)| norm 0.2727 (+1.40z)| lr 1.67e-04 | 4175.15 ms | 32.3% bf16 MFU | 124521 tok/s step 13964/19560 | loss 3.200499 (-2.32z)| norm 0.2340 (-1.22z)| lr 1.66e-04 | 4241.72 ms | 31.8% bf16 MFU | 124475 tok/s step 13965/19560 | loss 3.282409 (-0.35z)| norm 0.2643 (+0.82z)| lr 1.66e-04 | 4210.07 ms | 32.1% bf16 MFU | 124478 tok/s step 13966/19560 | loss 3.343744 (+1.11z)| norm 0.2560 (+0.25z)| lr 1.66e-04 | 4175.27 ms | 32.3% bf16 MFU | 124533 tok/s step 13967/19560 | loss 3.286828 (-0.26z)| norm 0.2887 (+2.40z)| lr 1.66e-04 | 4171.78 ms | 32.4% bf16 MFU | 124590 tok/s step 13968/19560 | loss 3.272354 (-0.59z)| norm 0.2663 (+0.91z)| lr 1.66e-04 | 4171.07 ms | 32.4% bf16 MFU | 124645 tok/s step 13969/19560 | loss 3.276383 (-0.50z)| norm 0.2518 (-0.03z)| lr 1.66e-04 | 4316.24 ms | 31.3% bf16 MFU | 124486 tok/s step 13970/19560 | loss 3.320922 (+0.58z)| norm 0.2604 (+0.54z)| lr 1.66e-04 | 4181.93 ms | 32.3% bf16 MFU | 124531 tok/s step 13971/19560 | loss 3.332820 (+0.86z)| norm 0.2520 (-0.02z)| lr 1.66e-04 | 4171.84 ms | 32.4% bf16 MFU | 124588 tok/s step 13972/19560 | loss 3.280078 (-0.41z)| norm 0.2562 (+0.28z)| lr 1.66e-04 | 4180.94 ms | 32.3% bf16 MFU | 124628 tok/s step 13973/19560 | loss 3.364671 (+1.61z)| norm 0.2390 (-0.89z)| lr 1.66e-04 | 4169.12 ms | 32.4% bf16 MFU | 124685 tok/s step 13974/19560 | loss 3.301459 (+0.08z)| norm 0.2581 (+0.43z)| lr 1.66e-04 | 4210.57 ms | 32.1% bf16 MFU | 124676 tok/s step 13975/19560 | loss 3.314330 (+0.41z)| norm 0.2505 (-0.09z)| lr 1.66e-04 | 4186.23 ms | 32.3% bf16 MFU | 124704 tok/s step 13976/19560 | loss 3.328118 (+0.74z)| norm 0.2409 (-0.75z)| lr 1.66e-04 | 
4168.89 ms | 32.4% bf16 MFU | 124757 tok/s step 13977/19560 | loss 3.251145 (-1.11z)| norm 0.2431 (-0.58z)| lr 1.66e-04 | 4177.86 ms | 32.3% bf16 MFU | 124794 tok/s step 13978/19560 | loss 3.234493 (-1.49z)| norm 0.2385 (-0.89z)| lr 1.66e-04 | 4172.17 ms | 32.4% bf16 MFU | 124837 tok/s step 13979/19560 | loss 3.268933 (-0.67z)| norm 0.2676 (+1.10z)| lr 1.66e-04 | 4175.81 ms | 32.3% bf16 MFU | 124873 tok/s step 13980/19560 | loss 3.357828 (+1.44z)| norm 0.2456 (-0.39z)| lr 1.66e-04 | 4162.19 ms | 32.4% bf16 MFU | 124928 tok/s step 13981/19560 | loss 3.311325 (+0.34z)| norm 0.2570 (+0.38z)| lr 1.66e-04 | 4189.38 ms | 32.2% bf16 MFU | 124939 tok/s step 13982/19560 | loss 3.290056 (-0.16z)| norm 0.2523 (+0.05z)| lr 1.65e-04 | 4185.21 ms | 32.3% bf16 MFU | 124955 tok/s step 13983/19560 | loss 3.327569 (+0.72z)| norm 0.2391 (-0.85z)| lr 1.65e-04 | 4177.84 ms | 32.3% bf16 MFU | 124982 tok/s step 13984/19560 | loss 3.313433 (+0.40z)| norm 0.2533 (+0.11z)| lr 1.65e-04 | 4165.58 ms | 32.4% bf16 MFU | 125026 tok/s step 13985/19560 | loss 3.323056 (+0.63z)| norm 0.2288 (-1.55z)| lr 1.65e-04 | 4173.04 ms | 32.4% bf16 MFU | 125057 tok/s step 13986/19560 | loss 3.285936 (-0.28z)| norm 0.2556 (+0.28z)| lr 1.65e-04 | 4168.03 ms | 32.4% bf16 MFU | 125093 tok/s step 13987/19560 | loss 3.265436 (-0.78z)| norm 0.2334 (-1.22z)| lr 1.65e-04 | 4169.75 ms | 32.4% bf16 MFU | 125126 tok/s step 13988/19560 | loss 3.378896 (+1.95z)| norm 0.2527 (+0.09z)| lr 1.65e-04 | 4253.97 ms | 31.7% bf16 MFU | 125032 tok/s step 13989/19560 | loss 3.325103 (+0.64z)| norm 0.2368 (-0.98z)| lr 1.65e-04 | 4186.79 ms | 32.2% bf16 MFU | 125041 tok/s step 13990/19560 | loss 3.300209 (+0.03z)| norm 0.2406 (-0.72z)| lr 1.65e-04 | 4180.16 ms | 32.3% bf16 MFU | 125060 tok/s step 13991/19560 | loss 3.313099 (+0.35z)| norm 0.2381 (-0.89z)| lr 1.65e-04 | 4219.84 ms | 32.0% bf16 MFU | 125019 tok/s step 13992/19560 | loss 3.259864 (-0.94z)| norm 0.2724 (+1.41z)| lr 1.65e-04 | 4182.84 ms | 32.3% bf16 MFU | 125036 tok/s step 
13993/19560 | loss 3.285556 (-0.32z)| norm 0.2401 (-0.75z)| lr 1.65e-04 | 4179.41 ms | 32.3% bf16 MFU | 125056 tok/s step 13994/19560 | loss 3.243701 (-1.32z)| norm 0.2315 (-1.30z)| lr 1.65e-04 | 4179.00 ms | 32.3% bf16 MFU | 125076 tok/s step 13995/19560 | loss 3.289925 (-0.17z)| norm 0.2402 (-0.72z)| lr 1.65e-04 | 4189.54 ms | 32.2% bf16 MFU | 125080 tok/s step 13996/19560 | loss 3.293067 (-0.09z)| norm 0.2360 (-0.98z)| lr 1.65e-04 | 4219.40 ms | 32.0% bf16 MFU | 125038 tok/s step 13997/19560 | loss 3.305601 (+0.24z)| norm 0.2385 (-0.80z)| lr 1.65e-04 | 4186.72 ms | 32.2% bf16 MFU | 125048 tok/s step 13998/19560 | loss 3.323508 (+0.68z)| norm 0.2365 (-0.93z)| lr 1.65e-04 | 4174.14 ms | 32.3% bf16 MFU | 125076 tok/s step 13999/19560 | loss 3.270676 (-0.65z)| norm 0.2273 (-1.51z)| lr 1.65e-04 | 4264.26 ms | 31.7% bf16 MFU | 124969 tok/s step 14000/19560 | loss 3.421196 (+3.01z)| norm 0.2324 (-1.17z)| lr 1.65e-04 | 4181.25 ms | 32.3% bf16 MFU | 124990 tok/s val loss 3.299355
HellaSwag: 2992/10042 = 0.297949 step 14001/19560 | loss 3.330611 (+0.79z)| norm 0.2301 (-1.30z)| lr 1.64e-04 | 4181.84 ms | 32.3% bf16 MFU | 125009 tok/s step 14002/19560 | loss 3.305654 (+0.19z)| norm 0.2451 (-0.32z)| lr 1.64e-04 | 4187.91 ms | 32.2% bf16 MFU | 125018 tok/s step 14003/19560 | loss 3.318443 (+0.49z)| norm 0.2424 (-0.48z)| lr 1.64e-04 | 4172.99 ms | 32.4% bf16 MFU | 125049 tok/s step 14004/19560 | loss 3.335297 (+0.89z)| norm 0.2442 (-0.36z)| lr 1.64e-04 | 4270.11 ms | 31.6% bf16 MFU | 124936 tok/s step 14005/19560 | loss 3.328248 (+0.71z)| norm 0.2408 (-0.57z)| lr 1.64e-04 | 4283.47 ms | 31.5% bf16 MFU | 124809 tok/s step 14006/19560 | loss 3.367338 (+1.63z)| norm 0.2473 (-0.16z)| lr 1.64e-04 | 4171.79 ms | 32.4% bf16 MFU | 124852 tok/s step 14007/19560 | loss 3.301445 (+0.04z)| norm 0.2446 (-0.32z)| lr 1.64e-04 | 4214.79 ms | 32.0% bf16 MFU | 124829 tok/s step 14008/19560 | loss 3.284776 (-0.36z)| norm 0.2423 (-0.46z)| lr 1.64e-04 | 4313.90 ms | 31.3% bf16 MFU | 124665 tok/s step 14009/19560 | loss 3.325346 (+0.62z)| norm 0.2260 (-1.50z)|
lr 1.64e-04 | 4183.45 ms | 32.3% bf16 MFU | 124698 tok/s step 14010/19560 | loss 3.296720 (-0.07z)| norm 0.2258 (-1.48z)| lr 1.64e-04 | 4171.71 ms | 32.4% bf16 MFU | 124747 tok/s step 14011/19560 | loss 3.262187 (-0.90z)| norm 0.2489 (+0.00z)| lr 1.64e-04 | 4249.15 ms | 31.8% bf16 MFU | 124679 tok/s step 14012/19560 | loss 3.302343 (+0.07z)| norm 0.2437 (-0.32z)| lr 1.64e-04 | 4194.69 ms | 32.2% bf16 MFU | 124694 tok/s step 14013/19560 | loss 3.280656 (-0.46z)| norm 0.2467 (-0.12z)| lr 1.64e-04 | 4165.16 ms | 32.4% bf16 MFU | 124753 tok/s step 14014/19560 | loss 3.326581 (+0.66z)| norm 0.2557 (+0.45z)| lr 1.64e-04 | 4164.13 ms | 32.4% bf16 MFU | 124811 tok/s step 14015/19560 | loss 3.318777 (+0.46z)| norm 0.2357 (-0.83z)| lr 1.64e-04 | 4177.15 ms | 32.3% bf16 MFU | 124846 tok/s step 14016/19560 | loss 3.250376 (-1.19z)| norm 0.2563 (+0.51z)| lr 1.64e-04 | 4319.82 ms | 31.3% bf16 MFU | 124672 tok/s step 14017/19560 | loss 3.208385 (-2.14z)| norm 0.2556 (+0.46z)| lr 1.64e-04 | 4171.93 ms | 32.4% bf16 MFU | 124722 tok/s step 14018/19560 | loss 3.321842 (+0.57z)| norm 0.2602 (+0.76z)| lr 1.64e-04 | 4182.32 ms | 32.3% bf16 MFU | 124754 tok/s step 14019/19560 | loss 3.224691 (-1.76z)| norm 0.2508 (+0.14z)| lr 1.63e-04 | 4169.35 ms | 32.4% bf16 MFU | 124803 tok/s step 14020/19560 | loss 3.298533 (+0.00z)| norm 0.2388 (-0.63z)| lr 1.63e-04 | 4213.46 ms | 32.0% bf16 MFU | 124785 tok/s step 14021/19560 | loss 3.329283 (+0.74z)| norm 0.2528 (+0.28z)| lr 1.63e-04 | 4282.37 ms | 31.5% bf16 MFU | 124667 tok/s step 14022/19560 | loss 3.270931 (-0.66z)| norm 0.2303 (-1.19z)| lr 1.63e-04 | 4188.50 ms | 32.2% bf16 MFU | 124692 tok/s step 14023/19560 | loss 3.315127 (+0.39z)| norm 0.2453 (-0.22z)| lr 1.63e-04 | 4177.40 ms | 32.3% bf16 MFU | 124733 tok/s step 14024/19560 | loss 3.267670 (-0.74z)| norm 0.2498 (+0.07z)| lr 1.63e-04 | 4268.30 ms | 31.6% bf16 MFU | 124638 tok/s step 14025/19560 | loss 3.304683 (+0.15z)| norm 0.2353 (-0.89z)| lr 1.63e-04 | 4245.37 ms | 31.8% bf16 MFU | 
124581 tok/s step 14026/19560 | loss 3.333315 (+0.82z)| norm 0.2480 (-0.06z)| lr 1.63e-04 | 4662.77 ms | 29.0% bf16 MFU | 123974 tok/s step 14027/19560 | loss 3.256490 (-1.02z)| norm 0.2355 (-0.89z)| lr 1.63e-04 | 4683.49 ms | 28.8% bf16 MFU | 123372 tok/s step 14028/19560 | loss 3.322589 (+0.55z)| norm 0.2427 (-0.40z)| lr 1.63e-04 | 4781.91 ms | 28.2% bf16 MFU | 122686 tok/s step 14029/19560 | loss 3.280395 (-0.48z)| norm 0.2346 (-0.95z)| lr 1.63e-04 | 4502.53 ms | 30.0% bf16 MFU | 122374 tok/s step 14030/19560 | loss 3.299588 (+0.00z)| norm 0.2487 (-0.00z)| lr 1.63e-04 | 4393.20 ms | 30.7% bf16 MFU | 122222 tok/s step 14031/19560 | loss 3.276404 (-0.56z)| norm 0.2413 (-0.50z)| lr 1.63e-04 | 4239.74 ms | 31.8% bf16 MFU | 122294 tok/s step 14032/19560 | loss 3.250667 (-1.18z)| norm 0.2581 (+0.62z)| lr 1.63e-04 | 4205.07 ms | 32.1% bf16 MFU | 122413 tok/s step 14033/19560 | loss 3.306870 (+0.20z)| norm 0.2351 (-0.91z)| lr 1.63e-04 | 4332.19 ms | 31.2% bf16 MFU | 122344 tok/s step 14034/19560 | loss 3.318764 (+0.48z)| norm 0.2495 (+0.05z)| lr 1.63e-04 | 4269.86 ms | 31.6% bf16 MFU | 122366 tok/s step 14035/19560 | loss 3.313408 (+0.34z)| norm 0.2506 (+0.12z)| lr 1.63e-04 | 4291.28 ms | 31.5% bf16 MFU | 122356 tok/s step 14036/19560 | loss 3.276854 (-0.57z)| norm 0.2510 (+0.15z)| lr 1.63e-04 | 4217.16 ms | 32.0% bf16 MFU | 122455 tok/s step 14037/19560 | loss 3.263456 (-0.90z)| norm 0.2358 (-0.86z)| lr 1.62e-04 | 4312.42 ms | 31.3% bf16 MFU | 122411 tok/s step 14038/19560 | loss 3.279569 (-0.48z)| norm 0.2274 (-1.40z)| lr 1.62e-04 | 4270.39 ms | 31.6% bf16 MFU | 122429 tok/s step 14039/19560 | loss 3.275847 (-0.58z)| norm 0.2275 (-1.37z)| lr 1.62e-04 | 4177.40 ms | 32.3% bf16 MFU | 122583 tok/s step 14040/19560 | loss 3.332486 (+0.83z)| norm 0.2616 (+0.87z)| lr 1.62e-04 | 4176.90 ms | 32.3% bf16 MFU | 122730 tok/s step 14041/19560 | loss 3.274039 (-0.64z)| norm 0.2322 (-1.04z)| lr 1.62e-04 | 4349.33 ms | 31.0% bf16 MFU | 122620 tok/s step 14042/19560 | loss 3.290563 
(-0.22z)| norm 0.2369 (-0.72z)| lr 1.62e-04 | 4196.65 ms | 32.2% bf16 MFU | 122736 tok/s step 14043/19560 | loss 3.307025 (+0.19z)| norm 0.2476 (-0.00z)| lr 1.62e-04 | 4199.01 ms | 32.2% bf16 MFU | 122842 tok/s step 14044/19560 | loss 3.241543 (-1.44z)| norm 0.2544 (+0.47z)| lr 1.62e-04 | 4175.68 ms | 32.3% bf16 MFU | 122978 tok/s step 14045/19560 | loss 3.324554 (+0.65z)| norm 0.2453 (-0.13z)| lr 1.62e-04 | 4193.85 ms | 32.2% bf16 MFU | 123080 tok/s step 14046/19560 | loss 3.303885 (+0.12z)| norm 0.2427 (-0.31z)| lr 1.62e-04 | 4202.52 ms | 32.1% bf16 MFU | 123163 tok/s step 14047/19560 | loss 3.318281 (+0.48z)| norm 0.2744 (+1.87z)| lr 1.62e-04 | 4274.78 ms | 31.6% bf16 MFU | 123138 tok/s step 14048/19560 | loss 3.324596 (+0.65z)| norm 0.2330 (-0.97z)| lr 1.62e-04 | 4171.62 ms | 32.4% bf16 MFU | 123265 tok/s step 14049/19560 | loss 3.299743 (+0.02z)| norm 0.2588 (+0.80z)| lr 1.62e-04 | 4170.23 ms | 32.4% bf16 MFU | 123388 tok/s step 14050/19560 | loss 3.286040 (-0.35z)| norm 0.2639 (+1.15z)| lr 1.62e-04 | 4176.81 ms | 32.3% bf16 MFU | 123494 tok/s step 14051/19560 | loss 3.273930 (-0.66z)| norm 0.2525 (+0.39z)| lr 1.62e-04 | 4165.97 ms | 32.4% bf16 MFU | 123612 tok/s step 14052/19560 | loss 3.300380 (+0.02z)| norm 0.2588 (+0.86z)| lr 1.62e-04 | 4171.19 ms | 32.4% bf16 MFU | 123716 tok/s step 14053/19560 | loss 3.294840 (-0.11z)| norm 0.2557 (+0.65z)| lr 1.62e-04 | 4223.02 ms | 32.0% bf16 MFU | 123738 tok/s step 14054/19560 | loss 3.335442 (+0.95z)| norm 0.2517 (+0.39z)| lr 1.62e-04 | 4245.53 ms | 31.8% bf16 MFU | 123726 tok/s step 14055/19560 | loss 3.259881 (-1.03z)| norm 0.2411 (-0.39z)| lr 1.62e-04 | 4253.22 ms | 31.7% bf16 MFU | 123703 tok/s step 14056/19560 | loss 3.272388 (-0.69z)| norm 0.2500 (+0.31z)| lr 1.61e-04 | 4186.37 ms | 32.3% bf16 MFU | 123779 tok/s step 14057/19560 | loss 3.296501 (-0.05z)| norm 0.2501 (+0.32z)| lr 1.61e-04 | 4168.46 ms | 32.4% bf16 MFU | 123879 tok/s step 14058/19560 | loss 3.301956 (+0.08z)| norm 0.2351 (-0.84z)| lr 1.61e-04 | 
4171.84 ms | 32.4% bf16 MFU | 123969 tok/s step 14059/19560 | loss 3.246472 (-1.37z)| norm 0.2500 (+0.34z)| lr 1.61e-04 | 4178.09 ms | 32.3% bf16 MFU | 124045 tok/s step 14060/19560 | loss 3.286120 (-0.33z)| norm 0.2361 (-0.76z)| lr 1.61e-04 | 4169.41 ms | 32.4% bf16 MFU | 124130 tok/s step 14061/19560 | loss 3.360274 (+1.61z)| norm 0.2459 (+0.03z)| lr 1.61e-04 | 4231.46 ms | 31.9% bf16 MFU | 124118 tok/s step 14062/19560 | loss 3.335783 (+0.99z)| norm 0.2564 (+0.87z)| lr 1.61e-04 | 4183.41 ms | 32.3% bf16 MFU | 124179 tok/s step 14063/19560 | loss 3.245241 (-1.42z)| norm 0.2483 (+0.23z)| lr 1.61e-04 | 4175.64 ms | 32.3% bf16 MFU | 124248 tok/s step 14064/19560 | loss 3.318021 (+0.51z)| norm 0.2437 (-0.16z)| lr 1.61e-04 | 4172.67 ms | 32.4% bf16 MFU | 124318 tok/s step 14065/19560 | loss 3.281018 (-0.48z)| norm 0.2539 (+0.66z)| lr 1.61e-04 | 4231.67 ms | 31.9% bf16 MFU | 124297 tok/s step 14066/19560 | loss 3.261650 (-1.01z)| norm 0.2361 (-0.77z)| lr 1.61e-04 | 4170.64 ms | 32.4% bf16 MFU | 124367 tok/s step 14067/19560 | loss 3.302420 (+0.08z)| norm 0.2663 (+1.65z)| lr 1.61e-04 | 4177.75 ms | 32.3% bf16 MFU | 124424 tok/s step 14068/19560 | loss 3.288606 (-0.30z)| norm 0.2316 (-1.14z)| lr 1.61e-04 | 4168.41 ms | 32.4% bf16 MFU | 124491 tok/s step 14069/19560 | loss 3.277970 (-0.60z)| norm 0.2501 (+0.34z)| lr 1.61e-04 | 4176.04 ms | 32.3% bf16 MFU | 124544 tok/s step 14070/19560 | loss 3.282775 (-0.46z)| norm 0.2371 (-0.72z)| lr 1.61e-04 | 4176.14 ms | 32.3% bf16 MFU | 124594 tok/s step 14071/19560 | loss 3.340935 (+1.11z)| norm 0.2407 (-0.44z)| lr 1.61e-04 | 4274.16 ms | 31.6% bf16 MFU | 124498 tok/s step 14072/19560 | loss 3.276218 (-0.65z)| norm 0.2312 (-1.19z)| lr 1.61e-04 | 4177.50 ms | 32.3% bf16 MFU | 124548 tok/s step 14073/19560 | loss 3.346613 (+1.27z)| norm 0.2347 (-0.92z)| lr 1.61e-04 | 4171.84 ms | 32.4% bf16 MFU | 124604 tok/s step 14074/19560 | loss 3.345722 (+1.23z)| norm 0.2392 (-0.56z)| lr 1.60e-04 | 4211.36 ms | 32.1% bf16 MFU | 124599 tok/s step 
14075/19560 | loss 3.259663 (-1.10z)| norm 0.2346 (-0.92z)| lr 1.60e-04 | 4206.94 ms | 32.1% bf16 MFU | 124600 tok/s step 14076/19560 | loss 3.330833 (+0.89z)| norm 0.2441 (-0.15z)| lr 1.60e-04 | 4176.26 ms | 32.3% bf16 MFU | 124647 tok/s step 14077/19560 | loss 3.335029 (+1.05z)| norm 0.2514 (+0.45z)| lr 1.60e-04 | 4180.09 ms | 32.3% bf16 MFU | 124686 tok/s step 14078/19560 | loss 3.366115 (+1.91z)| norm 0.2320 (-1.13z)| lr 1.60e-04 | 4180.32 ms | 32.3% bf16 MFU | 124722 tok/s step 14079/19560 | loss 3.339648 (+1.13z)| norm 0.2654 (+1.55z)| lr 1.60e-04 | 4176.99 ms | 32.3% bf16 MFU | 124762 tok/s step 14080/19560 | loss 3.341791 (+1.18z)| norm 0.2560 (+0.79z)| lr 1.60e-04 | 4235.51 ms | 31.9% bf16 MFU | 124713 tok/s step 14081/19560 | loss 3.258588 (-1.17z)| norm 0.2426 (-0.31z)| lr 1.60e-04 | 4171.72 ms | 32.4% bf16 MFU | 124761 tok/s step 14082/19560 | loss 3.294342 (-0.16z)| norm 0.2535 (+0.57z)| lr 1.60e-04 | 4186.64 ms | 32.2% bf16 MFU | 124785 tok/s step 14083/19560 | loss 3.218205 (-2.26z)| norm 0.2379 (-0.70z)| lr 1.60e-04 | 4176.09 ms | 32.3% bf16 MFU | 124823 tok/s step 14084/19560 | loss 3.275187 (-0.67z)| norm 0.2414 (-0.42z)| lr 1.60e-04 | 4292.82 ms | 31.5% bf16 MFU | 124688 tok/s step 14085/19560 | loss 3.253264 (-1.26z)| norm 0.2340 (-1.03z)| lr 1.60e-04 | 4187.29 ms | 32.2% bf16 MFU | 124714 tok/s step 14086/19560 | loss 3.262944 (-0.98z)| norm 0.2471 (+0.05z)| lr 1.60e-04 | 4177.92 ms | 32.3% bf16 MFU | 124753 tok/s step 14087/19560 | loss 3.288964 (-0.26z)| norm 0.2408 (-0.47z)| lr 1.60e-04 | 4173.95 ms | 32.3% bf16 MFU | 124796 tok/s step 14088/19560 | loss 3.316422 (+0.51z)| norm 0.2569 (+0.89z)| lr 1.60e-04 | 4168.49 ms | 32.4% bf16 MFU | 124845 tok/s step 14089/19560 | loss 3.333889 (+1.00z)| norm 0.2480 (+0.15z)| lr 1.60e-04 | 4181.60 ms | 32.3% bf16 MFU | 124872 tok/s step 14090/19560 | loss 3.267058 (-0.85z)| norm 0.2441 (-0.16z)| lr 1.60e-04 | 4165.92 ms | 32.4% bf16 MFU | 124921 tok/s step 14091/19560 | loss 3.326628 (+0.79z)| norm 
0.2347 (-0.99z)| lr 1.60e-04 | 4173.54 ms | 32.4% bf16 MFU | 124956 tok/s step 14092/19560 | loss 3.378249 (+2.22z)| norm 0.2471 (+0.11z)| lr 1.59e-04 | 4195.19 ms | 32.2% bf16 MFU | 124957 tok/s step 14093/19560 | loss 3.337821 (+1.06z)| norm 0.2349 (-0.97z)| lr 1.59e-04 | 4179.46 ms | 32.3% bf16 MFU | 124981 tok/s step 14094/19560 | loss 3.288174 (-0.32z)| norm 0.2562 (+0.96z)| lr 1.59e-04 | 4240.35 ms | 31.8% bf16 MFU | 124914 tok/s step 14095/19560 | loss 3.321913 (+0.62z)| norm 0.2391 (-0.59z)| lr 1.59e-04 | 4237.11 ms | 31.9% bf16 MFU | 124855 tok/s step 14096/19560 | loss 3.294200 (-0.17z)| norm 0.2481 (+0.30z)| lr 1.59e-04 | 4195.72 ms | 32.2% bf16 MFU | 124860 tok/s step 14097/19560 | loss 3.315697 (+0.43z)| norm 0.2566 (+1.13z)| lr 1.59e-04 | 4182.63 ms | 32.3% bf16 MFU | 124885 tok/s step 14098/19560 | loss 3.322019 (+0.61z)| norm 0.2314 (-1.32z)| lr 1.59e-04 | 4173.84 ms | 32.3% bf16 MFU | 124921 tok/s step 14099/19560 | loss 3.359472 (+1.66z)| norm 0.2554 (+1.04z)| lr 1.59e-04 | 4178.30 ms | 32.3% bf16 MFU | 124949 tok/s step 14100/19560 | loss 3.315668 (+0.42z)| norm 0.2318 (-1.26z)| lr 1.59e-04 | 4185.15 ms | 32.3% bf16 MFU | 124965 tok/s step 14101/19560 | loss 3.268112 (-0.91z)| norm 0.2227 (-2.10z)| lr 1.59e-04 | 4177.37 ms | 32.3% bf16 MFU | 124992 tok/s step 14102/19560 | loss 3.315549 (+0.43z)| norm 0.2415 (-0.28z)| lr 1.59e-04 | 4229.42 ms | 31.9% bf16 MFU | 124941 tok/s step 14103/19560 | loss 3.325982 (+0.73z)| norm 0.2250 (-1.83z)| lr 1.59e-04 | 4176.45 ms | 32.3% bf16 MFU | 124970 tok/s step 14104/19560 | loss 3.343288 (+1.21z)| norm 0.2299 (-1.35z)| lr 1.59e-04 | 4185.44 ms | 32.3% bf16 MFU | 124985 tok/s step 14105/19560 | loss 3.301018 (+0.00z)| norm 0.2441 (+0.00z)| lr 1.59e-04 | 4212.18 ms | 32.1% bf16 MFU | 124959 tok/s step 14106/19560 | loss 3.315522 (+0.41z)| norm 0.2286 (-1.46z)| lr 1.59e-04 | 4181.45 ms | 32.3% bf16 MFU | 124981 tok/s step 14107/19560 | loss 3.333853 (+0.92z)| norm 0.2288 (-1.43z)| lr 1.59e-04 | 4175.49 ms | 
32.3% bf16 MFU | 125010 tok/s step 14108/19560 | loss 3.299855 (-0.05z)| norm 0.2336 (-0.96z)| lr 1.59e-04 | 4172.25 ms | 32.4% bf16 MFU | 125042 tok/s step 14109/19560 | loss 3.350008 (+1.40z)| norm 0.2347 (-0.84z)| lr 1.59e-04 | 4203.45 ms | 32.1% bf16 MFU | 125027 tok/s step 14110/19560 | loss 3.279642 (-0.64z)| norm 0.2387 (-0.45z)| lr 1.59e-04 | 4184.44 ms | 32.3% bf16 MFU | 125040 tok/s step 14111/19560 | loss 3.359667 (+1.65z)| norm 0.2254 (-1.69z)| lr 1.58e-04 | 4168.47 ms | 32.4% bf16 MFU | 125077 tok/s step 14112/19560 | loss 3.307043 (+0.15z)| norm 0.2380 (-0.49z)| lr 1.58e-04 | 4179.61 ms | 32.3% bf16 MFU | 125095 tok/s step 14113/19560 | loss 3.312600 (+0.31z)| norm 0.2469 (+0.34z)| lr 1.58e-04 | 4174.31 ms | 32.3% bf16 MFU | 125120 tok/s step 14114/19560 | loss 3.275602 (-0.75z)| norm 0.2642 (+1.97z)| lr 1.58e-04 | 4175.39 ms | 32.3% bf16 MFU | 125142 tok/s step 14115/19560 | loss 3.264821 (-1.06z)| norm 0.2580 (+1.36z)| lr 1.58e-04 | 4171.75 ms | 32.4% bf16 MFU | 125169 tok/s step 14116/19560 | loss 3.312290 (+0.32z)| norm 0.2570 (+1.26z)| lr 1.58e-04 | 4190.65 ms | 32.2% bf16 MFU | 125166 tok/s step 14117/19560 | loss 3.292874 (-0.24z)| norm 0.2520 (+0.77z)| lr 1.58e-04 | 4170.42 ms | 32.4% bf16 MFU | 125194 tok/s step 14118/19560 | loss 3.294331 (-0.19z)| norm 0.2350 (-0.81z)| lr 1.58e-04 | 4189.71 ms | 32.2% bf16 MFU | 125191 tok/s step 14119/19560 | loss 3.272498 (-0.82z)| norm 0.2700 (+2.39z)| lr 1.58e-04 | 4179.47 ms | 32.3% bf16 MFU | 125203 tok/s step 14120/19560 | loss 3.331009 (+0.87z)| norm 0.2603 (+1.53z)| lr 1.58e-04 | 4175.67 ms | 32.3% bf16 MFU | 125221 tok/s step 14121/19560 | loss 3.340487 (+1.13z)| norm 0.2508 (+0.64z)| lr 1.58e-04 | 4172.51 ms | 32.4% bf16 MFU | 125243 tok/s step 14122/19560 | loss 3.283394 (-0.55z)| norm 0.2700 (+2.36z)| lr 1.58e-04 | 4180.68 ms | 32.3% bf16 MFU | 125251 tok/s step 14123/19560 | loss 3.331074 (+0.85z)| norm 0.2505 (+0.56z)| lr 1.58e-04 | 4189.04 ms | 32.2% bf16 MFU | 125246 tok/s step 14124/19560 
| loss 3.270906 (-0.91z)| norm 0.2713 (+2.39z)| lr 1.58e-04 | 4197.10 ms | 32.2% bf16 MFU | 125230 tok/s step 14125/19560 | loss 3.293788 (-0.24z)| norm 0.2483 (+0.33z)| lr 1.58e-04 | 4166.21 ms | 32.4% bf16 MFU | 125260 tok/s step 14126/19560 | loss 3.307878 (+0.18z)| norm 0.2761 (+2.72z)| lr 1.58e-04 | 4181.92 ms | 32.3% bf16 MFU | 125266 tok/s step 14127/19560 | loss 3.245880 (-1.63z)| norm 0.2512 (+0.54z)| lr 1.58e-04 | 4170.59 ms | 32.4% bf16 MFU | 125288 tok/s step 14128/19560 | loss 3.314653 (+0.42z)| norm 0.2628 (+1.52z)| lr 1.58e-04 | 4185.71 ms | 32.3% bf16 MFU | 125287 tok/s step 14129/19560 | loss 3.358221 (+1.73z)| norm 0.2542 (+0.76z)| lr 1.57e-04 | 4191.25 ms | 32.2% bf16 MFU | 125277 tok/s step 14130/19560 | loss 3.340098 (+1.17z)| norm 0.2404 (-0.45z)| lr 1.57e-04 | 4176.92 ms | 32.3% bf16 MFU | 125289 tok/s step 14131/19560 | loss 3.335964 (+1.04z)| norm 0.2539 (+0.73z)| lr 1.57e-04 | 4179.99 ms | 32.3% bf16 MFU | 125296 tok/s step 14132/19560 | loss 3.282579 (-0.56z)| norm 0.2511 (+0.47z)| lr 1.57e-04 | 4175.52 ms | 32.3% bf16 MFU | 125309 tok/s step 14133/19560 | loss 3.250738 (-1.49z)| norm 0.2368 (-0.78z)| lr 1.57e-04 | 4168.92 ms | 32.4% bf16 MFU | 125332 tok/s step 14134/19560 | loss 3.299937 (+0.00z)| norm 0.2639 (+1.57z)| lr 1.57e-04 | 4182.00 ms | 32.3% bf16 MFU | 125334 tok/s step 14135/19560 | loss 3.304372 (+0.13z)| norm 0.2407 (-0.44z)| lr 1.57e-04 | 4180.92 ms | 32.3% bf16 MFU | 125337 tok/s step 14136/19560 | loss 3.273788 (-0.79z)| norm 0.2315 (-1.22z)| lr 1.57e-04 | 4169.09 ms | 32.4% bf16 MFU | 125358 tok/s step 14137/19560 | loss 3.301762 (+0.06z)| norm 0.2676 (+1.86z)| lr 1.57e-04 | 4175.55 ms | 32.3% bf16 MFU | 125368 tok/s step 14138/19560 | loss 3.232942 (-1.99z)| norm 0.2388 (-0.63z)| lr 1.57e-04 | 4174.37 ms | 32.3% bf16 MFU | 125380 tok/s step 14139/19560 | loss 3.343687 (+1.31z)| norm 0.2465 (+0.04z)| lr 1.57e-04 | 4174.51 ms | 32.3% bf16 MFU | 125390 tok/s step 14140/19560 | loss 3.293523 (-0.19z)| norm 0.2583 (+1.05z)| 
lr 1.57e-04 | 4178.85 ms | 32.3% bf16 MFU | 125394 tok/s step 14141/19560 | loss 3.318376 (+0.55z)| norm 0.2495 (+0.28z)| lr 1.57e-04 | 4188.42 ms | 32.2% bf16 MFU | 125383 tok/s step 14142/19560 | loss 3.329493 (+0.88z)| norm 0.2479 (+0.15z)| lr 1.57e-04 | 4191.81 ms | 32.2% bf16 MFU | 125367 tok/s step 14143/19560 | loss 3.293884 (-0.18z)| norm 0.2558 (+0.83z)| lr 1.57e-04 | 4171.62 ms | 32.4% bf16 MFU | 125383 tok/s step 14144/19560 | loss 3.322321 (+0.66z)| norm 0.2416 (-0.39z)| lr 1.57e-04 | 4230.31 ms | 31.9% bf16 MFU | 125311 tok/s step 14145/19560 | loss 3.301608 (+0.01z)| norm 0.2432 (-0.25z)| lr 1.57e-04 | 4174.46 ms | 32.3% bf16 MFU | 125325 tok/s step 14146/19560 | loss 3.250113 (-1.56z)| norm 0.2423 (-0.32z)| lr 1.57e-04 | 4170.90 ms | 32.4% bf16 MFU | 125344 tok/s step 14147/19560 | loss 3.289070 (-0.38z)| norm 0.2369 (-0.79z)| lr 1.57e-04 | 4167.58 ms | 32.4% bf16 MFU | 125367 tok/s step 14148/19560 | loss 3.269113 (-1.00z)| norm 0.2321 (-1.19z)| lr 1.56e-04 | 4181.25 ms | 32.3% bf16 MFU | 125368 tok/s step 14149/19560 | loss 3.313155 (+0.39z)| norm 0.2547 (+0.78z)| lr 1.56e-04 | 4180.59 ms | 32.3% bf16 MFU | 125370 tok/s step 14150/19560 | loss 3.299676 (-0.04z)| norm 0.2488 (+0.25z)| lr 1.56e-04 | 4169.57 ms | 32.4% bf16 MFU | 125388 tok/s step 14151/19560 | loss 3.295939 (-0.15z)| norm 0.2307 (-1.31z)| lr 1.56e-04 | 4169.40 ms | 32.4% bf16 MFU | 125406 tok/s step 14152/19560 | loss 3.298867 (-0.07z)| norm 0.2337 (-1.04z)| lr 1.56e-04 | 4178.43 ms | 32.3% bf16 MFU | 125410 tok/s step 14153/19560 | loss 3.337415 (+1.14z)| norm 0.2565 (+0.92z)| lr 1.56e-04 | 4179.12 ms | 32.3% bf16 MFU | 125412 tok/s step 14154/19560 | loss 3.274816 (-0.83z)| norm 0.2372 (-0.74z)| lr 1.56e-04 | 4173.79 ms | 32.3% bf16 MFU | 125422 tok/s step 14155/19560 | loss 3.310699 (+0.30z)| norm 0.2429 (-0.25z)| lr 1.56e-04 | 4170.46 ms | 32.4% bf16 MFU | 125437 tok/s step 14156/19560 | loss 3.315415 (+0.45z)| norm 0.2531 (+0.62z)| lr 1.56e-04 | 4181.48 ms | 32.3% bf16 MFU | 
125434 tok/s step 14157/19560 | loss 3.346398 (+1.42z)| norm 0.2247 (-1.82z)| lr 1.56e-04 | 4205.39 ms | 32.1% bf16 MFU | 125396 tok/s step 14158/19560 | loss 3.285674 (-0.51z)| norm 0.2604 (+1.23z)| lr 1.56e-04 | 4172.14 ms | 32.4% bf16 MFU | 125409 tok/s step 14159/19560 | loss 3.285889 (-0.50z)| norm 0.2458 (-0.01z)| lr 1.56e-04 | 4175.18 ms | 32.3% bf16 MFU | 125417 tok/s step 14160/19560 | loss 3.254660 (-1.50z)| norm 0.2349 (-0.93z)| lr 1.56e-04 | 4173.87 ms | 32.3% bf16 MFU | 125427 tok/s step 14161/19560 | loss 3.258469 (-1.36z)| norm 0.2495 (+0.31z)| lr 1.56e-04 | 4206.87 ms | 32.1% bf16 MFU | 125387 tok/s step 14162/19560 | loss 3.307902 (+0.21z)| norm 0.2550 (+0.77z)| lr 1.56e-04 | 4232.08 ms | 31.9% bf16 MFU | 125312 tok/s step 14163/19560 | loss 3.332361 (+0.97z)| norm 0.2524 (+0.55z)| lr 1.56e-04 | 4178.16 ms | 32.3% bf16 MFU | 125321 tok/s step 14164/19560 | loss 3.403363 (+3.07z)| norm 0.2594 (+1.14z)| lr 1.56e-04 | 4178.95 ms | 32.3% bf16 MFU | 125327 tok/s step 14165/19560 | loss 3.294479 (-0.25z)| norm 0.2572 (+0.93z)| lr 1.56e-04 | 4174.36 ms | 32.3% bf16 MFU | 125341 tok/s step 14166/19560 | loss 3.319598 (+0.51z)| norm 0.2666 (+1.71z)| lr 1.56e-04 | 4169.43 ms | 32.4% bf16 MFU | 125361 tok/s step 14167/19560 | loss 3.206931 (-2.84z)| norm 0.2534 (+0.58z)| lr 1.55e-04 | 4167.37 ms | 32.4% bf16 MFU | 125384 tok/s step 14168/19560 | loss 3.294708 (-0.22z)| norm 0.2502 (+0.31z)| lr 1.55e-04 | 4171.13 ms | 32.4% bf16 MFU | 125399 tok/s step 14169/19560 | loss 3.301145 (-0.04z)| norm 0.2499 (+0.28z)| lr 1.55e-04 | 4180.76 ms | 32.3% bf16 MFU | 125399 tok/s step 14170/19560 | loss 3.303596 (+0.03z)| norm 0.2403 (-0.56z)| lr 1.55e-04 | 4184.27 ms | 32.3% bf16 MFU | 125394 tok/s step 14171/19560 | loss 3.304153 (+0.05z)| norm 0.2388 (-0.69z)| lr 1.55e-04 | 4175.87 ms | 32.3% bf16 MFU | 125402 tok/s step 14172/19560 | loss 3.326219 (+0.70z)| norm 0.2519 (+0.45z)| lr 1.55e-04 | 4171.14 ms | 32.4% bf16 MFU | 125417 tok/s step 14173/19560 | loss 3.287205 
(-0.47z)| norm 0.2707 (+2.03z)| lr 1.55e-04 | 4171.12 ms | 32.4% bf16 MFU | 125431 tok/s step 14174/19560 | loss 3.272562 (-0.90z)| norm 0.2538 (+0.58z)| lr 1.55e-04 | 4156.65 ms | 32.5% bf16 MFU | 125466 tok/s step 14175/19560 | loss 3.349563 (+1.40z)| norm 0.2730 (+2.23z)| lr 1.55e-04 | 4169.22 ms | 32.4% bf16 MFU | 125480 tok/s step 14176/19560 | loss 3.338972 (+1.08z)| norm 0.2795 (+2.69z)| lr 1.55e-04 | 4195.66 ms | 32.2% bf16 MFU | 125454 tok/s step 14177/19560 | loss 3.298882 (-0.12z)| norm 0.2591 (+0.99z)| lr 1.55e-04 | 4188.28 ms | 32.2% bf16 MFU | 125440 tok/s step 14178/19560 | loss 3.280439 (-0.67z)| norm 0.2515 (+0.36z)| lr 1.55e-04 | 4172.59 ms | 32.4% bf16 MFU | 125451 tok/s step 14179/19560 | loss 3.300882 (-0.07z)| norm 0.2805 (+2.70z)| lr 1.55e-04 | 4178.21 ms | 32.3% bf16 MFU | 125452 tok/s step 14180/19560 | loss 3.271432 (-0.94z)| norm 0.2567 (+0.76z)| lr 1.55e-04 | 4172.96 ms | 32.4% bf16 MFU | 125462 tok/s step 14181/19560 | loss 3.304791 (+0.06z)| norm 0.2551 (+0.63z)| lr 1.55e-04 | 4179.03 ms | 32.3% bf16 MFU | 125462 tok/s step 14182/19560 | loss 3.287465 (-0.45z)| norm 0.2769 (+2.35z)| lr 1.55e-04 | 4203.05 ms | 32.1% bf16 MFU | 125425 tok/s step 14183/19560 | loss 3.367697 (+1.91z)| norm 0.2833 (+2.75z)| lr 1.55e-04 | 4175.28 ms | 32.3% bf16 MFU | 125433 tok/s step 14184/19560 | loss 3.264920 (-1.14z)| norm 0.2772 (+2.22z)| lr 1.55e-04 | 4173.44 ms | 32.4% bf16 MFU | 125442 tok/s step 14185/19560 | loss 3.220333 (-2.39z)| norm 0.2648 (+1.25z)| lr 1.54e-04 | 4175.44 ms | 32.3% bf16 MFU | 125448 tok/s step 14186/19560 | loss 3.303262 (+0.01z)| norm 0.2804 (+2.36z)| lr 1.54e-04 | 4170.23 ms | 32.4% bf16 MFU | 125462 tok/s step 14187/19560 | loss 3.321782 (+0.54z)| norm 0.2524 (+0.28z)| lr 1.54e-04 | 4166.73 ms | 32.4% bf16 MFU | 125480 tok/s step 14188/19560 | loss 3.303208 (-0.01z)| norm 0.2513 (+0.19z)| lr 1.54e-04 | 4178.49 ms | 32.3% bf16 MFU | 125480 tok/s step 14189/19560 | loss 3.302895 (-0.00z)| norm 0.2602 (+0.84z)| lr 1.54e-04 | 
4167.30 ms | 32.4% bf16 MFU | 125496 tok/s step 14190/19560 | loss 3.278734 (-0.71z)| norm 0.2418 (-0.52z)| lr 1.54e-04 | 4179.50 ms | 32.3% bf16 MFU | 125494 tok/s step 14191/19560 | loss 3.298843 (-0.12z)| norm 0.2373 (-0.84z)| lr 1.54e-04 | 4176.66 ms | 32.3% bf16 MFU | 125495 tok/s step 14192/19560 | loss 3.243087 (-1.76z)| norm 0.2380 (-0.79z)| lr 1.54e-04 | 4173.19 ms | 32.4% bf16 MFU | 125502 tok/s step 14193/19560 | loss 3.277304 (-0.74z)| norm 0.2472 (-0.10z)| lr 1.54e-04 | 4169.08 ms | 32.4% bf16 MFU | 125515 tok/s step 14194/19560 | loss 3.259358 (-1.27z)| norm 0.2491 (+0.03z)| lr 1.54e-04 | 4169.67 ms | 32.4% bf16 MFU | 125526 tok/s step 14195/19560 | loss 3.257201 (-1.32z)| norm 0.2528 (+0.32z)| lr 1.54e-04 | 4171.40 ms | 32.4% bf16 MFU | 125534 tok/s step 14196/19560 | loss 3.320148 (+0.53z)| norm 0.2527 (+0.30z)| lr 1.54e-04 | 4181.11 ms | 32.3% bf16 MFU | 125527 tok/s step 14197/19560 | loss 3.277729 (-0.72z)| norm 0.2344 (-1.07z)| lr 1.54e-04 | 4177.25 ms | 32.3% bf16 MFU | 125526 tok/s step 14198/19560 | loss 3.345542 (+1.25z)| norm 0.2699 (+1.57z)| lr 1.54e-04 | 4201.06 ms | 32.1% bf16 MFU | 125490 tok/s step 14199/19560 | loss 3.334327 (+0.93z)| norm 0.2458 (-0.23z)| lr 1.54e-04 | 4168.19 ms | 32.4% bf16 MFU | 125505 tok/s step 14200/19560 | loss 3.315969 (+0.38z)| norm 0.2532 (+0.31z)| lr 1.54e-04 | 4182.91 ms | 32.3% bf16 MFU | 125496 tok/s step 14201/19560 | loss 3.328576 (+0.76z)| norm 0.2575 (+0.62z)| lr 1.54e-04 | 4176.88 ms | 32.3% bf16 MFU | 125498 tok/s step 14202/19560 | loss 3.252565 (-1.46z)| norm 0.2601 (+0.80z)| lr 1.54e-04 | 4175.54 ms | 32.3% bf16 MFU | 125501 tok/s step 14203/19560 | loss 3.217056 (-2.45z)| norm 0.2381 (-0.86z)| lr 1.54e-04 | 4174.78 ms | 32.3% bf16 MFU | 125505 tok/s step 14204/19560 | loss 3.355263 (+1.53z)| norm 0.2851 (+2.59z)| lr 1.53e-04 | 4172.09 ms | 32.4% bf16 MFU | 125513 tok/s step 14205/19560 | loss 3.330783 (+0.83z)| norm 0.2595 (+0.71z)| lr 1.53e-04 | 4179.79 ms | 32.3% bf16 MFU | 125509 tok/s step 
14206/19560 | loss 3.272455 (-0.83z)| norm 0.2653 (+1.12z)| lr 1.53e-04 | 4165.16 ms | 32.4% bf16 MFU | 125527 tok/s step 14207/19560 | loss 3.389521 (+2.50z)| norm 0.2858 (+2.56z)| lr 1.53e-04 | 4168.90 ms | 32.4% bf16 MFU | 125539 tok/s step 14208/19560 | loss 3.312259 (+0.31z)| norm 0.2436 (-0.47z)| lr 1.53e-04 | 4176.55 ms | 32.3% bf16 MFU | 125539 tok/s step 14209/19560 | loss 3.259407 (-1.20z)| norm 0.2765 (+1.85z)| lr 1.53e-04 | 4170.50 ms | 32.4% bf16 MFU | 125547 tok/s step 14210/19560 | loss 3.355862 (+1.53z)| norm 0.2656 (+1.06z)| lr 1.53e-04 | 4165.86 ms | 32.4% bf16 MFU | 125563 tok/s step 14211/19560 | loss 3.323670 (+0.61z)| norm 0.2610 (+0.73z)| lr 1.53e-04 | 4174.43 ms | 32.3% bf16 MFU | 125564 tok/s step 14212/19560 | loss 3.311516 (+0.25z)| norm 0.2598 (+0.63z)| lr 1.53e-04 | 4174.24 ms | 32.3% bf16 MFU | 125566 tok/s step 14213/19560 | loss 3.314404 (+0.32z)| norm 0.2386 (-0.87z)| lr 1.53e-04 | 4197.18 ms | 32.2% bf16 MFU | 125534 tok/s step 14214/19560 | loss 3.250100 (-1.55z)| norm 0.2553 (+0.31z)| lr 1.53e-04 | 4197.73 ms | 32.2% bf16 MFU | 125502 tok/s step 14215/19560 | loss 3.278358 (-0.72z)| norm 0.2454 (-0.40z)| lr 1.53e-04 | 4173.09 ms | 32.4% bf16 MFU | 125509 tok/s step 14216/19560 | loss 3.280940 (-0.64z)| norm 0.2590 (+0.57z)| lr 1.53e-04 | 4176.96 ms | 32.3% bf16 MFU | 125509 tok/s step 14217/19560 | loss 3.316089 (+0.38z)| norm 0.2519 (+0.06z)| lr 1.53e-04 | 4682.20 ms | 28.8% bf16 MFU | 124832 tok/s step 14218/19560 | loss 3.368424 (+1.87z)| norm 0.2478 (-0.23z)| lr 1.53e-04 | 4756.83 ms | 28.4% bf16 MFU | 124102 tok/s step 14219/19560 | loss 3.299500 (-0.11z)| norm 0.2507 (-0.04z)| lr 1.53e-04 | 4708.83 ms | 28.7% bf16 MFU | 123464 tok/s step 14220/19560 | loss 3.292524 (-0.30z)| norm 0.2534 (+0.16z)| lr 1.53e-04 | 4558.48 ms | 29.6% bf16 MFU | 123041 tok/s step 14221/19560 | loss 3.333302 (+0.90z)| norm 0.2386 (-0.91z)| lr 1.53e-04 | 4239.32 ms | 31.8% bf16 MFU | 123073 tok/s step 14222/19560 | loss 3.301269 (-0.05z)| norm 
0.2513 (+0.00z)| lr 1.53e-04 | 4330.05 ms | 31.2% bf16 MFU | 122973 tok/s step 14223/19560 | loss 3.325418 (+0.66z)| norm 0.2506 (-0.05z)| lr 1.52e-04 | 4219.07 ms | 32.0% bf16 MFU | 123038 tok/s step 14224/19560 | loss 3.333422 (+0.89z)| norm 0.2373 (-1.00z)| lr 1.52e-04 | 4198.13 ms | 32.2% bf16 MFU | 123130 tok/s step 14225/19560 | loss 3.263964 (-1.13z)| norm 0.2421 (-0.64z)| lr 1.52e-04 | 4171.37 ms | 32.4% bf16 MFU | 123258 tok/s step 14226/19560 | loss 3.393242 (+2.56z)| norm 0.2412 (-0.72z)| lr 1.52e-04 | 4170.18 ms | 32.4% bf16 MFU | 123381 tok/s step 14227/19560 | loss 3.438889 (+3.67z)| norm 0.2679 (+1.19z)| lr 1.52e-04 | 4327.39 ms | 31.2% bf16 MFU | 123270 tok/s step 14228/19560 | loss 3.316194 (+0.33z)| norm 0.2434 (-0.58z)| lr 1.52e-04 | 4200.89 ms | 32.1% bf16 MFU | 123347 tok/s step 14229/19560 | loss 3.326894 (+0.61z)| norm 0.2400 (-0.84z)| lr 1.52e-04 | 4216.34 ms | 32.0% bf16 MFU | 123397 tok/s step 14230/19560 | loss 3.302705 (-0.04z)| norm 0.2316 (-1.44z)| lr 1.52e-04 | 4192.87 ms | 32.2% bf16 MFU | 123479 tok/s step 14231/19560 | loss 3.287769 (-0.44z)| norm 0.2385 (-0.96z)| lr 1.52e-04 | 4190.82 ms | 32.2% bf16 MFU | 123560 tok/s step 14232/19560 | loss 3.332870 (+0.79z)| norm 0.2319 (-1.44z)| lr 1.52e-04 | 4173.70 ms | 32.3% bf16 MFU | 123663 tok/s step 14233/19560 | loss 3.274029 (-0.81z)| norm 0.2385 (-0.96z)| lr 1.52e-04 | 4167.18 ms | 32.4% bf16 MFU | 123771 tok/s step 14234/19560 | loss 3.313368 (+0.26z)| norm 0.2406 (-0.82z)| lr 1.52e-04 | 4161.14 ms | 32.4% bf16 MFU | 123882 tok/s step 14235/19560 | loss 3.304048 (+0.02z)| norm 0.2271 (-1.81z)| lr 1.52e-04 | 4169.22 ms | 32.4% bf16 MFU | 123975 tok/s step 14236/19560 | loss 3.319004 (+0.42z)| norm 0.2808 (+2.11z)| lr 1.52e-04 | 4169.85 ms | 32.4% bf16 MFU | 124063 tok/s step 14237/19560 | loss 3.314250 (+0.30z)| norm 0.2240 (-2.02z)| lr 1.52e-04 | 4170.83 ms | 32.4% bf16 MFU | 124145 tok/s step 14238/19560 | loss 3.336275 (+0.89z)| norm 0.2629 (+0.78z)| lr 1.52e-04 | 4159.61 ms | 
32.5% bf16 MFU | 124240 tok/s step 14239/19560 | loss 3.338374 (+0.96z)| norm 0.2386 (-0.99z)| lr 1.52e-04 | 4174.02 ms | 32.3% bf16 MFU | 124309 tok/s step 14240/19560 | loss 3.298435 (-0.14z)| norm 0.2326 (-1.42z)| lr 1.52e-04 | 4234.59 ms | 31.9% bf16 MFU | 124284 tok/s step 14241/19560 | loss 3.278838 (-0.67z)| norm 0.2444 (-0.56z)| lr 1.52e-04 | 4216.45 ms | 32.0% bf16 MFU | 124287 tok/s step 14242/19560 | loss 3.277493 (-0.71z)| norm 0.2407 (-0.82z)| lr 1.51e-04 | 4171.46 ms | 32.4% bf16 MFU | 124357 tok/s step 14243/19560 | loss 3.349189 (+1.25z)| norm 0.2291 (-1.63z)| lr 1.51e-04 | 4162.11 ms | 32.4% bf16 MFU | 124437 tok/s step 14244/19560 | loss 3.341286 (+1.02z)| norm 0.2630 (+0.81z)| lr 1.51e-04 | 4180.65 ms | 32.3% bf16 MFU | 124486 tok/s step 14245/19560 | loss 3.328712 (+0.66z)| norm 0.2452 (-0.47z)| lr 1.51e-04 | 4167.17 ms | 32.4% bf16 MFU | 124552 tok/s step 14246/19560 | loss 3.311943 (+0.20z)| norm 0.2403 (-0.83z)| lr 1.51e-04 | 4167.82 ms | 32.4% bf16 MFU | 124614 tok/s step 14247/19560 | loss 3.390271 (+2.28z)| norm 0.2648 (+0.95z)| lr 1.51e-04 | 4178.98 ms | 32.3% bf16 MFU | 124656 tok/s step 14248/19560 | loss 3.279068 (-0.70z)| norm 0.2687 (+1.22z)| lr 1.51e-04 | 4160.18 ms | 32.5% bf16 MFU | 124725 tok/s step 14249/19560 | loss 3.292514 (-0.33z)| norm 0.2528 (+0.07z)| lr 1.51e-04 | 4178.52 ms | 32.3% bf16 MFU | 124762 tok/s step 14250/19560 | loss 3.318240 (+0.36z)| norm 0.2316 (-1.43z)| lr 1.51e-04 | 4190.81 ms | 32.2% bf16 MFU | 124779 tok/s val loss 3.296182 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 
evaluating
HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3012/10042 = 0.299940 step 14251/19560 | loss 3.282050 (-0.61z)| norm 0.2596 (+0.58z)| lr 1.51e-04 | 4176.14 ms | 32.3% bf16 MFU | 124817 tok/s step 14252/19560 | loss 3.291874 (-0.35z)| norm 0.2503 (-0.08z)| lr 1.51e-04 | 4361.22 ms | 31.0% bf16 MFU | 124587 tok/s step 14253/19560 | loss 3.270416 (-0.92z)| norm 0.2529 (+0.10z)| lr 1.51e-04 | 4173.34 ms | 32.4% bf16 MFU | 124639 tok/s step 14254/19560 | loss 3.339518 (+0.93z)| norm 0.2802 (+2.07z)| lr 1.51e-04 | 4172.47 ms | 32.4% bf16 MFU | 124690 tok/s step 14255/19560 | loss 3.334408 
(+0.78z)| norm 0.2451 (-0.45z)| lr 1.51e-04 | 4167.59 ms | 32.4% bf16 MFU | 124746 tok/s step 14256/19560 | loss 3.355524 (+1.34z)| norm 0.2604 (+0.65z)| lr 1.51e-04 | 4176.14 ms | 32.3% bf16 MFU | 124786 tok/s step 14257/19560 | loss 3.334756 (+0.79z)| norm 0.2547 (+0.24z)| lr 1.51e-04 | 4199.99 ms | 32.1% bf16 MFU | 124788 tok/s step 14258/19560 | loss 3.289323 (-0.43z)| norm 0.2550 (+0.26z)| lr 1.51e-04 | 4166.59 ms | 32.4% bf16 MFU | 124840 tok/s step 14259/19560 | loss 3.271876 (-0.89z)| norm 0.2482 (-0.24z)| lr 1.51e-04 | 4167.51 ms | 32.4% bf16 MFU | 124888 tok/s step 14260/19560 | loss 3.317662 (+0.34z)| norm 0.2496 (-0.13z)| lr 1.51e-04 | 4165.57 ms | 32.4% bf16 MFU | 124937 tok/s step 14261/19560 | loss 3.324452 (+0.52z)| norm 0.2582 (+0.48z)| lr 1.50e-04 | 4177.85 ms | 32.3% bf16 MFU | 124965 tok/s step 14262/19560 | loss 3.265205 (-1.10z)| norm 0.2419 (-0.69z)| lr 1.50e-04 | 4169.15 ms | 32.4% bf16 MFU | 125004 tok/s step 14263/19560 | loss 3.326550 (+0.57z)| norm 0.2619 (+0.75z)| lr 1.50e-04 | 4167.76 ms | 32.4% bf16 MFU | 125044 tok/s step 14264/19560 | loss 3.296423 (-0.25z)| norm 0.2413 (-0.76z)| lr 1.50e-04 | 4171.55 ms | 32.4% bf16 MFU | 125076 tok/s step 14265/19560 | loss 3.282612 (-0.63z)| norm 0.2544 (+0.20z)| lr 1.50e-04 | 4178.30 ms | 32.3% bf16 MFU | 125096 tok/s step 14266/19560 | loss 3.349833 (+1.19z)| norm 0.2545 (+0.21z)| lr 1.50e-04 | 4168.67 ms | 32.4% bf16 MFU | 125129 tok/s step 14267/19560 | loss 3.309985 (+0.10z)| norm 0.2430 (-0.64z)| lr 1.50e-04 | 4166.31 ms | 32.4% bf16 MFU | 125165 tok/s step 14268/19560 | loss 3.346382 (+1.09z)| norm 0.2407 (-0.80z)| lr 1.50e-04 | 4174.45 ms | 32.3% bf16 MFU | 125186 tok/s step 14269/19560 | loss 3.267112 (-1.07z)| norm 0.2316 (-1.44z)| lr 1.50e-04 | 4200.39 ms | 32.1% bf16 MFU | 125168 tok/s step 14270/19560 | loss 3.306443 (+0.01z)| norm 0.2267 (-1.77z)| lr 1.50e-04 | 4173.60 ms | 32.4% bf16 MFU | 125191 tok/s step 14271/19560 | loss 3.349015 (+1.16z)| norm 0.2580 (+0.49z)| lr 1.50e-04 | 
4168.36 ms | 32.4% bf16 MFU | 125220 tok/s step 14272/19560 | loss 3.302556 (-0.10z)| norm 0.2322 (-1.36z)| lr 1.50e-04 | 4166.82 ms | 32.4% bf16 MFU | 125250 tok/s step 14273/19560 | loss 3.351420 (+1.22z)| norm 0.2379 (-0.95z)| lr 1.50e-04 | 4167.50 ms | 32.4% bf16 MFU | 125278 tok/s step 14274/19560 | loss 3.285905 (-0.58z)| norm 0.2488 (-0.17z)| lr 1.50e-04 | 4167.78 ms | 32.4% bf16 MFU | 125304 tok/s step 14275/19560 | loss 3.270933 (-0.98z)| norm 0.2363 (-1.07z)| lr 1.50e-04 | 4166.10 ms | 32.4% bf16 MFU | 125331 tok/s step 14276/19560 | loss 3.329912 (+0.62z)| norm 0.2375 (-0.99z)| lr 1.50e-04 | 4169.75 ms | 32.4% bf16 MFU | 125351 tok/s step 14277/19560 | loss 3.260027 (-1.28z)| norm 0.2377 (-0.96z)| lr 1.50e-04 | 4166.12 ms | 32.4% bf16 MFU | 125376 tok/s step 14278/19560 | loss 3.309257 (+0.06z)| norm 0.2337 (-1.23z)| lr 1.50e-04 | 4185.79 ms | 32.3% bf16 MFU | 125370 tok/s step 14279/19560 | loss 3.376629 (+1.85z)| norm 0.2397 (-0.82z)| lr 1.50e-04 | 4169.71 ms | 32.4% bf16 MFU | 125388 tok/s step 14280/19560 | loss 3.344606 (+0.98z)| norm 0.2269 (-1.72z)| lr 1.49e-04 | 4173.35 ms | 32.4% bf16 MFU | 125400 tok/s step 14281/19560 | loss 3.304710 (-0.08z)| norm 0.2363 (-1.03z)| lr 1.49e-04 | 4182.16 ms | 32.3% bf16 MFU | 125398 tok/s step 14282/19560 | loss 3.361698 (+1.42z)| norm 0.2405 (-0.74z)| lr 1.49e-04 | 4171.86 ms | 32.4% bf16 MFU | 125412 tok/s step 14283/19560 | loss 3.305488 (-0.08z)| norm 0.2406 (-0.73z)| lr 1.49e-04 | 4163.87 ms | 32.4% bf16 MFU | 125437 tok/s step 14284/19560 | loss 3.294889 (-0.36z)| norm 0.2414 (-0.67z)| lr 1.49e-04 | 4179.60 ms | 32.3% bf16 MFU | 125437 tok/s step 14285/19560 | loss 3.300486 (-0.20z)| norm 0.2516 (+0.04z)| lr 1.49e-04 | 4175.42 ms | 32.3% bf16 MFU | 125444 tok/s step 14286/19560 | loss 3.324829 (+0.45z)| norm 0.2668 (+1.13z)| lr 1.49e-04 | 4172.24 ms | 32.4% bf16 MFU | 125454 tok/s step 14287/19560 | loss 3.353583 (+1.20z)| norm 0.2419 (-0.65z)| lr 1.49e-04 | 4208.93 ms | 32.1% bf16 MFU | 125410 tok/s step 
14288/19560 | loss 3.287965 (-0.56z)| norm 0.2497 (-0.10z)| lr 1.49e-04 | 4164.85 ms | 32.4% bf16 MFU | 125434 tok/s step 14289/19560 | loss 3.351068 (+1.12z)| norm 0.2344 (-1.19z)| lr 1.49e-04 | 4173.14 ms | 32.4% bf16 MFU | 125444 tok/s step 14290/19560 | loss 3.323328 (+0.37z)| norm 0.2541 (+0.23z)| lr 1.49e-04 | 4166.61 ms | 32.4% bf16 MFU | 125463 tok/s step 14291/19560 | loss 3.332303 (+0.61z)| norm 0.2468 (-0.29z)| lr 1.49e-04 | 4170.42 ms | 32.4% bf16 MFU | 125476 tok/s step 14292/19560 | loss 3.342743 (+0.92z)| norm 0.2380 (-0.91z)| lr 1.49e-04 | 4167.58 ms | 32.4% bf16 MFU | 125492 tok/s step 14293/19560 | loss 3.315185 (+0.16z)| norm 0.2496 (-0.08z)| lr 1.49e-04 | 4159.06 ms | 32.5% bf16 MFU | 125520 tok/s step 14294/19560 | loss 3.333354 (+0.65z)| norm 0.2497 (-0.06z)| lr 1.49e-04 | 4161.20 ms | 32.4% bf16 MFU | 125544 tok/s step 14295/19560 | loss 3.242918 (-1.88z)| norm 0.2645 (+0.99z)| lr 1.49e-04 | 4169.47 ms | 32.4% bf16 MFU | 125554 tok/s step 14296/19560 | loss 3.295922 (-0.39z)| norm 0.2462 (-0.32z)| lr 1.49e-04 | 4168.60 ms | 32.4% bf16 MFU | 125565 tok/s step 14297/19560 | loss 3.348771 (+1.08z)| norm 0.2420 (-0.61z)| lr 1.49e-04 | 4286.28 ms | 31.5% bf16 MFU | 125403 tok/s step 14298/19560 | loss 3.319228 (+0.25z)| norm 0.2985 (+3.25z)| lr 1.48e-04 | 4166.99 ms | 32.4% bf16 MFU | 125423 tok/s step 14299/19560 | loss 3.322419 (+0.33z)| norm 0.2647 (+0.92z)| lr 1.48e-04 | 4160.02 ms | 32.5% bf16 MFU | 125454 tok/s step 14300/19560 | loss 3.367929 (+1.58z)| norm 0.2378 (-0.91z)| lr 1.48e-04 | 4162.99 ms | 32.4% bf16 MFU | 125478 tok/s step 14301/19560 | loss 3.283610 (-0.75z)| norm 0.2494 (-0.11z)| lr 1.48e-04 | 4172.03 ms | 32.4% bf16 MFU | 125488 tok/s step 14302/19560 | loss 3.374025 (+1.71z)| norm 0.2496 (-0.09z)| lr 1.48e-04 | 4163.83 ms | 32.4% bf16 MFU | 125509 tok/s step 14303/19560 | loss 3.264134 (-1.28z)| norm 0.2392 (-0.80z)| lr 1.48e-04 | 4164.09 ms | 32.4% bf16 MFU | 125529 tok/s step 14304/19560 | loss 3.299256 (-0.31z)| norm 
0.2450 (-0.38z)| lr 1.48e-04 | 4325.04 ms | 31.2% bf16 MFU | 125313 tok/s step 14305/19560 | loss 3.305234 (-0.15z)| norm 0.2308 (-1.36z)| lr 1.48e-04 | 4161.90 ms | 32.4% bf16 MFU | 125346 tok/s step 14306/19560 | loss 3.281979 (-0.79z)| norm 0.2580 (+0.54z)| lr 1.48e-04 | 4167.12 ms | 32.4% bf16 MFU | 125370 tok/s step 14307/19560 | loss 3.249748 (-1.64z)| norm 0.2422 (-0.55z)| lr 1.48e-04 | 4153.83 ms | 32.5% bf16 MFU | 125412 tok/s step 14308/19560 | loss 3.310009 (-0.02z)| norm 0.2597 (+0.69z)| lr 1.48e-04 | 4162.30 ms | 32.4% bf16 MFU | 125440 tok/s step 14309/19560 | loss 3.370731 (+1.61z)| norm 0.2555 (+0.39z)| lr 1.48e-04 | 4269.46 ms | 31.6% bf16 MFU | 125308 tok/s step 14310/19560 | loss 3.313472 (+0.06z)| norm 0.2648 (+1.07z)| lr 1.48e-04 | 4173.33 ms | 32.4% bf16 MFU | 125324 tok/s step 14311/19560 | loss 3.374611 (+1.71z)| norm 0.2381 (-0.84z)| lr 1.48e-04 | 4165.44 ms | 32.4% bf16 MFU | 125351 tok/s step 14312/19560 | loss 3.287953 (-0.64z)| norm 0.2698 (+1.50z)| lr 1.48e-04 | 4179.67 ms | 32.3% bf16 MFU | 125355 tok/s step 14313/19560 | loss 3.259544 (-1.44z)| norm 0.2363 (-0.96z)| lr 1.48e-04 | 4171.29 ms | 32.4% bf16 MFU | 125372 tok/s step 14314/19560 | loss 3.324306 (+0.34z)| norm 0.2481 (-0.07z)| lr 1.48e-04 | 4170.61 ms | 32.4% bf16 MFU | 125389 tok/s step 14315/19560 | loss 3.326569 (+0.40z)| norm 0.2420 (-0.52z)| lr 1.48e-04 | 4170.82 ms | 32.4% bf16 MFU | 125405 tok/s step 14316/19560 | loss 3.344192 (+0.87z)| norm 0.2592 (+0.77z)| lr 1.48e-04 | 4169.60 ms | 32.4% bf16 MFU | 125421 tok/s step 14317/19560 | loss 3.355580 (+1.17z)| norm 0.2476 (-0.09z)| lr 1.48e-04 | 4177.57 ms | 32.3% bf16 MFU | 125425 tok/s step 14318/19560 | loss 3.302955 (-0.27z)| norm 0.2470 (-0.14z)| lr 1.47e-04 | 4170.50 ms | 32.4% bf16 MFU | 125440 tok/s step 14319/19560 | loss 3.310042 (-0.08z)| norm 0.2569 (+0.60z)| lr 1.47e-04 | 4164.71 ms | 32.4% bf16 MFU | 125462 tok/s step 14320/19560 | loss 3.314617 (+0.03z)| norm 0.2421 (-0.53z)| lr 1.47e-04 | 4171.03 ms | 
32.4% bf16 MFU | 125474 tok/s step 14321/19560 | loss 3.355516 (+1.15z)| norm 0.2309 (-1.36z)| lr 1.47e-04 | 4167.56 ms | 32.4% bf16 MFU | 125490 tok/s step 14322/19560 | loss 3.250838 (-1.76z)| norm 0.2458 (-0.24z)| lr 1.47e-04 | 4167.95 ms | 32.4% bf16 MFU | 125505 tok/s step 14323/19560 | loss 3.299004 (-0.43z)| norm 0.2315 (-1.29z)| lr 1.47e-04 | 4169.03 ms | 32.4% bf16 MFU | 125518 tok/s step 14324/19560 | loss 3.331278 (+0.47z)| norm 0.2322 (-1.22z)| lr 1.47e-04 | 4170.12 ms | 32.4% bf16 MFU | 125528 tok/s step 14325/19560 | loss 3.324852 (+0.28z)| norm 0.2499 (+0.09z)| lr 1.47e-04 | 4169.46 ms | 32.4% bf16 MFU | 125539 tok/s step 14326/19560 | loss 3.302714 (-0.33z)| norm 0.2498 (+0.10z)| lr 1.47e-04 | 4163.78 ms | 32.4% bf16 MFU | 125558 tok/s step 14327/19560 | loss 3.322960 (+0.24z)| norm 0.2382 (-0.78z)| lr 1.47e-04 | 4167.88 ms | 32.4% bf16 MFU | 125570 tok/s step 14328/19560 | loss 3.292097 (-0.63z)| norm 0.2463 (-0.16z)| lr 1.47e-04 | 4170.80 ms | 32.4% bf16 MFU | 125576 tok/s step 14329/19560 | loss 3.291854 (-0.62z)| norm 0.2319 (-1.22z)| lr 1.47e-04 | 4171.67 ms | 32.4% bf16 MFU | 125582 tok/s step 14330/19560 | loss 3.327520 (+0.37z)| norm 0.2546 (+0.48z)| lr 1.47e-04 | 4170.91 ms | 32.4% bf16 MFU | 125588 tok/s step 14331/19560 | loss 3.284720 (-0.89z)| norm 0.2419 (-0.47z)| lr 1.47e-04 | 4171.48 ms | 32.4% bf16 MFU | 125592 tok/s step 14332/19560 | loss 3.340502 (+0.75z)| norm 0.2683 (+1.55z)| lr 1.47e-04 | 4166.11 ms | 32.4% bf16 MFU | 125605 tok/s step 14333/19560 | loss 3.321401 (+0.19z)| norm 0.2450 (-0.23z)| lr 1.47e-04 | 4211.97 ms | 32.1% bf16 MFU | 125549 tok/s step 14334/19560 | loss 3.257911 (-1.66z)| norm 0.2427 (-0.40z)| lr 1.47e-04 | 4214.55 ms | 32.0% bf16 MFU | 125491 tok/s step 14335/19560 | loss 3.340805 (+0.78z)| norm 0.2715 (+1.89z)| lr 1.47e-04 | 4189.95 ms | 32.2% bf16 MFU | 125473 tok/s step 14336/19560 | loss 3.319674 (+0.15z)| norm 0.2226 (-1.94z)| lr 1.47e-04 | 4214.87 ms | 32.0% bf16 MFU | 125419 tok/s step 14337/19560 
| loss 3.327602 (+0.38z)| norm 0.2939 (+3.50z)| lr 1.46e-04 | 4182.82 ms | 32.3% bf16 MFU | 125415 tok/s step 14338/19560 | loss 3.276832 (-1.13z)| norm 0.2310 (-1.24z)| lr 1.46e-04 | 4168.66 ms | 32.4% bf16 MFU | 125433 tok/s step 14339/19560 | loss 3.313852 (-0.02z)| norm 0.2545 (+0.54z)| lr 1.46e-04 | 4169.95 ms | 32.4% bf16 MFU | 125448 tok/s step 14340/19560 | loss 3.265924 (-1.43z)| norm 0.2479 (+0.05z)| lr 1.46e-04 | 4173.28 ms | 32.4% bf16 MFU | 125457 tok/s step 14341/19560 | loss 3.287676 (-0.78z)| norm 0.2476 (+0.02z)| lr 1.46e-04 | 4175.57 ms | 32.3% bf16 MFU | 125462 tok/s step 14342/19560 | loss 3.312403 (-0.06z)| norm 0.2530 (+0.43z)| lr 1.46e-04 | 4170.65 ms | 32.4% bf16 MFU | 125474 tok/s step 14343/19560 | loss 3.254503 (-1.78z)| norm 0.2342 (-0.99z)| lr 1.46e-04 | 4170.31 ms | 32.4% bf16 MFU | 125487 tok/s step 14344/19560 | loss 3.344760 (+0.90z)| norm 0.2486 (+0.11z)| lr 1.46e-04 | 4215.08 ms | 32.0% bf16 MFU | 125431 tok/s step 14345/19560 | loss 3.287433 (-0.80z)| norm 0.2775 (+2.26z)| lr 1.46e-04 | 4169.75 ms | 32.4% bf16 MFU | 125447 tok/s step 14346/19560 | loss 3.471399 (+4.35z)| norm 0.2445 (-0.21z)| lr 1.46e-04 | 4164.23 ms | 32.4% bf16 MFU | 125469 tok/s step 14347/19560 | loss 3.309920 (-0.15z)| norm 0.2595 (+0.90z)| lr 1.46e-04 | 4172.78 ms | 32.4% bf16 MFU | 125478 tok/s step 14348/19560 | loss 3.273493 (-1.15z)| norm 0.2562 (+0.66z)| lr 1.46e-04 | 4170.36 ms | 32.4% bf16 MFU | 125490 tok/s step 14349/19560 | loss 3.293487 (-0.59z)| norm 0.2657 (+1.34z)| lr 1.46e-04 | 4210.59 ms | 32.1% bf16 MFU | 125442 tok/s step 14350/19560 | loss 3.318722 (+0.10z)| norm 0.2474 (-0.02z)| lr 1.46e-04 | 4181.86 ms | 32.3% bf16 MFU | 125438 tok/s step 14351/19560 | loss 3.401656 (+2.34z)| norm 0.2637 (+1.18z)| lr 1.46e-04 | 4162.23 ms | 32.4% bf16 MFU | 125464 tok/s step 14352/19560 | loss 3.336375 (+0.56z)| norm 0.2375 (-0.75z)| lr 1.46e-04 | 4172.26 ms | 32.4% bf16 MFU | 125474 tok/s step 14353/19560 | loss 3.288196 (-0.75z)| norm 0.3054 (+3.96z)| 
lr 1.46e-04 | 4174.67 ms | 32.3% bf16 MFU | 125480 tok/s step 14354/19560 | loss 3.395026 (+2.16z)| norm 0.2520 (+0.26z)| lr 1.46e-04 | 4175.87 ms | 32.3% bf16 MFU | 125483 tok/s step 14355/19560 | loss 3.367943 (+1.49z)| norm 0.2440 (-0.29z)| lr 1.46e-04 | 4196.98 ms | 32.2% bf16 MFU | 125455 tok/s step 14356/19560 | loss 3.279254 (-1.01z)| norm 0.2520 (+0.27z)| lr 1.45e-04 | 4182.63 ms | 32.3% bf16 MFU | 125450 tok/s step 14357/19560 | loss 3.340520 (+0.72z)| norm 0.2485 (+0.02z)| lr 1.45e-04 | 4170.38 ms | 32.4% bf16 MFU | 125463 tok/s step 14358/19560 | loss 3.273417 (-1.16z)| norm 0.2405 (-0.54z)| lr 1.45e-04 | 4164.38 ms | 32.4% bf16 MFU | 125485 tok/s step 14359/19560 | loss 3.366548 (+1.42z)| norm 0.2519 (+0.25z)| lr 1.45e-04 | 4172.75 ms | 32.4% bf16 MFU | 125493 tok/s step 14360/19560 | loss 3.320080 (+0.13z)| norm 0.2482 (-0.02z)| lr 1.45e-04 | 4174.22 ms | 32.3% bf16 MFU | 125498 tok/s step 14361/19560 | loss 3.279914 (-0.99z)| norm 0.2464 (-0.15z)| lr 1.45e-04 | 4219.80 ms | 32.0% bf16 MFU | 125436 tok/s step 14362/19560 | loss 3.325323 (+0.28z)| norm 0.2425 (-0.43z)| lr 1.45e-04 | 4168.92 ms | 32.4% bf16 MFU | 125452 tok/s step 14363/19560 | loss 3.328959 (+0.37z)| norm 0.2391 (-0.68z)| lr 1.45e-04 | 4174.76 ms | 32.3% bf16 MFU | 125459 tok/s step 14364/19560 | loss 3.347079 (+0.87z)| norm 0.2544 (+0.43z)| lr 1.45e-04 | 4206.06 ms | 32.1% bf16 MFU | 125418 tok/s step 14365/19560 | loss 3.317551 (+0.05z)| norm 0.2306 (-1.31z)| lr 1.45e-04 | 4169.88 ms | 32.4% bf16 MFU | 125434 tok/s step 14366/19560 | loss 3.340023 (+0.67z)| norm 0.2379 (-0.76z)| lr 1.45e-04 | 4163.28 ms | 32.4% bf16 MFU | 125459 tok/s step 14367/19560 | loss 3.335480 (+0.54z)| norm 0.2268 (-1.56z)| lr 1.45e-04 | 4176.55 ms | 32.3% bf16 MFU | 125462 tok/s step 14368/19560 | loss 3.292310 (-0.65z)| norm 0.2363 (-0.87z)| lr 1.45e-04 | 4173.57 ms | 32.4% bf16 MFU | 125470 tok/s step 14369/19560 | loss 3.319299 (+0.09z)| norm 0.2196 (-2.05z)| lr 1.45e-04 | 4167.07 ms | 32.4% bf16 MFU | 
125488 tok/s step 14370/19560 | loss 3.359367 (+1.19z)| norm 0.2342 (-0.99z)| lr 1.45e-04 | 4182.41 ms | 32.3% bf16 MFU | 125481 tok/s step 14371/19560 | loss 3.351718 (+0.97z)| norm 0.2444 (-0.27z)| lr 1.45e-04 | 4168.09 ms | 32.4% bf16 MFU | 125496 tok/s step 14372/19560 | loss 3.313584 (-0.08z)| norm 0.2177 (-2.14z)| lr 1.45e-04 | 4169.33 ms | 32.4% bf16 MFU | 125509 tok/s step 14373/19560 | loss 3.377859 (+1.68z)| norm 0.2529 (+0.36z)| lr 1.45e-04 | 4219.32 ms | 32.0% bf16 MFU | 125446 tok/s step 14374/19560 | loss 3.307872 (-0.25z)| norm 0.2314 (-1.15z)| lr 1.45e-04 | 4162.13 ms | 32.4% bf16 MFU | 125472 tok/s step 14375/19560 | loss 3.307065 (-0.26z)| norm 0.2455 (-0.15z)| lr 1.44e-04 | 4173.54 ms | 32.4% bf16 MFU | 125480 tok/s step 14376/19560 | loss 3.295033 (-0.60z)| norm 0.2356 (-0.84z)| lr 1.44e-04 | 4185.22 ms | 32.3% bf16 MFU | 125469 tok/s step 14377/19560 | loss 3.272710 (-1.22z)| norm 0.2462 (-0.08z)| lr 1.44e-04 | 4177.95 ms | 32.3% bf16 MFU | 125470 tok/s step 14378/19560 | loss 3.292379 (-0.66z)| norm 0.2242 (-1.64z)| lr 1.44e-04 | 4174.52 ms | 32.3% bf16 MFU | 125477 tok/s step 14379/19560 | loss 3.359880 (+1.21z)| norm 0.2432 (-0.28z)| lr 1.44e-04 | 4171.69 ms | 32.4% bf16 MFU | 125487 tok/s step 14380/19560 | loss 3.282666 (-0.95z)| norm 0.2461 (-0.07z)| lr 1.44e-04 | 4173.91 ms | 32.3% bf16 MFU | 125493 tok/s step 14381/19560 | loss 3.360575 (+1.21z)| norm 0.2563 (+0.66z)| lr 1.44e-04 | 4164.15 ms | 32.4% bf16 MFU | 125513 tok/s step 14382/19560 | loss 3.325173 (+0.22z)| norm 0.2491 (+0.16z)| lr 1.44e-04 | 4177.39 ms | 32.3% bf16 MFU | 125513 tok/s step 14383/19560 | loss 3.397674 (+2.19z)| norm 0.2461 (-0.06z)| lr 1.44e-04 | 4173.83 ms | 32.3% bf16 MFU | 125518 tok/s step 14384/19560 | loss 3.336210 (+0.51z)| norm 0.2471 (+0.03z)| lr 1.44e-04 | 4161.62 ms | 32.4% bf16 MFU | 125541 tok/s step 14385/19560 | loss 3.312424 (-0.14z)| norm 0.2235 (-1.67z)| lr 1.44e-04 | 4173.26 ms | 32.4% bf16 MFU | 125546 tok/s step 14386/19560 | loss 3.380996 
(+1.72z)| norm 0.2524 (+0.43z)| lr 1.44e-04 | 4165.99 ms | 32.4% bf16 MFU | 125561 tok/s step 14387/19560 | loss 3.325696 (+0.20z)| norm 0.2552 (+0.63z)| lr 1.44e-04 | 4171.25 ms | 32.4% bf16 MFU | 125567 tok/s step 14388/19560 | loss 3.292275 (-0.71z)| norm 0.2337 (-0.92z)| lr 1.44e-04 | 4167.12 ms | 32.4% bf16 MFU | 125580 tok/s step 14389/19560 | loss 3.348795 (+0.83z)| norm 0.2475 (+0.08z)| lr 1.44e-04 | 4166.40 ms | 32.4% bf16 MFU | 125593 tok/s step 14390/19560 | loss 3.283129 (-0.97z)| norm 0.2618 (+1.10z)| lr 1.44e-04 | 4165.13 ms | 32.4% bf16 MFU | 125607 tok/s step 14391/19560 | loss 3.300662 (-0.49z)| norm 0.2360 (-0.75z)| lr 1.44e-04 | 4173.57 ms | 32.4% bf16 MFU | 125607 tok/s step 14392/19560 | loss 3.319835 (+0.03z)| norm 0.2423 (-0.29z)| lr 1.44e-04 | 4217.67 ms | 32.0% bf16 MFU | 125542 tok/s step 14393/19560 | loss 3.284939 (-0.93z)| norm 0.2546 (+0.60z)| lr 1.44e-04 | 4161.56 ms | 32.4% bf16 MFU | 125565 tok/s step 14394/19560 | loss 3.439284 (+3.17z)| norm 0.2683 (+1.57z)| lr 1.43e-04 | 4161.45 ms | 32.4% bf16 MFU | 125586 tok/s step 14395/19560 | loss 3.284368 (-0.92z)| norm 0.2507 (+0.30z)| lr 1.43e-04 | 4168.41 ms | 32.4% bf16 MFU | 125595 tok/s step 14396/19560 | loss 3.307484 (-0.30z)| norm 0.2638 (+1.22z)| lr 1.43e-04 | 4164.45 ms | 32.4% bf16 MFU | 125610 tok/s step 14397/19560 | loss 3.260894 (-1.53z)| norm 0.2517 (+0.35z)| lr 1.43e-04 | 4173.77 ms | 32.3% bf16 MFU | 125610 tok/s step 14398/19560 | loss 3.335183 (+0.43z)| norm 0.2681 (+1.50z)| lr 1.43e-04 | 4180.93 ms | 32.3% bf16 MFU | 125600 tok/s step 14399/19560 | loss 3.297659 (-0.55z)| norm 0.2574 (+0.74z)| lr 1.43e-04 | 4175.36 ms | 32.3% bf16 MFU | 125598 tok/s step 14400/19560 | loss 3.362266 (+1.13z)| norm 0.2531 (+0.42z)| lr 1.43e-04 | 4163.89 ms | 32.4% bf16 MFU | 125614 tok/s step 14401/19560 | loss 3.300050 (-0.49z)| norm 0.2485 (+0.08z)| lr 1.43e-04 | 4161.79 ms | 32.4% bf16 MFU | 125632 tok/s step 14402/19560 | loss 3.299284 (-0.51z)| norm 0.2569 (+0.68z)| lr 1.43e-04 | 
4175.29 ms | 32.3% bf16 MFU | 125629 tok/s step 14403/19560 | loss 3.344198 (+0.66z)| norm 0.2555 (+0.57z)| lr 1.43e-04 | 4162.28 ms | 32.4% bf16 MFU | 125646 tok/s step 14404/19560 | loss 3.324228 (+0.13z)| norm 0.2469 (-0.06z)| lr 1.43e-04 | 4158.27 ms | 32.5% bf16 MFU | 125668 tok/s step 14405/19560 | loss 3.335067 (+0.41z)| norm 0.2485 (+0.05z)| lr 1.43e-04 | 4163.16 ms | 32.4% bf16 MFU | 125681 tok/s step 14406/19560 | loss 3.310118 (-0.26z)| norm 0.2462 (-0.12z)| lr 1.43e-04 | 4169.94 ms | 32.4% bf16 MFU | 125683 tok/s step 14407/19560 | loss 3.303124 (-0.44z)| norm 0.2335 (-1.03z)| lr 1.43e-04 | 4169.35 ms | 32.4% bf16 MFU | 125687 tok/s step 14408/19560 | loss 3.297745 (-0.57z)| norm 0.2422 (-0.42z)| lr 1.43e-04 | 5076.61 ms | 26.6% bf16 MFU | 124566 tok/s step 14409/19560 | loss 3.265115 (-1.43z)| norm 0.2451 (-0.22z)| lr 1.43e-04 | 4770.71 ms | 28.3% bf16 MFU | 123833 tok/s step 14410/19560 | loss 3.366149 (+1.27z)| norm 0.2349 (-0.95z)| lr 1.43e-04 | 4749.11 ms | 28.4% bf16 MFU | 123161 tok/s step 14411/19560 | loss 3.302143 (-0.44z)| norm 0.2424 (-0.41z)| lr 1.43e-04 | 4593.92 ms | 29.4% bf16 MFU | 122709 tok/s step 14412/19560 | loss 3.359270 (+1.07z)| norm 0.2467 (-0.10z)| lr 1.43e-04 | 4570.46 ms | 29.5% bf16 MFU | 122309 tok/s step 14413/19560 | loss 3.323092 (+0.10z)| norm 0.2438 (-0.31z)| lr 1.42e-04 | 4310.41 ms | 31.3% bf16 MFU | 122275 tok/s step 14414/19560 | loss 3.323642 (+0.11z)| norm 0.2387 (-0.66z)| lr 1.42e-04 | 4406.70 ms | 30.6% bf16 MFU | 122110 tok/s step 14415/19560 | loss 3.298445 (-0.55z)| norm 0.2813 (+2.39z)| lr 1.42e-04 | 4317.03 ms | 31.3% bf16 MFU | 122077 tok/s step 14416/19560 | loss 3.304260 (-0.40z)| norm 0.2447 (-0.24z)| lr 1.42e-04 | 4258.17 ms | 31.7% bf16 MFU | 122130 tok/s step 14417/19560 | loss 3.336598 (+0.47z)| norm 0.2516 (+0.25z)| lr 1.42e-04 | 4176.57 ms | 32.3% bf16 MFU | 122300 tok/s step 14418/19560 | loss 3.314962 (-0.10z)| norm 0.2646 (+1.18z)| lr 1.42e-04 | 4173.53 ms | 32.4% bf16 MFU | 122466 tok/s step 
14419/19560 | loss 3.361260 (+1.13z)| norm 0.2604 (+0.86z)| lr 1.42e-04 | 4350.59 ms | 31.0% bf16 MFU | 122368 tok/s step 14420/19560 | loss 3.314034 (-0.13z)| norm 0.2497 (+0.09z)| lr 1.42e-04 | 4225.86 ms | 32.0% bf16 MFU | 122453 tok/s step 14421/19560 | loss 3.335814 (+0.45z)| norm 0.2502 (+0.12z)| lr 1.42e-04 | 4259.65 ms | 31.7% bf16 MFU | 122484 tok/s step 14422/19560 | loss 3.354517 (+0.94z)| norm 0.2617 (+0.94z)| lr 1.42e-04 | 4197.09 ms | 32.2% bf16 MFU | 122606 tok/s step 14423/19560 | loss 3.320689 (+0.02z)| norm 0.2319 (-1.17z)| lr 1.42e-04 | 4239.14 ms | 31.9% bf16 MFU | 122660 tok/s step 14424/19560 | loss 3.400406 (+2.13z)| norm 0.2736 (+1.78z)| lr 1.42e-04 | 4188.55 ms | 32.2% bf16 MFU | 122785 tok/s step 14425/19560 | loss 3.298859 (-0.57z)| norm 0.2314 (-1.19z)| lr 1.42e-04 | 4306.36 ms | 31.4% bf16 MFU | 122733 tok/s step 14426/19560 | loss 3.374521 (+1.43z)| norm 0.2437 (-0.31z)| lr 1.42e-04 | 4171.58 ms | 32.4% bf16 MFU | 122881 tok/s step 14427/19560 | loss 3.320856 (+0.01z)| norm 0.2400 (-0.57z)| lr 1.42e-04 | 4220.58 ms | 32.0% bf16 MFU | 122948 tok/s step 14428/19560 | loss 3.295986 (-0.64z)| norm 0.2528 (+0.37z)| lr 1.42e-04 | 4259.04 ms | 31.7% bf16 MFU | 122955 tok/s step 14429/19560 | loss 3.306413 (-0.37z)| norm 0.2473 (-0.04z)| lr 1.42e-04 | 4261.82 ms | 31.7% bf16 MFU | 122959 tok/s step 14430/19560 | loss 3.359627 (+1.06z)| norm 0.2438 (-0.30z)| lr 1.42e-04 | 4169.95 ms | 32.4% bf16 MFU | 123097 tok/s step 14431/19560 | loss 3.311810 (-0.24z)| norm 0.2380 (-0.73z)| lr 1.42e-04 | 4166.95 ms | 32.4% bf16 MFU | 123233 tok/s step 14432/19560 | loss 3.282437 (-1.02z)| norm 0.2405 (-0.54z)| lr 1.42e-04 | 4260.53 ms | 31.7% bf16 MFU | 123225 tok/s step 14433/19560 | loss 3.305843 (-0.39z)| norm 0.2559 (+0.59z)| lr 1.41e-04 | 4211.61 ms | 32.1% bf16 MFU | 123288 tok/s step 14434/19560 | loss 3.301815 (-0.51z)| norm 0.2443 (-0.27z)| lr 1.41e-04 | 4165.29 ms | 32.4% bf16 MFU | 123417 tok/s step 14435/19560 | loss 3.280975 (-1.09z)| norm 
0.2434 (-0.33z)| lr 1.41e-04 | 4165.01 ms | 32.4% bf16 MFU | 123540 tok/s step 14436/19560 | loss 3.337408 (+0.45z)| norm 0.2239 (-1.76z)| lr 1.41e-04 | 4198.10 ms | 32.2% bf16 MFU | 123607 tok/s step 14437/19560 | loss 3.288102 (-0.88z)| norm 0.2440 (-0.26z)| lr 1.41e-04 | 4186.13 ms | 32.3% bf16 MFU | 123689 tok/s step 14438/19560 | loss 3.317941 (-0.07z)| norm 0.2368 (-0.78z)| lr 1.41e-04 | 4170.87 ms | 32.4% bf16 MFU | 123790 tok/s step 14439/19560 | loss 3.304936 (-0.41z)| norm 0.2341 (-0.98z)| lr 1.41e-04 | 4245.42 ms | 31.8% bf16 MFU | 123775 tok/s step 14440/19560 | loss 3.343200 (+0.63z)| norm 0.2316 (-1.15z)| lr 1.41e-04 | 4179.53 ms | 32.3% bf16 MFU | 123858 tok/s step 14441/19560 | loss 3.356237 (+0.98z)| norm 0.2268 (-1.50z)| lr 1.41e-04 | 4313.68 ms | 31.3% bf16 MFU | 123742 tok/s step 14442/19560 | loss 3.247009 (-2.02z)| norm 0.2539 (+0.52z)| lr 1.41e-04 | 4162.89 ms | 32.4% bf16 MFU | 123852 tok/s step 14443/19560 | loss 3.284956 (-0.96z)| norm 0.2422 (-0.35z)| lr 1.41e-04 | 4168.51 ms | 32.4% bf16 MFU | 123949 tok/s step 14444/19560 | loss 3.306681 (-0.36z)| norm 0.2440 (-0.21z)| lr 1.41e-04 | 4190.82 ms | 32.2% bf16 MFU | 124006 tok/s step 14445/19560 | loss 3.279295 (-1.09z)| norm 0.2276 (-1.41z)| lr 1.41e-04 | 4174.11 ms | 32.3% bf16 MFU | 124086 tok/s step 14446/19560 | loss 3.272812 (-1.26z)| norm 0.2373 (-0.68z)| lr 1.41e-04 | 4184.85 ms | 32.3% bf16 MFU | 124146 tok/s step 14447/19560 | loss 3.335114 (+0.43z)| norm 0.2350 (-0.84z)| lr 1.41e-04 | 4184.19 ms | 32.3% bf16 MFU | 124204 tok/s step 14448/19560 | loss 3.383687 (+1.72z)| norm 0.2334 (-0.95z)| lr 1.41e-04 | 4211.16 ms | 32.1% bf16 MFU | 124219 tok/s step 14449/19560 | loss 3.293205 (-0.70z)| norm 0.2360 (-0.76z)| lr 1.41e-04 | 4188.97 ms | 32.2% bf16 MFU | 124266 tok/s step 14450/19560 | loss 3.277935 (-1.13z)| norm 0.2367 (-0.70z)| lr 1.41e-04 | 4176.39 ms | 32.3% bf16 MFU | 124329 tok/s step 14451/19560 | loss 3.262875 (-1.52z)| norm 0.2435 (-0.21z)| lr 1.41e-04 | 4167.96 ms | 
32.4% bf16 MFU | 124402 tok/s step 14452/19560 | loss 3.314729 (-0.12z)| norm 0.2461 (-0.03z)| lr 1.40e-04 | 4192.85 ms | 32.2% bf16 MFU | 124434 tok/s step 14453/19560 | loss 3.267783 (-1.36z)| norm 0.2484 (+0.14z)| lr 1.40e-04 | 4184.81 ms | 32.3% bf16 MFU | 124477 tok/s step 14454/19560 | loss 3.327366 (+0.23z)| norm 0.2461 (-0.03z)| lr 1.40e-04 | 4285.20 ms | 31.5% bf16 MFU | 124370 tok/s step 14455/19560 | loss 3.372606 (+1.42z)| norm 0.2444 (-0.16z)| lr 1.40e-04 | 4176.05 ms | 32.3% bf16 MFU | 124429 tok/s step 14456/19560 | loss 3.320801 (+0.04z)| norm 0.2489 (+0.17z)| lr 1.40e-04 | 4190.16 ms | 32.2% bf16 MFU | 124464 tok/s step 14457/19560 | loss 3.321752 (+0.06z)| norm 0.2478 (+0.08z)| lr 1.40e-04 | 4181.35 ms | 32.3% bf16 MFU | 124510 tok/s step 14458/19560 | loss 3.309224 (-0.27z)| norm 0.2482 (+0.12z)| lr 1.40e-04 | 4171.86 ms | 32.4% bf16 MFU | 124568 tok/s step 14459/19560 | loss 3.271926 (-1.26z)| norm 0.2358 (-0.80z)| lr 1.40e-04 | 4170.57 ms | 32.4% bf16 MFU | 124625 tok/s step 14460/19560 | loss 3.283806 (-0.93z)| norm 0.2360 (-0.77z)| lr 1.40e-04 | 4193.38 ms | 32.2% bf16 MFU | 124645 tok/s step 14461/19560 | loss 3.339127 (+0.53z)| norm 0.2427 (-0.27z)| lr 1.40e-04 | 4180.86 ms | 32.3% bf16 MFU | 124683 tok/s step 14462/19560 | loss 3.283140 (-0.96z)| norm 0.2359 (-0.78z)| lr 1.40e-04 | 4184.59 ms | 32.3% bf16 MFU | 124714 tok/s step 14463/19560 | loss 3.288717 (-0.80z)| norm 0.2630 (+1.27z)| lr 1.40e-04 | 4204.90 ms | 32.1% bf16 MFU | 124712 tok/s step 14464/19560 | loss 3.297956 (-0.55z)| norm 0.2297 (-1.26z)| lr 1.40e-04 | 4219.31 ms | 32.0% bf16 MFU | 124690 tok/s step 14465/19560 | loss 3.282545 (-0.95z)| norm 0.2697 (+1.87z)| lr 1.40e-04 | 4173.85 ms | 32.3% bf16 MFU | 124736 tok/s step 14466/19560 | loss 3.298067 (-0.54z)| norm 0.2394 (-0.53z)| lr 1.40e-04 | 4190.31 ms | 32.2% bf16 MFU | 124755 tok/s step 14467/19560 | loss 3.373179 (+1.43z)| norm 0.2328 (-1.05z)| lr 1.40e-04 | 4193.73 ms | 32.2% bf16 MFU | 124768 tok/s step 14468/19560 
| loss 3.331036 (+0.31z)| norm 0.2388 (-0.56z)| lr 1.40e-04 | 4173.44 ms | 32.4% bf16 MFU | 124811 tok/s step 14469/19560 | loss 3.303776 (-0.42z)| norm 0.2323 (-1.06z)| lr 1.40e-04 | 4218.60 ms | 32.0% bf16 MFU | 124784 tok/s step 14470/19560 | loss 3.294818 (-0.66z)| norm 0.2388 (-0.54z)| lr 1.40e-04 | 4178.95 ms | 32.3% bf16 MFU | 124818 tok/s step 14471/19560 | loss 3.279353 (-1.08z)| norm 0.2414 (-0.34z)| lr 1.40e-04 | 4190.54 ms | 32.2% bf16 MFU | 124833 tok/s step 14472/19560 | loss 3.296969 (-0.60z)| norm 0.2521 (+0.50z)| lr 1.39e-04 | 4182.47 ms | 32.3% bf16 MFU | 124859 tok/s step 14473/19560 | loss 3.335253 (+0.42z)| norm 0.2759 (+2.40z)| lr 1.39e-04 | 4181.40 ms | 32.3% bf16 MFU | 124885 tok/s step 14474/19560 | loss 3.279940 (-1.10z)| norm 0.2356 (-0.80z)| lr 1.39e-04 | 4173.46 ms | 32.4% bf16 MFU | 124922 tok/s step 14475/19560 | loss 3.304188 (-0.40z)| norm 0.2357 (-0.77z)| lr 1.39e-04 | 4176.50 ms | 32.3% bf16 MFU | 124953 tok/s step 14476/19560 | loss 3.320146 (+0.05z)| norm 0.2557 (+0.81z)| lr 1.39e-04 | 4449.88 ms | 30.3% bf16 MFU | 124596 tok/s step 14477/19560 | loss 3.320615 (+0.06z)| norm 0.2572 (+0.94z)| lr 1.39e-04 | 4174.99 ms | 32.3% bf16 MFU | 124645 tok/s step 14478/19560 | loss 3.304855 (-0.40z)| norm 0.2435 (-0.15z)| lr 1.39e-04 | 4178.94 ms | 32.3% bf16 MFU | 124686 tok/s step 14479/19560 | loss 3.306895 (-0.32z)| norm 0.2471 (+0.15z)| lr 1.39e-04 | 4168.23 ms | 32.4% bf16 MFU | 124741 tok/s step 14480/19560 | loss 3.279810 (-1.11z)| norm 0.2450 (-0.02z)| lr 1.39e-04 | 4178.36 ms | 32.3% bf16 MFU | 124777 tok/s step 14481/19560 | loss 3.306634 (-0.32z)| norm 0.2407 (-0.37z)| lr 1.39e-04 | 4166.79 ms | 32.4% bf16 MFU | 124830 tok/s step 14482/19560 | loss 3.323539 (+0.20z)| norm 0.2512 (+0.57z)| lr 1.39e-04 | 4173.85 ms | 32.3% bf16 MFU | 124869 tok/s step 14483/19560 | loss 3.376580 (+1.79z)| norm 0.2652 (+1.79z)| lr 1.39e-04 | 4189.48 ms | 32.2% bf16 MFU | 124883 tok/s step 14484/19560 | loss 3.361951 (+1.33z)| norm 0.2509 (+0.53z)| 
lr 1.39e-04 | 4183.45 ms | 32.3% bf16 MFU | 124905 tok/s step 14485/19560 | loss 3.324012 (+0.19z)| norm 0.2532 (+0.73z)| lr 1.39e-04 | 4175.76 ms | 32.3% bf16 MFU | 124937 tok/s step 14486/19560 | loss 3.334324 (+0.49z)| norm 0.2387 (-0.55z)| lr 1.39e-04 | 4209.25 ms | 32.1% bf16 MFU | 124918 tok/s step 14487/19560 | loss 3.318328 (+0.02z)| norm 0.2550 (+0.88z)| lr 1.39e-04 | 4179.22 ms | 32.3% bf16 MFU | 124945 tok/s step 14488/19560 | loss 3.432028 (+3.31z)| norm 0.2497 (+0.41z)| lr 1.39e-04 | 4194.12 ms | 32.2% bf16 MFU | 124948 tok/s step 14489/19560 | loss 3.272036 (-1.35z)| norm 0.2322 (-1.11z)| lr 1.39e-04 | 4176.61 ms | 32.3% bf16 MFU | 124977 tok/s step 14490/19560 | loss 3.295229 (-0.67z)| norm 0.2494 (+0.39z)| lr 1.39e-04 | 4188.86 ms | 32.2% bf16 MFU | 124986 tok/s step 14491/19560 | loss 3.309060 (-0.26z)| norm 0.2389 (-0.53z)| lr 1.38e-04 | 4180.06 ms | 32.3% bf16 MFU | 125008 tok/s step 14492/19560 | loss 3.347514 (+0.85z)| norm 0.2539 (+0.78z)| lr 1.38e-04 | 4173.71 ms | 32.3% bf16 MFU | 125039 tok/s step 14493/19560 | loss 3.340901 (+0.66z)| norm 0.2485 (+0.30z)| lr 1.38e-04 | 4197.32 ms | 32.2% bf16 MFU | 125032 tok/s step 14494/19560 | loss 3.370283 (+1.49z)| norm 0.2506 (+0.48z)| lr 1.38e-04 | 4555.10 ms | 29.6% bf16 MFU | 124536 tok/s step 14495/19560 | loss 3.371040 (+1.49z)| norm 0.2526 (+0.64z)| lr 1.38e-04 | 4172.31 ms | 32.4% bf16 MFU | 124592 tok/s step 14496/19560 | loss 3.310711 (-0.24z)| norm 0.2593 (+1.21z)| lr 1.38e-04 | 4191.86 ms | 32.2% bf16 MFU | 124616 tok/s step 14497/19560 | loss 3.327113 (+0.23z)| norm 0.2552 (+0.84z)| lr 1.38e-04 | 4171.79 ms | 32.4% bf16 MFU | 124669 tok/s step 14498/19560 | loss 3.322905 (+0.12z)| norm 0.2512 (+0.47z)| lr 1.38e-04 | 4177.23 ms | 32.3% bf16 MFU | 124711 tok/s step 14499/19560 | loss 3.293192 (-0.72z)| norm 0.2415 (-0.40z)| lr 1.38e-04 | 4364.70 ms | 30.9% bf16 MFU | 124481 tok/s step 14500/19560 | loss 3.315283 (-0.09z)| norm 0.2547 (+0.79z)| lr 1.38e-04 | 4177.86 ms | 32.3% bf16 MFU | 
124532 tok/s val loss 3.290668 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3007/10042 = 0.299442 step 
14501/19560 | loss 3.293334 (-0.71z)| norm 0.2500 (+0.35z)| lr 1.38e-04 | 4169.88 ms | 32.4% bf16 MFU | 124592 tok/s step 14502/19560 | loss 3.319160 (+0.04z)| norm 0.2425 (-0.35z)| lr 1.38e-04 | 4186.54 ms | 32.3% bf16 MFU | 124624 tok/s step 14503/19560 | loss 3.304662 (-0.38z)| norm 0.2383 (-0.74z)| lr 1.38e-04 | 4327.36 ms | 31.2% bf16 MFU | 124451 tok/s step 14504/19560 | loss 3.324558 (+0.19z)| norm 0.2454 (-0.09z)| lr 1.38e-04 | 4243.08 ms | 31.8% bf16 MFU | 124406 tok/s step 14505/19560 | loss 3.342768 (+0.71z)| norm 0.2574 (+1.03z)| lr 1.38e-04 | 4185.05 ms | 32.3% bf16 MFU | 124450 tok/s step 14506/19560 | loss 3.311283 (-0.21z)| norm 0.2564 (+0.92z)| lr 1.38e-04 | 4220.97 ms | 32.0% bf16 MFU | 124438 tok/s step 14507/19560 | loss 3.304426 (-0.41z)| norm 0.2411 (-0.52z)| lr 1.38e-04 | 4173.96 ms | 32.3% bf16 MFU | 124496 tok/s step 14508/19560 | loss 3.297708 (-0.61z)| norm 0.2509 (+0.40z)| lr 1.38e-04 | 4179.27 ms | 32.3% bf16 MFU | 124544 tok/s step 14509/19560 | loss 3.313082 (-0.14z)| norm 0.2418 (-0.45z)| lr 1.38e-04 | 4172.97 ms | 32.4% bf16 MFU | 124599 tok/s step 14510/19560 | loss 3.362553 (+1.31z)| norm 0.2340 (-1.17z)| lr 1.38e-04 | 4201.21 ms | 32.1% bf16 MFU | 124608 tok/s step 14511/19560 | loss 3.290425 (-0.81z)| norm 0.2345 (-1.11z)| lr 1.37e-04 | 4170.23 ms | 32.4% bf16 MFU | 124664 tok/s step 14512/19560 | loss 3.302372 (-0.44z)| norm 0.2426 (-0.34z)| lr 1.37e-04 | 4305.17 ms | 31.4% bf16 MFU | 124520 tok/s step 14513/19560 | loss 3.355827 (+1.15z)| norm 0.2407 (-0.54z)| lr 1.37e-04 | 4173.14 ms | 32.4% bf16 MFU | 124576 tok/s step 14514/19560 | loss 3.230660 (-2.54z)| norm 0.2387 (-0.73z)| lr 1.37e-04 | 4170.29 ms | 32.4% bf16 MFU | 124633 tok/s step 14515/19560 | loss 3.281762 (-1.01z)| norm 0.2398 (-0.62z)| lr 1.37e-04 | 4181.09 ms | 32.3% bf16 MFU | 124671 tok/s step 14516/19560 | loss 3.302104 (-0.41z)| norm 0.2301 (-1.52z)| lr 1.37e-04 | 4203.01 ms | 32.1% bf16 MFU | 124674 tok/s step 14517/19560 | loss 3.288945 (-0.79z)| norm 
0.2411 (-0.48z)| lr 1.37e-04 | 4247.49 ms | 31.8% bf16 MFU | 124612 tok/s step 14518/19560 | loss 3.278253 (-1.10z)| norm 0.2439 (-0.20z)| lr 1.37e-04 | 4173.43 ms | 32.4% bf16 MFU | 124663 tok/s step 14519/19560 | loss 3.341286 (+0.75z)| norm 0.2316 (-1.37z)| lr 1.37e-04 | 4174.73 ms | 32.3% bf16 MFU | 124709 tok/s step 14520/19560 | loss 3.288183 (-0.81z)| norm 0.2421 (-0.37z)| lr 1.37e-04 | 4169.15 ms | 32.4% bf16 MFU | 124762 tok/s step 14521/19560 | loss 3.340986 (+0.74z)| norm 0.2545 (+0.81z)| lr 1.37e-04 | 4160.21 ms | 32.5% bf16 MFU | 124825 tok/s step 14522/19560 | loss 3.329104 (+0.43z)| norm 0.2275 (-1.74z)| lr 1.37e-04 | 4162.18 ms | 32.4% bf16 MFU | 124882 tok/s step 14523/19560 | loss 3.410780 (+2.86z)| norm 0.2438 (-0.17z)| lr 1.37e-04 | 4246.03 ms | 31.8% bf16 MFU | 124811 tok/s step 14524/19560 | loss 3.307499 (-0.26z)| norm 0.2526 (+0.69z)| lr 1.37e-04 | 4209.30 ms | 32.1% bf16 MFU | 124799 tok/s step 14525/19560 | loss 3.318163 (+0.05z)| norm 0.2482 (+0.26z)| lr 1.37e-04 | 4177.37 ms | 32.3% bf16 MFU | 124834 tok/s step 14526/19560 | loss 3.348434 (+0.97z)| norm 0.2694 (+2.31z)| lr 1.37e-04 | 4164.06 ms | 32.4% bf16 MFU | 124888 tok/s step 14527/19560 | loss 3.308133 (-0.26z)| norm 0.2456 (+0.02z)| lr 1.37e-04 | 4176.38 ms | 32.3% bf16 MFU | 124920 tok/s step 14528/19560 | loss 3.385747 (+2.08z)| norm 0.2660 (+1.97z)| lr 1.37e-04 | 4193.27 ms | 32.2% bf16 MFU | 124926 tok/s step 14529/19560 | loss 3.333296 (+0.49z)| norm 0.2413 (-0.39z)| lr 1.37e-04 | 4184.00 ms | 32.3% bf16 MFU | 124945 tok/s step 14530/19560 | loss 3.319365 (+0.06z)| norm 0.2659 (+1.93z)| lr 1.36e-04 | 4177.08 ms | 32.3% bf16 MFU | 124973 tok/s step 14531/19560 | loss 3.307006 (-0.31z)| norm 0.2666 (+1.97z)| lr 1.36e-04 | 4175.77 ms | 32.3% bf16 MFU | 125002 tok/s step 14532/19560 | loss 3.356101 (+1.17z)| norm 0.2485 (+0.27z)| lr 1.36e-04 | 4174.38 ms | 32.3% bf16 MFU | 125032 tok/s step 14533/19560 | loss 3.401722 (+2.47z)| norm 0.2603 (+1.36z)| lr 1.36e-04 | 4180.46 ms | 
32.3% bf16 MFU | 125051 tok/s step 14534/19560 | loss 3.348380 (+0.89z)| norm 0.2572 (+1.06z)| lr 1.36e-04 | 4177.69 ms | 32.3% bf16 MFU | 125073 tok/s step 14535/19560 | loss 3.375300 (+1.65z)| norm 0.2687 (+2.07z)| lr 1.36e-04 | 4178.31 ms | 32.3% bf16 MFU | 125094 tok/s step 14536/19560 | loss 3.376695 (+1.65z)| norm 0.2536 (+0.68z)| lr 1.36e-04 | 4171.72 ms | 32.4% bf16 MFU | 125123 tok/s step 14537/19560 | loss 3.366433 (+1.34z)| norm 0.2493 (+0.29z)| lr 1.36e-04 | 4168.75 ms | 32.4% bf16 MFU | 125155 tok/s step 14538/19560 | loss 3.319886 (+0.00z)| norm 0.2633 (+1.54z)| lr 1.36e-04 | 4175.18 ms | 32.3% bf16 MFU | 125176 tok/s step 14539/19560 | loss 3.311865 (-0.23z)| norm 0.2415 (-0.44z)| lr 1.36e-04 | 4184.76 ms | 32.3% bf16 MFU | 125181 tok/s step 14540/19560 | loss 3.276761 (-1.24z)| norm 0.2647 (+1.63z)| lr 1.36e-04 | 4171.73 ms | 32.4% bf16 MFU | 125206 tok/s step 14541/19560 | loss 3.277892 (-1.19z)| norm 0.2624 (+1.41z)| lr 1.36e-04 | 4168.46 ms | 32.4% bf16 MFU | 125235 tok/s step 14542/19560 | loss 3.333662 (+0.43z)| norm 0.2675 (+1.82z)| lr 1.36e-04 | 4179.41 ms | 32.3% bf16 MFU | 125245 tok/s step 14543/19560 | loss 3.296776 (-0.64z)| norm 0.2666 (+1.79z)| lr 1.36e-04 | 4180.27 ms | 32.3% bf16 MFU | 125254 tok/s step 14544/19560 | loss 3.274297 (-1.28z)| norm 0.2852 (+3.30z)| lr 1.36e-04 | 4184.68 ms | 32.3% bf16 MFU | 125256 tok/s step 14545/19560 | loss 3.311780 (-0.19z)| norm 0.2527 (+0.48z)| lr 1.36e-04 | 4174.47 ms | 32.3% bf16 MFU | 125272 tok/s step 14546/19560 | loss 3.353316 (+0.99z)| norm 0.2801 (+2.79z)| lr 1.36e-04 | 4161.78 ms | 32.4% bf16 MFU | 125308 tok/s step 14547/19560 | loss 3.304775 (-0.39z)| norm 0.2587 (+0.97z)| lr 1.36e-04 | 4171.99 ms | 32.4% bf16 MFU | 125326 tok/s step 14548/19560 | loss 3.300009 (-0.52z)| norm 0.2652 (+1.50z)| lr 1.36e-04 | 4176.27 ms | 32.3% bf16 MFU | 125336 tok/s step 14549/19560 | loss 3.330962 (+0.37z)| norm 0.2590 (+0.97z)| lr 1.36e-04 | 4183.67 ms | 32.3% bf16 MFU | 125335 tok/s step 14550/19560 
| loss 3.308585 (-0.27z)| norm 0.2613 (+1.16z)| lr 1.35e-04 | 4165.38 ms | 32.4% bf16 MFU | 125362 tok/s step 14551/19560 | loss 3.311205 (-0.19z)| norm 0.2367 (-0.90z)| lr 1.35e-04 | 4173.08 ms | 32.4% bf16 MFU | 125376 tok/s step 14552/19560 | loss 3.280598 (-1.07z)| norm 0.2457 (-0.13z)| lr 1.35e-04 | 4173.88 ms | 32.3% bf16 MFU | 125388 tok/s step 14553/19560 | loss 3.303028 (-0.41z)| norm 0.2433 (-0.35z)| lr 1.35e-04 | 4169.00 ms | 32.4% bf16 MFU | 125406 tok/s step 14554/19560 | loss 3.318793 (+0.07z)| norm 0.2524 (+0.43z)| lr 1.35e-04 | 4180.01 ms | 32.3% bf16 MFU | 125407 tok/s step 14555/19560 | loss 3.313002 (-0.10z)| norm 0.2408 (-0.57z)| lr 1.35e-04 | 4170.95 ms | 32.4% bf16 MFU | 125422 tok/s step 14556/19560 | loss 3.299323 (-0.51z)| norm 0.2526 (+0.45z)| lr 1.35e-04 | 4170.57 ms | 32.4% bf16 MFU | 125436 tok/s step 14557/19560 | loss 3.361315 (+1.32z)| norm 0.2590 (+0.99z)| lr 1.35e-04 | 4163.83 ms | 32.4% bf16 MFU | 125460 tok/s step 14558/19560 | loss 3.397986 (+2.36z)| norm 0.2487 (+0.10z)| lr 1.35e-04 | 4197.76 ms | 32.2% bf16 MFU | 125432 tok/s step 14559/19560 | loss 3.318554 (+0.04z)| norm 0.2314 (-1.38z)| lr 1.35e-04 | 4174.70 ms | 32.3% bf16 MFU | 125440 tok/s step 14560/19560 | loss 3.359972 (+1.23z)| norm 0.2546 (+0.60z)| lr 1.35e-04 | 4173.09 ms | 32.4% bf16 MFU | 125450 tok/s step 14561/19560 | loss 3.325620 (+0.22z)| norm 0.2448 (-0.23z)| lr 1.35e-04 | 4177.88 ms | 32.3% bf16 MFU | 125452 tok/s step 14562/19560 | loss 3.324357 (+0.18z)| norm 0.2312 (-1.38z)| lr 1.35e-04 | 4181.43 ms | 32.3% bf16 MFU | 125448 tok/s step 14563/19560 | loss 3.322994 (+0.13z)| norm 0.2380 (-0.79z)| lr 1.35e-04 | 4167.65 ms | 32.4% bf16 MFU | 125466 tok/s step 14564/19560 | loss 3.323808 (+0.16z)| norm 0.2515 (+0.34z)| lr 1.35e-04 | 4163.15 ms | 32.4% bf16 MFU | 125489 tok/s step 14565/19560 | loss 3.311804 (-0.20z)| norm 0.2393 (-0.71z)| lr 1.35e-04 | 4166.83 ms | 32.4% bf16 MFU | 125506 tok/s step 14566/19560 | loss 3.285963 (-0.95z)| norm 0.2417 (-0.51z)| 
lr 1.35e-04 | 4166.41 ms | 32.4% bf16 MFU | 125523 tok/s step 14567/19560 | loss 3.322371 (+0.12z)| norm 0.2524 (+0.40z)| lr 1.35e-04 | 4173.23 ms | 32.4% bf16 MFU | 125528 tok/s step 14568/19560 | loss 3.228913 (-2.54z)| norm 0.2325 (-1.32z)| lr 1.35e-04 | 4169.38 ms | 32.4% bf16 MFU | 125539 tok/s step 14569/19560 | loss 3.295543 (-0.62z)| norm 0.2422 (-0.49z)| lr 1.35e-04 | 4171.90 ms | 32.4% bf16 MFU | 125546 tok/s step 14570/19560 | loss 3.372231 (+1.57z)| norm 0.2482 (+0.04z)| lr 1.34e-04 | 4190.36 ms | 32.2% bf16 MFU | 125524 tok/s step 14571/19560 | loss 3.277405 (-1.17z)| norm 0.2538 (+0.52z)| lr 1.34e-04 | 4187.08 ms | 32.2% bf16 MFU | 125509 tok/s step 14572/19560 | loss 3.349027 (+0.89z)| norm 0.2371 (-0.95z)| lr 1.34e-04 | 4158.66 ms | 32.5% bf16 MFU | 125537 tok/s step 14573/19560 | loss 3.405123 (+2.43z)| norm 0.2676 (+1.71z)| lr 1.34e-04 | 4156.69 ms | 32.5% bf16 MFU | 125567 tok/s step 14574/19560 | loss 3.283135 (-1.03z)| norm 0.2328 (-1.34z)| lr 1.34e-04 | 4170.51 ms | 32.4% bf16 MFU | 125574 tok/s step 14575/19560 | loss 3.323412 (+0.12z)| norm 0.2452 (-0.26z)| lr 1.34e-04 | 4179.73 ms | 32.3% bf16 MFU | 125567 tok/s step 14576/19560 | loss 3.277468 (-1.17z)| norm 0.2464 (-0.17z)| lr 1.34e-04 | 4162.75 ms | 32.4% bf16 MFU | 125586 tok/s step 14577/19560 | loss 3.257142 (-1.73z)| norm 0.2271 (-1.85z)| lr 1.34e-04 | 4174.32 ms | 32.3% bf16 MFU | 125587 tok/s step 14578/19560 | loss 3.327355 (+0.25z)| norm 0.2495 (+0.10z)| lr 1.34e-04 | 4187.04 ms | 32.2% bf16 MFU | 125568 tok/s step 14579/19560 | loss 3.276477 (-1.20z)| norm 0.2185 (-2.55z)| lr 1.34e-04 | 4181.14 ms | 32.3% bf16 MFU | 125559 tok/s step 14580/19560 | loss 3.297360 (-0.60z)| norm 0.2401 (-0.69z)| lr 1.34e-04 | 4177.77 ms | 32.3% bf16 MFU | 125556 tok/s step 14581/19560 | loss 3.283483 (-1.01z)| norm 0.2300 (-1.52z)| lr 1.34e-04 | 4183.13 ms | 32.3% bf16 MFU | 125545 tok/s step 14582/19560 | loss 3.310543 (-0.23z)| norm 0.2310 (-1.42z)| lr 1.34e-04 | 4177.93 ms | 32.3% bf16 MFU | 
125542 tok/s step 14583/19560 | loss 3.433645 (+3.18z)| norm 0.2543 (+0.54z)| lr 1.34e-04 | 4196.46 ms | 32.2% bf16 MFU | 125512 tok/s step 14584/19560 | loss 3.288243 (-0.84z)| norm 0.2345 (-1.11z)| lr 1.34e-04 | 4173.13 ms | 32.4% bf16 MFU | 125518 tok/s step 14585/19560 | loss 3.278155 (-1.11z)| norm 0.2562 (+0.70z)| lr 1.34e-04 | 4265.11 ms | 31.7% bf16 MFU | 125388 tok/s step 14586/19560 | loss 3.342428 (+0.65z)| norm 0.2516 (+0.31z)| lr 1.34e-04 | 4207.15 ms | 32.1% bf16 MFU | 125350 tok/s step 14587/19560 | loss 3.249381 (-1.88z)| norm 0.2646 (+1.38z)| lr 1.34e-04 | 4174.36 ms | 32.3% bf16 MFU | 125362 tok/s step 14588/19560 | loss 3.270861 (-1.29z)| norm 0.2575 (+0.77z)| lr 1.34e-04 | 4235.50 ms | 31.9% bf16 MFU | 125283 tok/s step 14589/19560 | loss 3.307977 (-0.28z)| norm 0.2394 (-0.74z)| lr 1.33e-04 | 4165.66 ms | 32.4% bf16 MFU | 125312 tok/s step 14590/19560 | loss 3.325827 (+0.20z)| norm 0.2384 (-0.82z)| lr 1.33e-04 | 4176.62 ms | 32.3% bf16 MFU | 125323 tok/s step 14591/19560 | loss 3.320776 (+0.06z)| norm 0.2461 (-0.17z)| lr 1.33e-04 | 4272.03 ms | 31.6% bf16 MFU | 125193 tok/s step 14592/19560 | loss 3.331231 (+0.34z)| norm 0.2394 (-0.75z)| lr 1.33e-04 | 4170.81 ms | 32.4% bf16 MFU | 125219 tok/s step 14593/19560 | loss 3.299180 (-0.55z)| norm 0.2502 (+0.18z)| lr 1.33e-04 | 4171.12 ms | 32.4% bf16 MFU | 125243 tok/s step 14594/19560 | loss 3.300143 (-0.52z)| norm 0.2416 (-0.56z)| lr 1.33e-04 | 4170.90 ms | 32.4% bf16 MFU | 125265 tok/s step 14595/19560 | loss 3.220356 (-2.63z)| norm 0.2524 (+0.36z)| lr 1.33e-04 | 4212.97 ms | 32.0% bf16 MFU | 125225 tok/s step 14596/19560 | loss 3.322154 (+0.12z)| norm 0.2442 (-0.35z)| lr 1.33e-04 | 4177.41 ms | 32.3% bf16 MFU | 125239 tok/s step 14597/19560 | loss 3.358634 (+1.08z)| norm 0.2336 (-1.27z)| lr 1.33e-04 | 4179.52 ms | 32.3% bf16 MFU | 125249 tok/s step 14598/19560 | loss 3.291663 (-0.71z)| norm 0.2453 (-0.27z)| lr 1.33e-04 | 4528.98 ms | 29.8% bf16 MFU | 124774 tok/s step 14599/19560 | loss 3.309439 
(-0.24z)| norm 0.2316 (-1.44z)| lr 1.33e-04 | 4754.88 ms | 28.4% bf16 MFU | 124049 tok/s step 14600/19560 | loss 3.291581 (-0.72z)| norm 0.2196 (-2.39z)| lr 1.33e-04 | 4497.78 ms | 30.0% bf16 MFU | 123675 tok/s step 14601/19560 | loss 3.291974 (-0.70z)| norm 0.2383 (-0.81z)| lr 1.33e-04 | 4359.44 ms | 31.0% bf16 MFU | 123504 tok/s step 14602/19560 | loss 3.326992 (+0.23z)| norm 0.2388 (-0.77z)| lr 1.33e-04 | 4326.89 ms | 31.2% bf16 MFU | 123388 tok/s step 14603/19560 | loss 3.330090 (+0.31z)| norm 0.2509 (+0.26z)| lr 1.33e-04 | 4198.58 ms | 32.2% bf16 MFU | 123462 tok/s step 14604/19560 | loss 3.338224 (+0.52z)| norm 0.2529 (+0.44z)| lr 1.33e-04 | 4308.01 ms | 31.3% bf16 MFU | 123374 tok/s step 14605/19560 | loss 3.365761 (+1.25z)| norm 0.2622 (+1.23z)| lr 1.33e-04 | 4164.90 ms | 32.4% bf16 MFU | 123499 tok/s step 14606/19560 | loss 3.359946 (+1.08z)| norm 0.2578 (+0.85z)| lr 1.33e-04 | 4242.75 ms | 31.8% bf16 MFU | 123503 tok/s step 14607/19560 | loss 3.320679 (+0.03z)| norm 0.2420 (-0.51z)| lr 1.33e-04 | 4207.15 ms | 32.1% bf16 MFU | 123559 tok/s step 14608/19560 | loss 3.332806 (+0.34z)| norm 0.2476 (-0.03z)| lr 1.33e-04 | 4319.73 ms | 31.3% bf16 MFU | 123449 tok/s step 14609/19560 | loss 3.249373 (-1.86z)| norm 0.2419 (-0.52z)| lr 1.32e-04 | 4238.40 ms | 31.9% bf16 MFU | 123462 tok/s step 14610/19560 | loss 3.338972 (+0.51z)| norm 0.2496 (+0.14z)| lr 1.32e-04 | 4164.57 ms | 32.4% bf16 MFU | 123583 tok/s step 14611/19560 | loss 3.296751 (-0.60z)| norm 0.2293 (-1.58z)| lr 1.32e-04 | 4410.24 ms | 30.6% bf16 MFU | 123348 tok/s step 14612/19560 | loss 3.324665 (+0.16z)| norm 0.2438 (-0.33z)| lr 1.32e-04 | 4248.48 ms | 31.8% bf16 MFU | 123351 tok/s step 14613/19560 | loss 3.285737 (-0.88z)| norm 0.2514 (+0.33z)| lr 1.32e-04 | 4175.36 ms | 32.3% bf16 MFU | 123462 tok/s step 14614/19560 | loss 3.304653 (-0.37z)| norm 0.2751 (+2.29z)| lr 1.32e-04 | 4160.12 ms | 32.5% bf16 MFU | 123590 tok/s step 14615/19560 | loss 3.339190 (+0.55z)| norm 0.2320 (-1.32z)| lr 1.32e-04 | 
4209.39 ms | 32.1% bf16 MFU | 123638 tok/s step 14616/19560 | loss 3.413755 (+2.57z)| norm 0.2506 (+0.24z)| lr 1.32e-04 | 4218.17 ms | 32.0% bf16 MFU | 123671 tok/s step 14617/19560 | loss 3.355069 (+0.98z)| norm 0.2427 (-0.43z)| lr 1.32e-04 | 4169.27 ms | 32.4% bf16 MFU | 123775 tok/s step 14618/19560 | loss 3.320557 (+0.04z)| norm 0.2326 (-1.27z)| lr 1.32e-04 | 4178.59 ms | 32.3% bf16 MFU | 123860 tok/s step 14619/19560 | loss 3.375947 (+1.51z)| norm 0.2539 (+0.51z)| lr 1.32e-04 | 4190.98 ms | 32.2% bf16 MFU | 123922 tok/s step 14620/19560 | loss 3.297959 (-0.57z)| norm 0.2607 (+1.07z)| lr 1.32e-04 | 4198.29 ms | 32.2% bf16 MFU | 123970 tok/s step 14621/19560 | loss 3.207637 (-2.88z)| norm 0.2336 (-1.18z)| lr 1.32e-04 | 4267.26 ms | 31.6% bf16 MFU | 123914 tok/s step 14622/19560 | loss 3.303081 (-0.38z)| norm 0.2659 (+1.49z)| lr 1.32e-04 | 4161.34 ms | 32.4% bf16 MFU | 124018 tok/s step 14623/19560 | loss 3.368424 (+1.33z)| norm 0.2557 (+0.65z)| lr 1.32e-04 | 4163.20 ms | 32.4% bf16 MFU | 124114 tok/s step 14624/19560 | loss 3.332458 (+0.38z)| norm 0.2498 (+0.17z)| lr 1.32e-04 | 4179.02 ms | 32.3% bf16 MFU | 124181 tok/s step 14625/19560 | loss 3.291066 (-0.69z)| norm 0.2508 (+0.25z)| lr 1.32e-04 | 4176.82 ms | 32.3% bf16 MFU | 124248 tok/s step 14626/19560 | loss 3.292491 (-0.65z)| norm 0.2414 (-0.52z)| lr 1.32e-04 | 4184.39 ms | 32.3% bf16 MFU | 124300 tok/s step 14627/19560 | loss 3.376853 (+1.52z)| norm 0.2704 (+1.84z)| lr 1.32e-04 | 4225.04 ms | 32.0% bf16 MFU | 124290 tok/s step 14628/19560 | loss 3.300057 (-0.46z)| norm 0.2637 (+1.28z)| lr 1.32e-04 | 4165.28 ms | 32.4% bf16 MFU | 124369 tok/s step 14629/19560 | loss 3.397652 (+2.01z)| norm 0.2620 (+1.13z)| lr 1.31e-04 | 4202.21 ms | 32.1% bf16 MFU | 124389 tok/s step 14630/19560 | loss 3.362007 (+1.09z)| norm 0.2559 (+0.62z)| lr 1.31e-04 | 4158.28 ms | 32.5% bf16 MFU | 124474 tok/s step 14631/19560 | loss 3.315901 (-0.08z)| norm 0.2571 (+0.71z)| lr 1.31e-04 | 4216.27 ms | 32.0% bf16 MFU | 124467 tok/s step 
14632/19560 | loss 3.351195 (+0.81z)| norm 0.2495 (+0.09z)| lr 1.31e-04 | 4159.08 ms | 32.5% bf16 MFU | 124547 tok/s step 14633/19560 | loss 3.339763 (+0.52z)| norm 0.2559 (+0.61z)| lr 1.31e-04 | 4167.20 ms | 32.4% bf16 MFU | 124610 tok/s step 14634/19560 | loss 3.301809 (-0.44z)| norm 0.2683 (+1.59z)| lr 1.31e-04 | 4162.23 ms | 32.4% bf16 MFU | 124678 tok/s step 14635/19560 | loss 3.403070 (+2.07z)| norm 0.2690 (+1.62z)| lr 1.31e-04 | 4223.40 ms | 32.0% bf16 MFU | 124651 tok/s step 14636/19560 | loss 3.287395 (-0.81z)| norm 0.2540 (+0.42z)| lr 1.31e-04 | 4168.56 ms | 32.4% bf16 MFU | 124707 tok/s step 14637/19560 | loss 3.241840 (-1.90z)| norm 0.2594 (+0.84z)| lr 1.31e-04 | 4167.10 ms | 32.4% bf16 MFU | 124762 tok/s step 14638/19560 | loss 3.332009 (+0.32z)| norm 0.2397 (-0.73z)| lr 1.31e-04 | 4179.68 ms | 32.3% bf16 MFU | 124796 tok/s step 14639/19560 | loss 3.293235 (-0.64z)| norm 0.2568 (+0.62z)| lr 1.31e-04 | 4175.06 ms | 32.3% bf16 MFU | 124835 tok/s step 14640/19560 | loss 3.320809 (+0.04z)| norm 0.2513 (+0.18z)| lr 1.31e-04 | 4219.69 ms | 32.0% bf16 MFU | 124806 tok/s step 14641/19560 | loss 3.329994 (+0.27z)| norm 0.2533 (+0.32z)| lr 1.31e-04 | 4163.97 ms | 32.4% bf16 MFU | 124861 tok/s step 14642/19560 | loss 3.324466 (+0.12z)| norm 0.2361 (-1.05z)| lr 1.31e-04 | 4180.26 ms | 32.3% bf16 MFU | 124889 tok/s step 14643/19560 | loss 3.271653 (-1.21z)| norm 0.2409 (-0.66z)| lr 1.31e-04 | 4225.01 ms | 32.0% bf16 MFU | 124849 tok/s step 14644/19560 | loss 3.317202 (-0.07z)| norm 0.2442 (-0.41z)| lr 1.31e-04 | 4169.03 ms | 32.4% bf16 MFU | 124895 tok/s step 14645/19560 | loss 3.331004 (+0.27z)| norm 0.2561 (+0.54z)| lr 1.31e-04 | 4164.46 ms | 32.4% bf16 MFU | 124945 tok/s step 14646/19560 | loss 3.301650 (-0.48z)| norm 0.2316 (-1.42z)| lr 1.31e-04 | 4175.94 ms | 32.3% bf16 MFU | 124975 tok/s step 14647/19560 | loss 3.341991 (+0.55z)| norm 0.6659 (+10.68z)| lr 1.31e-04 | 4404.73 ms | 30.7% bf16 MFU | 124678 tok/s step 14648/19560 | loss 3.334674 (+0.35z)| norm 
0.2440 (-0.23z)| lr 1.31e-04 | 4191.61 ms | 32.2% bf16 MFU | 124698 tok/s step 14649/19560 | loss 3.256969 (-1.59z)| norm 0.2335 (-0.49z)| lr 1.30e-04 | 4282.19 ms | 31.5% bf16 MFU | 124584 tok/s step 14650/19560 | loss 3.311390 (-0.22z)| norm 0.2575 (+0.12z)| lr 1.30e-04 | 4170.99 ms | 32.4% bf16 MFU | 124640 tok/s step 14651/19560 | loss 3.371623 (+1.32z)| norm 0.2496 (-0.08z)| lr 1.30e-04 | 4953.42 ms | 27.3% bf16 MFU | 123700 tok/s step 14652/19560 | loss 3.299776 (-0.51z)| norm 0.2459 (-0.18z)| lr 1.30e-04 | 4186.29 ms | 32.3% bf16 MFU | 123777 tok/s step 14653/19560 | loss 3.279925 (-1.00z)| norm 0.2323 (-0.53z)| lr 1.30e-04 | 4205.71 ms | 32.1% bf16 MFU | 123821 tok/s step 14654/19560 | loss 3.360312 (+1.04z)| norm 0.2349 (-0.45z)| lr 1.30e-04 | 4168.72 ms | 32.4% bf16 MFU | 123919 tok/s step 14655/19560 | loss 3.301273 (-0.46z)| norm 0.2509 (-0.04z)| lr 1.30e-04 | 4166.47 ms | 32.4% bf16 MFU | 124015 tok/s step 14656/19560 | loss 3.289630 (-0.74z)| norm 0.2354 (-0.43z)| lr 1.30e-04 | 4254.88 ms | 31.7% bf16 MFU | 123975 tok/s step 14657/19560 | loss 3.345329 (+0.68z)| norm 0.2421 (-0.26z)| lr 1.30e-04 | 4170.65 ms | 32.4% bf16 MFU | 124062 tok/s step 14658/19560 | loss 3.287398 (-0.79z)| norm 0.2400 (-0.31z)| lr 1.30e-04 | 4165.10 ms | 32.4% bf16 MFU | 124152 tok/s step 14659/19560 | loss 3.340145 (+0.54z)| norm 0.2566 (+0.12z)| lr 1.30e-04 | 4174.68 ms | 32.3% bf16 MFU | 124224 tok/s step 14660/19560 | loss 3.313417 (-0.13z)| norm 0.2364 (-0.40z)| lr 1.30e-04 | 4233.06 ms | 31.9% bf16 MFU | 124206 tok/s step 14661/19560 | loss 3.352639 (+0.90z)| norm 0.2492 (-0.07z)| lr 1.30e-04 | 4654.24 ms | 29.0% bf16 MFU | 123628 tok/s step 14662/19560 | loss 3.260016 (-1.47z)| norm 0.2331 (-0.48z)| lr 1.30e-04 | 4214.57 ms | 32.0% bf16 MFU | 123666 tok/s step 14663/19560 | loss 3.277830 (-1.00z)| norm 0.2455 (-0.15z)| lr 1.30e-04 | 4175.13 ms | 32.3% bf16 MFU | 123762 tok/s step 14664/19560 | loss 3.299828 (-0.42z)| norm 0.2613 (+0.25z)| lr 1.30e-04 | 4234.49 ms | 
31.9% bf16 MFU | 123764 tok/s step 14665/19560 | loss 3.401015 (+2.19z)| norm 0.2336 (-0.46z)| lr 1.30e-04 | 4307.70 ms | 31.3% bf16 MFU | 123662 tok/s step 14666/19560 | loss 3.300295 (-0.41z)| norm 0.2358 (-0.40z)| lr 1.30e-04 | 4168.25 ms | 32.4% bf16 MFU | 123768 tok/s step 14667/19560 | loss 3.322902 (+0.17z)| norm 0.2353 (-0.41z)| lr 1.30e-04 | 4179.61 ms | 32.3% bf16 MFU | 123851 tok/s step 14668/19560 | loss 3.334997 (+0.48z)| norm 0.2459 (-0.13z)| lr 1.30e-04 | 4199.51 ms | 32.2% bf16 MFU | 123901 tok/s step 14669/19560 | loss 3.286404 (-0.78z)| norm 0.2346 (-0.41z)| lr 1.29e-04 | 4181.87 ms | 32.3% bf16 MFU | 123974 tok/s step 14670/19560 | loss 3.365030 (+1.24z)| norm 0.2475 (-0.08z)| lr 1.29e-04 | 4163.21 ms | 32.4% bf16 MFU | 124072 tok/s step 14671/19560 | loss 3.315071 (-0.05z)| norm 0.2380 (-0.32z)| lr 1.29e-04 | 4285.67 ms | 31.5% bf16 MFU | 123985 tok/s step 14672/19560 | loss 3.370755 (+1.36z)| norm 0.2471 (-0.08z)| lr 1.29e-04 | 4180.53 ms | 32.3% bf16 MFU | 124057 tok/s step 14673/19560 | loss 3.310627 (-0.18z)| norm 0.2316 (-0.47z)| lr 1.29e-04 | 4177.13 ms | 32.3% bf16 MFU | 124130 tok/s step 14674/19560 | loss 3.337476 (+0.51z)| norm 0.2475 (-0.05z)| lr 1.29e-04 | 4180.56 ms | 32.3% bf16 MFU | 124194 tok/s step 14675/19560 | loss 3.372221 (+1.38z)| norm 0.2397 (-0.25z)| lr 1.29e-04 | 4160.69 ms | 32.5% bf16 MFU | 124285 tok/s step 14676/19560 | loss 3.354483 (+0.91z)| norm 0.2478 (-0.04z)| lr 1.29e-04 | 4184.60 ms | 32.3% bf16 MFU | 124335 tok/s step 14677/19560 | loss 3.333953 (+0.39z)| norm 0.2324 (-0.43z)| lr 1.29e-04 | 4173.02 ms | 32.4% bf16 MFU | 124400 tok/s step 14678/19560 | loss 3.264716 (-1.36z)| norm 0.2305 (-0.48z)| lr 1.29e-04 | 4175.15 ms | 32.3% bf16 MFU | 124459 tok/s step 14679/19560 | loss 3.315691 (-0.07z)| norm 0.2457 (-0.08z)| lr 1.29e-04 | 4171.66 ms | 32.4% bf16 MFU | 124520 tok/s step 14680/19560 | loss 3.288628 (-0.76z)| norm 0.2376 (-0.29z)| lr 1.29e-04 | 4192.86 ms | 32.2% bf16 MFU | 124546 tok/s step 14681/19560 
| loss 3.343754 (+0.63z)| norm 0.2299 (-0.49z)| lr 1.29e-04 | 4705.16 ms | 28.7% bf16 MFU | 123890 tok/s step 14682/19560 | loss 3.296885 (-0.55z)| norm 0.2337 (-0.39z)| lr 1.29e-04 | 4208.12 ms | 32.1% bf16 MFU | 123925 tok/s step 14683/19560 | loss 3.367662 (+1.23z)| norm 0.2407 (-0.20z)| lr 1.29e-04 | 4184.96 ms | 32.3% bf16 MFU | 123993 tok/s step 14684/19560 | loss 3.318309 (-0.02z)| norm 0.2303 (-0.47z)| lr 1.29e-04 | 4273.73 ms | 31.6% bf16 MFU | 123927 tok/s step 14685/19560 | loss 3.340076 (+0.53z)| norm 0.2345 (-0.36z)| lr 1.29e-04 | 4170.64 ms | 32.4% bf16 MFU | 124016 tok/s step 14686/19560 | loss 3.321845 (+0.09z)| norm 0.2351 (-0.34z)| lr 1.29e-04 | 4160.81 ms | 32.4% bf16 MFU | 124115 tok/s step 14687/19560 | loss 3.305414 (-0.33z)| norm 0.2389 (-0.24z)| lr 1.29e-04 | 4219.16 ms | 32.0% bf16 MFU | 124123 tok/s step 14688/19560 | loss 3.300833 (-0.44z)| norm 0.2552 (+0.18z)| lr 1.29e-04 | 4176.90 ms | 32.3% bf16 MFU | 124193 tok/s step 14689/19560 | loss 3.314732 (-0.08z)| norm 0.2278 (-0.52z)| lr 1.28e-04 | 4177.49 ms | 32.3% bf16 MFU | 124258 tok/s step 14690/19560 | loss 3.294620 (-0.59z)| norm 0.2281 (-0.51z)| lr 1.28e-04 | 4177.51 ms | 32.3% bf16 MFU | 124320 tok/s step 14691/19560 | loss 3.391426 (+1.86z)| norm 0.2386 (-0.25z)| lr 1.28e-04 | 4178.20 ms | 32.3% bf16 MFU | 124379 tok/s step 14692/19560 | loss 3.337701 (+0.50z)| norm 0.2465 (-0.04z)| lr 1.28e-04 | 4185.89 ms | 32.3% bf16 MFU | 124422 tok/s step 14693/19560 | loss 3.328180 (+0.25z)| norm 0.2341 (-0.36z)| lr 1.28e-04 | 4285.59 ms | 31.5% bf16 MFU | 124318 tok/s step 14694/19560 | loss 3.369208 (+1.27z)| norm 0.2469 (-0.03z)| lr 1.28e-04 | 4172.59 ms | 32.4% bf16 MFU | 124385 tok/s step 14695/19560 | loss 3.355378 (+0.91z)| norm 0.2437 (-0.11z)| lr 1.28e-04 | 4184.22 ms | 32.3% bf16 MFU | 124430 tok/s step 14696/19560 | loss 3.353237 (+0.85z)| norm 0.2538 (+0.15z)| lr 1.28e-04 | 4180.11 ms | 32.3% bf16 MFU | 124480 tok/s step 14697/19560 | loss 3.372299 (+1.31z)| norm 0.2469 (-0.03z)| 
lr 1.28e-04 | 4191.11 ms | 32.2% bf16 MFU | 124511 tok/s step 14698/19560 | loss 3.352345 (+0.81z)| norm 0.2549 (+0.17z)| lr 1.28e-04 | 4183.06 ms | 32.3% bf16 MFU | 124552 tok/s step 14699/19560 | loss 3.329873 (+0.23z)| norm 0.2521 (+0.10z)| lr 1.28e-04 | 4164.18 ms | 32.4% bf16 MFU | 124620 tok/s step 14700/19560 | loss 3.334396 (+0.35z)| norm 0.2429 (-0.14z)| lr 1.28e-04 | 4189.68 ms | 32.2% bf16 MFU | 124646 tok/s step 14701/19560 | loss 3.320292 (+0.00z)| norm 0.2607 (+0.32z)| lr 1.28e-04 | 4180.29 ms | 32.3% bf16 MFU | 124684 tok/s step 14702/19560 | loss 3.280330 (-1.05z)| norm 0.2534 (+0.13z)| lr 1.28e-04 | 4313.30 ms | 31.3% bf16 MFU | 124528 tok/s step 14703/19560 | loss 3.223888 (-2.45z)| norm 0.2491 (+0.02z)| lr 1.28e-04 | 4171.22 ms | 32.4% bf16 MFU | 124586 tok/s step 14704/19560 | loss 3.311574 (-0.21z)| norm 0.2645 (+0.41z)| lr 1.28e-04 | 4168.71 ms | 32.4% bf16 MFU | 124645 tok/s step 14705/19560 | loss 3.301471 (-0.48z)| norm 0.2467 (-0.05z)| lr 1.28e-04 | 4172.87 ms | 32.4% bf16 MFU | 124695 tok/s step 14706/19560 | loss 3.331777 (+0.30z)| norm 0.2404 (-0.21z)| lr 1.28e-04 | 4185.90 ms | 32.3% bf16 MFU | 124723 tok/s step 14707/19560 | loss 3.341307 (+0.54z)| norm 0.2442 (-0.12z)| lr 1.28e-04 | 4176.53 ms | 32.3% bf16 MFU | 124763 tok/s step 14708/19560 | loss 3.345327 (+0.64z)| norm 0.2578 (+0.23z)| lr 1.28e-04 | 4344.98 ms | 31.1% bf16 MFU | 124558 tok/s step 14709/19560 | loss 3.262712 (-1.51z)| norm 0.2293 (-0.51z)| lr 1.27e-04 | 4177.64 ms | 32.3% bf16 MFU | 124605 tok/s step 14710/19560 | loss 3.377043 (+1.43z)| norm 0.2482 (-0.02z)| lr 1.27e-04 | 4176.78 ms | 32.3% bf16 MFU | 124651 tok/s step 14711/19560 | loss 3.306112 (-0.38z)| norm 0.2345 (-0.37z)| lr 1.27e-04 | 4182.72 ms | 32.3% bf16 MFU | 124686 tok/s step 14712/19560 | loss 3.316027 (-0.12z)| norm 0.2284 (-0.53z)| lr 1.27e-04 | 4189.53 ms | 32.2% bf16 MFU | 124709 tok/s step 14713/19560 | loss 3.264743 (-1.48z)| norm 0.2492 (+0.01z)| lr 1.27e-04 | 4201.68 ms | 32.1% bf16 MFU | 
124712 tok/s step 14714/19560 | loss 3.318019 (-0.06z)| norm 0.2418 (-0.18z)| lr 1.27e-04 | 4170.25 ms | 32.4% bf16 MFU | 124763 tok/s step 14715/19560 | loss 3.292098 (-0.77z)| norm 0.2480 (-0.02z)| lr 1.27e-04 | 4168.70 ms | 32.4% bf16 MFU | 124813 tok/s step 14716/19560 | loss 3.376463 (+1.48z)| norm 0.2323 (-0.42z)| lr 1.27e-04 | 4178.12 ms | 32.3% bf16 MFU | 124847 tok/s step 14717/19560 | loss 3.359826 (+1.02z)| norm 0.2460 (-0.06z)| lr 1.27e-04 | 4180.20 ms | 32.3% bf16 MFU | 124875 tok/s step 14718/19560 | loss 3.310421 (-0.30z)| norm 0.2555 (+0.18z)| lr 1.27e-04 | 4172.71 ms | 32.4% bf16 MFU | 124914 tok/s step 14719/19560 | loss 3.323564 (+0.05z)| norm 0.2285 (-0.52z)| lr 1.27e-04 | 4198.39 ms | 32.2% bf16 MFU | 124912 tok/s step 14720/19560 | loss 3.340195 (+0.49z)| norm 0.2585 (+0.26z)| lr 1.27e-04 | 4175.00 ms | 32.3% bf16 MFU | 124945 tok/s step 14721/19560 | loss 3.319036 (-0.08z)| norm 0.2256 (-0.59z)| lr 1.27e-04 | 4180.25 ms | 32.3% bf16 MFU | 124969 tok/s step 14722/19560 | loss 3.351243 (+0.77z)| norm 0.2567 (+0.21z)| lr 1.27e-04 | 4180.45 ms | 32.3% bf16 MFU | 124991 tok/s step 14723/19560 | loss 3.306850 (-0.45z)| norm 0.2336 (-0.38z)| lr 1.27e-04 | 4184.30 ms | 32.3% bf16 MFU | 125007 tok/s step 14724/19560 | loss 3.322960 (-0.00z)| norm 0.2363 (-0.31z)| lr 1.27e-04 | 4170.51 ms | 32.4% bf16 MFU | 125042 tok/s step 14725/19560 | loss 3.337544 (+0.41z)| norm 0.2321 (-0.42z)| lr 1.27e-04 | 4194.74 ms | 32.2% bf16 MFU | 125039 tok/s step 14726/19560 | loss 3.256100 (-1.82z)| norm 0.2416 (-0.17z)| lr 1.27e-04 | 4160.82 ms | 32.4% bf16 MFU | 125088 tok/s step 14727/19560 | loss 3.263378 (-1.60z)| norm 0.2474 (-0.03z)| lr 1.27e-04 | 4170.46 ms | 32.4% bf16 MFU | 125119 tok/s step 14728/19560 | loss 3.320485 (-0.05z)| norm 0.2399 (-0.23z)| lr 1.27e-04 | 4176.35 ms | 32.3% bf16 MFU | 125140 tok/s step 14729/19560 | loss 3.313011 (-0.26z)| norm 0.2364 (-0.31z)| lr 1.26e-04 | 4185.54 ms | 32.3% bf16 MFU | 125146 tok/s step 14730/19560 | loss 3.350719 
(+0.76z)| norm 0.2410 (-0.20z)| lr 1.26e-04 | 4178.31 ms | 32.3% bf16 MFU | 125163 tok/s step 14731/19560 | loss 3.367269 (+1.20z)| norm 0.2467 (-0.05z)| lr 1.26e-04 | 4226.63 ms | 31.9% bf16 MFU | 125107 tok/s step 14732/19560 | loss 3.342006 (+0.51z)| norm 0.2241 (-0.63z)| lr 1.26e-04 | 4186.62 ms | 32.2% bf16 MFU | 125113 tok/s step 14733/19560 | loss 3.273016 (-1.33z)| norm 0.2318 (-0.42z)| lr 1.26e-04 | 4187.40 ms | 32.2% bf16 MFU | 125117 tok/s step 14734/19560 | loss 3.313220 (-0.24z)| norm 0.2345 (-0.35z)| lr 1.26e-04 | 4184.75 ms | 32.3% bf16 MFU | 125126 tok/s step 14735/19560 | loss 3.280911 (-1.10z)| norm 0.2284 (-0.50z)| lr 1.26e-04 | 4265.56 ms | 31.7% bf16 MFU | 125015 tok/s step 14736/19560 | loss 3.423470 (+2.65z)| norm 0.2450 (-0.07z)| lr 1.26e-04 | 4180.68 ms | 32.3% bf16 MFU | 125035 tok/s step 14737/19560 | loss 3.299094 (-0.63z)| norm 0.2400 (-0.20z)| lr 1.26e-04 | 4173.79 ms | 32.3% bf16 MFU | 125064 tok/s step 14738/19560 | loss 3.352542 (+0.79z)| norm 0.2463 (-0.04z)| lr 1.26e-04 | 4180.31 ms | 32.3% bf16 MFU | 125082 tok/s step 14739/19560 | loss 3.325019 (+0.05z)| norm 0.2413 (-0.17z)| lr 1.26e-04 | 4166.28 ms | 32.4% bf16 MFU | 125119 tok/s step 14740/19560 | loss 3.306438 (-0.44z)| norm 0.2542 (+0.16z)| lr 1.26e-04 | 4172.82 ms | 32.4% bf16 MFU | 125146 tok/s step 14741/19560 | loss 3.298906 (-0.64z)| norm 0.2420 (-0.15z)| lr 1.26e-04 | 4169.47 ms | 32.4% bf16 MFU | 125176 tok/s step 14742/19560 | loss 3.327952 (+0.12z)| norm 0.2475 (-0.00z)| lr 1.26e-04 | 4184.80 ms | 32.3% bf16 MFU | 125181 tok/s step 14743/19560 | loss 3.345909 (+0.60z)| norm 0.2548 (+0.18z)| lr 1.26e-04 | 4199.52 ms | 32.2% bf16 MFU | 125164 tok/s step 14744/19560 | loss 3.231789 (-2.40z)| norm 0.2326 (-0.39z)| lr 1.26e-04 | 4198.80 ms | 32.2% bf16 MFU | 125149 tok/s step 14745/19560 | loss 3.324651 (+0.08z)| norm 0.2500 (+0.06z)| lr 1.26e-04 | 4184.32 ms | 32.3% bf16 MFU | 125157 tok/s step 14746/19560 | loss 3.289557 (-0.85z)| norm 0.2337 (-0.36z)| lr 1.26e-04 | 
4174.20 ms | 32.3% bf16 MFU | 125179 tok/s step 14747/19560 | loss 3.277218 (-1.16z)| norm 0.2486 (+0.02z)| lr 1.26e-04 | 4182.83 ms | 32.3% bf16 MFU | 125187 tok/s step 14748/19560 | loss 3.329280 (+0.22z)| norm 0.2490 (+0.04z)| lr 1.26e-04 | 4185.44 ms | 32.3% bf16 MFU | 125191 tok/s step 14749/19560 | loss 3.260250 (-1.68z)| norm 0.2607 (+0.34z)| lr 1.26e-04 | 4169.02 ms | 32.4% bf16 MFU | 125219 tok/s step 14750/19560 | loss 3.331611 (+0.28z)| norm 0.2401 (-0.19z)| lr 1.25e-04 | 4165.57 ms | 32.4% bf16 MFU | 125252 tok/s val loss 3.287387 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 
evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3016/10042 = 0.300339 step 14751/19560 | loss 3.324341 (+0.09z)| norm 0.2512 (+0.10z)| lr 1.25e-04 | 4288.75 ms | 31.5% bf16 MFU | 125101 tok/s step 14752/19560 | loss 3.289523 (-0.86z)| norm 0.2510 (+0.09z)| lr 1.25e-04 | 5202.97 ms | 26.0% bf16 MFU | 123885 tok/s step 14753/19560 | loss 3.373769 (+1.43z)| norm 0.2666 (+0.49z)| lr 1.25e-04 | 4526.90 ms | 29.8% bf16 MFU | 123481 tok/s step 14754/19560 | loss 3.350159 (+0.77z)| norm 0.2375 (-0.26z)| lr 1.25e-04 | 4360.37 ms | 31.0% bf16 MFU | 123319 tok/s step 14755/19560 | loss 3.254445 (-1.82z)| norm 0.2498 (+0.06z)| lr 1.25e-04 | 4294.42 ms | 31.4% bf16 MFU | 123257 tok/s step 14756/19560 | loss 3.283809 (-1.01z)| norm 0.2422 (-0.13z)| lr 1.25e-04 | 4436.71 ms | 30.4% bf16 MFU | 123003 tok/s step 14757/19560 | loss 3.294839 (-0.70z)| norm 0.2465 (-0.02z)| lr 1.25e-04 | 4172.75 ms | 32.4% bf16 MFU | 123135 tok/s step 14758/19560 | loss 3.347171 (+0.76z)| norm 0.2418 (-0.14z)| lr 1.25e-04 | 4254.27 ms | 31.7% bf16 MFU | 123140 tok/s step 14759/19560 | loss 3.342565 (+0.62z)| norm 0.2483 (+0.03z)| lr 1.25e-04 | 4235.09 ms | 31.9% bf16 MFU | 123173 tok/s step 14760/19560 | loss 3.288555 (-0.86z)| norm 0.2516 (+0.12z)| lr 1.25e-04 | 4287.80 ms | 31.5% bf16 MFU | 123128 tok/s step 14761/19560 | loss 3.316262 (-0.09z)| norm 0.2360 (-0.28z)| lr 1.25e-04 | 4375.11 ms | 30.9% bf16 MFU | 122964 tok/s step 14762/19560 | loss 3.271743 (-1.31z)| norm 0.2494 (+0.07z)| lr 1.25e-04 | 4166.09 ms | 32.4% bf16 MFU | 123108 tok/s step 14763/19560 | 
loss 3.339229 (+0.58z)| norm 0.2490 (+0.06z)| lr 1.25e-04 | 4216.34 ms | 32.0% bf16 MFU | 123170 tok/s step 14764/19560 | loss 3.315015 (-0.11z)| norm 0.2348 (-0.30z)| lr 1.25e-04 | 4310.93 ms | 31.3% bf16 MFU | 123092 tok/s step 14765/19560 | loss 3.327374 (+0.22z)| norm 0.2408 (-0.14z)| lr 1.25e-04 | 4189.91 ms | 32.2% bf16 MFU | 123194 tok/s step 14766/19560 | loss 3.381029 (+1.74z)| norm 0.2496 (+0.08z)| lr 1.25e-04 | 4310.74 ms | 31.3% bf16 MFU | 123115 tok/s step 14767/19560 | loss 3.310957 (-0.26z)| norm 0.2347 (-0.30z)| lr 1.25e-04 | 4224.73 ms | 32.0% bf16 MFU | 123165 tok/s step 14768/19560 | loss 3.306219 (-0.39z)| norm 0.2333 (-0.33z)| lr 1.25e-04 | 4160.31 ms | 32.5% bf16 MFU | 123308 tok/s step 14769/19560 | loss 3.301998 (-0.51z)| norm 0.2402 (-0.15z)| lr 1.25e-04 | 4325.44 ms | 31.2% bf16 MFU | 123203 tok/s step 14770/19560 | loss 3.343601 (+0.67z)| norm 0.2447 (-0.03z)| lr 1.24e-04 | 4241.27 ms | 31.8% bf16 MFU | 123223 tok/s step 14771/19560 | loss 3.276731 (-1.23z)| norm 0.2556 (+0.25z)| lr 1.24e-04 | 4210.34 ms | 32.1% bf16 MFU | 123288 tok/s step 14772/19560 | loss 3.363611 (+1.22z)| norm 0.2431 (-0.08z)| lr 1.24e-04 | 4287.31 ms | 31.5% bf16 MFU | 123238 tok/s step 14773/19560 | loss 3.387690 (+1.87z)| norm 0.2626 (+0.43z)| lr 1.24e-04 | 4322.22 ms | 31.2% bf16 MFU | 123141 tok/s step 14774/19560 | loss 3.369740 (+1.34z)| norm 0.2467 (+0.01z)| lr 1.24e-04 | 4263.48 ms | 31.7% bf16 MFU | 123133 tok/s step 14775/19560 | loss 3.283401 (-1.04z)| norm 0.2342 (-0.95z)| lr 1.24e-04 | 4293.80 ms | 31.4% bf16 MFU | 123082 tok/s step 14776/19560 | loss 3.276121 (-1.22z)| norm 0.2366 (-0.68z)| lr 1.24e-04 | 4167.20 ms | 32.4% bf16 MFU | 123218 tok/s step 14777/19560 | loss 3.377840 (+1.56z)| norm 0.2520 (+0.98z)| lr 1.24e-04 | 4266.40 ms | 31.6% bf16 MFU | 123202 tok/s step 14778/19560 | loss 3.366388 (+1.22z)| norm 0.2474 (+0.49z)| lr 1.24e-04 | 4276.30 ms | 31.6% bf16 MFU | 123172 tok/s step 14779/19560 | loss 3.303291 (-0.50z)| norm 0.2383 (-0.50z)| 
lr 1.24e-04 | 4246.96 ms | 31.8% bf16 MFU | 123186 tok/s step 14780/19560 | loss 3.272434 (-1.33z)| norm 0.2511 (+0.90z)| lr 1.24e-04 | 4211.52 ms | 32.1% bf16 MFU | 123251 tok/s step 14781/19560 | loss 3.356356 (+0.95z)| norm 0.2542 (+1.22z)| lr 1.24e-04 | 4162.90 ms | 32.4% bf16 MFU | 123385 tok/s step 14782/19560 | loss 3.302720 (-0.51z)| norm 0.2438 (+0.08z)| lr 1.24e-04 | 4177.08 ms | 32.3% bf16 MFU | 123492 tok/s step 14783/19560 | loss 3.311667 (-0.27z)| norm 0.2637 (+2.21z)| lr 1.24e-04 | 4163.36 ms | 32.4% bf16 MFU | 123614 tok/s step 14784/19560 | loss 3.338382 (+0.46z)| norm 0.2474 (+0.44z)| lr 1.24e-04 | 4183.58 ms | 32.3% bf16 MFU | 123699 tok/s step 14785/19560 | loss 3.274353 (-1.29z)| norm 0.2550 (+1.25z)| lr 1.24e-04 | 4244.71 ms | 31.8% bf16 MFU | 123690 tok/s step 14786/19560 | loss 3.271119 (-1.37z)| norm 0.2616 (+1.92z)| lr 1.24e-04 | 4180.74 ms | 32.3% bf16 MFU | 123776 tok/s step 14787/19560 | loss 3.269994 (-1.37z)| norm 0.2436 (+0.01z)| lr 1.24e-04 | 4170.33 ms | 32.4% bf16 MFU | 123873 tok/s step 14788/19560 | loss 3.384327 (+1.71z)| norm 0.2500 (+0.69z)| lr 1.24e-04 | 4212.08 ms | 32.1% bf16 MFU | 123903 tok/s step 14789/19560 | loss 3.409880 (+2.34z)| norm 0.2641 (+2.15z)| lr 1.24e-04 | 4661.12 ms | 29.0% bf16 MFU | 123332 tok/s step 14790/19560 | loss 3.332680 (+0.29z)| norm 0.2577 (+1.45z)| lr 1.23e-04 | 4621.84 ms | 29.2% bf16 MFU | 122837 tok/s step 14791/19560 | loss 3.325826 (+0.09z)| norm 0.2586 (+1.52z)| lr 1.23e-04 | 4669.33 ms | 28.9% bf16 MFU | 122309 tok/s step 14792/19560 | loss 3.319005 (-0.09z)| norm 0.2718 (+2.82z)| lr 1.23e-04 | 4466.96 ms | 30.2% bf16 MFU | 122062 tok/s step 14793/19560 | loss 3.304084 (-0.48z)| norm 0.2589 (+1.48z)| lr 1.23e-04 | 4537.38 ms | 29.8% bf16 MFU | 121737 tok/s step 14794/19560 | loss 3.377783 (+1.50z)| norm 0.2536 (+0.93z)| lr 1.23e-04 | 4297.43 ms | 31.4% bf16 MFU | 121750 tok/s step 14795/19560 | loss 3.339276 (+0.45z)| norm 0.2549 (+1.05z)| lr 1.23e-04 | 4243.52 ms | 31.8% bf16 MFU | 
121840 tok/s step 14796/19560 | loss 3.368196 (+1.23z)| norm 0.2755 (+3.00z)| lr 1.23e-04 | 4177.22 ms | 32.3% bf16 MFU | 122023 tok/s step 14797/19560 | loss 3.296942 (-0.70z)| norm 0.2415 (-0.33z)| lr 1.23e-04 | 4270.09 ms | 31.6% bf16 MFU | 122061 tok/s step 14798/19560 | loss 3.288531 (-0.91z)| norm 0.2527 (+0.76z)| lr 1.23e-04 | 4252.96 ms | 31.7% bf16 MFU | 122122 tok/s step 14799/19560 | loss 3.258517 (-1.69z)| norm 0.2518 (+0.67z)| lr 1.23e-04 | 4301.59 ms | 31.4% bf16 MFU | 122110 tok/s step 14800/19560 | loss 3.304938 (-0.44z)| norm 0.2395 (-0.53z)| lr 1.23e-04 | 4159.10 ms | 32.5% bf16 MFU | 122307 tok/s step 14801/19560 | loss 3.249613 (-1.89z)| norm 0.2411 (-0.38z)| lr 1.23e-04 | 4170.26 ms | 32.4% bf16 MFU | 122478 tok/s step 14802/19560 | loss 3.285314 (-0.93z)| norm 0.2404 (-0.45z)| lr 1.23e-04 | 4215.08 ms | 32.0% bf16 MFU | 122573 tok/s step 14803/19560 | loss 3.246367 (-1.92z)| norm 0.2259 (-1.84z)| lr 1.23e-04 | 4162.30 ms | 32.4% bf16 MFU | 122743 tok/s step 14804/19560 | loss 3.273886 (-1.18z)| norm 0.2567 (+1.13z)| lr 1.23e-04 | 4173.46 ms | 32.4% bf16 MFU | 122887 tok/s step 14805/19560 | loss 3.305038 (-0.36z)| norm 0.2546 (+0.92z)| lr 1.23e-04 | 4205.31 ms | 32.1% bf16 MFU | 122976 tok/s step 14806/19560 | loss 3.325544 (+0.17z)| norm 0.2572 (+1.15z)| lr 1.23e-04 | 4166.58 ms | 32.4% bf16 MFU | 123119 tok/s step 14807/19560 | loss 3.263620 (-1.44z)| norm 0.2278 (-1.67z)| lr 1.23e-04 | 4168.39 ms | 32.4% bf16 MFU | 123252 tok/s step 14808/19560 | loss 3.312330 (-0.17z)| norm 0.2580 (+1.22z)| lr 1.23e-04 | 4218.34 ms | 32.0% bf16 MFU | 123304 tok/s step 14809/19560 | loss 3.255853 (-1.62z)| norm 0.2511 (+0.54z)| lr 1.23e-04 | 4260.89 ms | 31.7% bf16 MFU | 123291 tok/s step 14810/19560 | loss 3.221653 (-2.44z)| norm 0.2373 (-0.79z)| lr 1.23e-04 | 4234.51 ms | 31.9% bf16 MFU | 123317 tok/s step 14811/19560 | loss 3.370605 (+1.35z)| norm 0.2522 (+0.63z)| lr 1.22e-04 | 4242.64 ms | 31.8% bf16 MFU | 123330 tok/s step 14812/19560 | loss 3.251338 
(-1.65z)| norm 0.2376 (-0.78z)| lr 1.22e-04 | 4193.46 ms | 32.2% bf16 MFU | 123415 tok/s step 14813/19560 | loss 3.266056 (-1.26z)| norm 0.2262 (-1.87z)| lr 1.22e-04 | 4163.72 ms | 32.4% bf16 MFU | 123540 tok/s step 14814/19560 | loss 3.297142 (-0.48z)| norm 0.2376 (-0.78z)| lr 1.22e-04 | 4163.79 ms | 32.4% bf16 MFU | 123659 tok/s step 14815/19560 | loss 3.335279 (+0.47z)| norm 0.2454 (-0.03z)| lr 1.22e-04 | 4189.31 ms | 32.2% bf16 MFU | 123733 tok/s step 14816/19560 | loss 3.301713 (-0.37z)| norm 0.2303 (-1.46z)| lr 1.22e-04 | 4173.15 ms | 32.4% bf16 MFU | 123828 tok/s step 14817/19560 | loss 3.447521 (+3.12z)| norm 0.2698 (+2.28z)| lr 1.22e-04 | 4165.39 ms | 32.4% bf16 MFU | 123930 tok/s step 14818/19560 | loss 3.310084 (-0.18z)| norm 0.2489 (+0.28z)| lr 1.22e-04 | 4158.74 ms | 32.5% bf16 MFU | 124037 tok/s step 14819/19560 | loss 3.278182 (-0.94z)| norm 0.2482 (+0.21z)| lr 1.22e-04 | 4162.81 ms | 32.4% bf16 MFU | 124133 tok/s step 14820/19560 | loss 3.320441 (+0.09z)| norm 0.2281 (-1.70z)| lr 1.22e-04 | 4211.24 ms | 32.1% bf16 MFU | 124151 tok/s step 14821/19560 | loss 3.282929 (-0.81z)| norm 0.2549 (+0.84z)| lr 1.22e-04 | 4176.06 ms | 32.3% bf16 MFU | 124221 tok/s step 14822/19560 | loss 3.280732 (-0.85z)| norm 0.2342 (-1.12z)| lr 1.22e-04 | 4204.92 ms | 32.1% bf16 MFU | 124244 tok/s step 14823/19560 | loss 3.281281 (-0.82z)| norm 0.2411 (-0.46z)| lr 1.22e-04 | 4165.12 ms | 32.4% bf16 MFU | 124325 tok/s step 14824/19560 | loss 3.295442 (-0.47z)| norm 0.2405 (-0.50z)| lr 1.22e-04 | 4160.87 ms | 32.4% bf16 MFU | 124409 tok/s step 14825/19560 | loss 3.273428 (-0.99z)| norm 0.2392 (-0.63z)| lr 1.22e-04 | 4314.50 ms | 31.3% bf16 MFU | 124265 tok/s step 14826/19560 | loss 3.320906 (+0.18z)| norm 0.2565 (+1.02z)| lr 1.22e-04 | 4194.47 ms | 32.2% bf16 MFU | 124301 tok/s step 14827/19560 | loss 3.327682 (+0.35z)| norm 0.2348 (-1.02z)| lr 1.22e-04 | 4167.63 ms | 32.4% bf16 MFU | 124376 tok/s step 14828/19560 | loss 3.323679 (+0.25z)| norm 0.2448 (-0.08z)| lr 1.22e-04 | 
4166.43 ms | 32.4% bf16 MFU | 124449 tok/s step 14829/19560 | loss 3.288408 (-0.61z)| norm 0.2427 (-0.27z)| lr 1.22e-04 | 4225.96 ms | 31.9% bf16 MFU | 124430 tok/s step 14830/19560 | loss 3.304386 (-0.22z)| norm 0.2376 (-0.74z)| lr 1.22e-04 | 4172.58 ms | 32.4% bf16 MFU | 124491 tok/s step 14831/19560 | loss 3.354850 (+1.01z)| norm 0.2858 (+3.62z)| lr 1.21e-04 | 4159.84 ms | 32.5% bf16 MFU | 124568 tok/s step 14832/19560 | loss 3.341018 (+0.66z)| norm 0.2622 (+1.49z)| lr 1.21e-04 | 4166.46 ms | 32.4% bf16 MFU | 124632 tok/s step 14833/19560 | loss 3.307975 (-0.17z)| norm 0.2407 (-0.44z)| lr 1.21e-04 | 4171.35 ms | 32.4% bf16 MFU | 124684 tok/s step 14834/19560 | loss 3.239468 (-1.84z)| norm 0.2659 (+1.79z)| lr 1.21e-04 | 4166.75 ms | 32.4% bf16 MFU | 124741 tok/s step 14835/19560 | loss 3.310097 (-0.09z)| norm 0.2427 (-0.28z)| lr 1.21e-04 | 4169.07 ms | 32.4% bf16 MFU | 124792 tok/s step 14836/19560 | loss 3.251359 (-1.51z)| norm 0.2473 (+0.14z)| lr 1.21e-04 | 4172.45 ms | 32.4% bf16 MFU | 124835 tok/s step 14837/19560 | loss 3.262079 (-1.25z)| norm 0.2321 (-1.23z)| lr 1.21e-04 | 4262.50 ms | 31.7% bf16 MFU | 124744 tok/s step 14838/19560 | loss 3.248342 (-1.56z)| norm 0.2495 (+0.34z)| lr 1.21e-04 | 4174.62 ms | 32.3% bf16 MFU | 124786 tok/s step 14839/19560 | loss 3.226622 (-2.05z)| norm 0.2371 (-0.78z)| lr 1.21e-04 | 4165.76 ms | 32.4% bf16 MFU | 124839 tok/s step 14840/19560 | loss 3.266146 (-1.08z)| norm 0.2288 (-1.53z)| lr 1.21e-04 | 4170.00 ms | 32.4% bf16 MFU | 124884 tok/s step 14841/19560 | loss 3.310491 (-0.02z)| norm 0.2399 (-0.53z)| lr 1.21e-04 | 4179.27 ms | 32.3% bf16 MFU | 124912 tok/s step 14842/19560 | loss 3.236307 (-1.78z)| norm 0.2330 (-1.14z)| lr 1.21e-04 | 4191.79 ms | 32.2% bf16 MFU | 124920 tok/s step 14843/19560 | loss 3.277922 (-0.78z)| norm 0.2466 (+0.09z)| lr 1.21e-04 | 4170.98 ms | 32.4% bf16 MFU | 124959 tok/s step 14844/19560 | loss 3.315063 (+0.12z)| norm 0.2356 (-0.90z)| lr 1.21e-04 | 4163.80 ms | 32.4% bf16 MFU | 125007 tok/s step 
14845/19560 | loss 3.318244 (+0.20z)| norm 0.2277 (-1.59z)| lr 1.21e-04 | 4175.10 ms | 32.3% bf16 MFU | 125035 tok/s step 14846/19560 | loss 3.302001 (-0.19z)| norm 0.2449 (-0.04z)| lr 1.21e-04 | 4170.98 ms | 32.4% bf16 MFU | 125069 tok/s step 14847/19560 | loss 3.272252 (-0.89z)| norm 0.2591 (+1.20z)| lr 1.21e-04 | 4334.18 ms | 31.2% bf16 MFU | 124863 tok/s step 14848/19560 | loss 3.328138 (+0.46z)| norm 0.2286 (-1.51z)| lr 1.21e-04 | 4176.74 ms | 32.3% bf16 MFU | 124897 tok/s step 14849/19560 | loss 3.356058 (+1.12z)| norm 0.2493 (+0.33z)| lr 1.21e-04 | 4172.78 ms | 32.4% bf16 MFU | 124934 tok/s step 14850/19560 | loss 3.328589 (+0.46z)| norm 0.2458 (+0.02z)| lr 1.21e-04 | 4161.65 ms | 32.4% bf16 MFU | 124986 tok/s step 14851/19560 | loss 3.258525 (-1.21z)| norm 0.2318 (-1.24z)| lr 1.21e-04 | 4219.47 ms | 32.0% bf16 MFU | 124950 tok/s step 14852/19560 | loss 3.302751 (-0.15z)| norm 0.2232 (-1.99z)| lr 1.20e-04 | 4172.85 ms | 32.4% bf16 MFU | 124984 tok/s step 14853/19560 | loss 3.321781 (+0.31z)| norm 0.2573 (+1.04z)| lr 1.20e-04 | 4168.34 ms | 32.4% bf16 MFU | 125024 tok/s step 14854/19560 | loss 3.283890 (-0.60z)| norm 0.2264 (-1.69z)| lr 1.20e-04 | 4174.54 ms | 32.3% bf16 MFU | 125052 tok/s step 14855/19560 | loss 3.307815 (-0.03z)| norm 0.2302 (-1.34z)| lr 1.20e-04 | 4174.50 ms | 32.3% bf16 MFU | 125080 tok/s step 14856/19560 | loss 3.283427 (-0.62z)| norm 0.2380 (-0.65z)| lr 1.20e-04 | 4177.83 ms | 32.3% bf16 MFU | 125100 tok/s step 14857/19560 | loss 3.334667 (+0.62z)| norm 0.2281 (-1.50z)| lr 1.20e-04 | 4169.65 ms | 32.4% bf16 MFU | 125132 tok/s step 14858/19560 | loss 3.327836 (+0.46z)| norm 0.2395 (-0.50z)| lr 1.20e-04 | 4162.06 ms | 32.4% bf16 MFU | 125174 tok/s step 14859/19560 | loss 3.277313 (-0.76z)| norm 0.2281 (-1.48z)| lr 1.20e-04 | 4181.49 ms | 32.3% bf16 MFU | 125184 tok/s step 14860/19560 | loss 3.288844 (-0.46z)| norm 0.2439 (-0.12z)| lr 1.20e-04 | 4161.51 ms | 32.4% bf16 MFU | 125224 tok/s step 14861/19560 | loss 3.274084 (-0.83z)| norm 
0.2386 (-0.60z)| lr 1.20e-04 | 4163.99 ms | 32.4% bf16 MFU | 125259 tok/s step 14862/19560 | loss 3.290176 (-0.43z)| norm 0.2324 (-1.14z)| lr 1.20e-04 | 4172.10 ms | 32.4% bf16 MFU | 125279 tok/s step 14863/19560 | loss 3.258418 (-1.19z)| norm 0.2369 (-0.75z)| lr 1.20e-04 | 4177.21 ms | 32.3% bf16 MFU | 125291 tok/s step 14864/19560 | loss 3.250396 (-1.39z)| norm 0.2471 (+0.15z)| lr 1.20e-04 | 4178.67 ms | 32.3% bf16 MFU | 125300 tok/s step 14865/19560 | loss 3.278792 (-0.68z)| norm 0.2361 (-0.82z)| lr 1.20e-04 | 4173.92 ms | 32.3% bf16 MFU | 125315 tok/s step 14866/19560 | loss 3.252941 (-1.30z)| norm 0.2456 (+0.02z)| lr 1.20e-04 | 4174.80 ms | 32.3% bf16 MFU | 125328 tok/s step 14867/19560 | loss 3.312196 (+0.18z)| norm 0.2381 (-0.64z)| lr 1.20e-04 | 4171.15 ms | 32.4% bf16 MFU | 125347 tok/s step 14868/19560 | loss 3.336752 (+0.78z)| norm 0.2661 (+1.80z)| lr 1.20e-04 | 4167.18 ms | 32.4% bf16 MFU | 125370 tok/s step 14869/19560 | loss 3.370819 (+1.60z)| norm 0.2737 (+2.39z)| lr 1.20e-04 | 4175.55 ms | 32.3% bf16 MFU | 125380 tok/s step 14870/19560 | loss 3.376732 (+1.72z)| norm 0.2411 (-0.38z)| lr 1.20e-04 | 4179.82 ms | 32.3% bf16 MFU | 125382 tok/s step 14871/19560 | loss 3.287438 (-0.45z)| norm 0.2530 (+0.63z)| lr 1.20e-04 | 4169.74 ms | 32.4% bf16 MFU | 125400 tok/s step 14872/19560 | loss 3.328472 (+0.54z)| norm 0.2436 (-0.18z)| lr 1.20e-04 | 4166.78 ms | 32.4% bf16 MFU | 125421 tok/s step 14873/19560 | loss 3.306959 (+0.01z)| norm 0.2400 (-0.48z)| lr 1.19e-04 | 4194.64 ms | 32.2% bf16 MFU | 125400 tok/s step 14874/19560 | loss 3.343149 (+0.90z)| norm 0.2299 (-1.35z)| lr 1.19e-04 | 4171.57 ms | 32.4% bf16 MFU | 125414 tok/s step 14875/19560 | loss 3.276121 (-0.76z)| norm 0.2427 (-0.25z)| lr 1.19e-04 | 4171.10 ms | 32.4% bf16 MFU | 125428 tok/s step 14876/19560 | loss 3.266416 (-0.98z)| norm 0.2458 (+0.02z)| lr 1.19e-04 | 4175.66 ms | 32.3% bf16 MFU | 125434 tok/s step 14877/19560 | loss 3.278048 (-0.70z)| norm 0.2280 (-1.47z)| lr 1.19e-04 | 4177.44 ms | 
32.3% bf16 MFU | 125438 tok/s step 14878/19560 | loss 3.279335 (-0.66z)| norm 0.2430 (-0.19z)| lr 1.19e-04 | 4177.85 ms | 32.3% bf16 MFU | 125441 tok/s step 14879/19560 | loss 3.282871 (-0.56z)| norm 0.2224 (-1.91z)| lr 1.19e-04 | 4181.29 ms | 32.3% bf16 MFU | 125438 tok/s step 14880/19560 | loss 3.441613 (+3.19z)| norm 0.2605 (+1.28z)| lr 1.19e-04 | 4172.35 ms | 32.4% bf16 MFU | 125449 tok/s step 14881/19560 | loss 3.292359 (-0.33z)| norm 0.2383 (-0.56z)| lr 1.19e-04 | 4248.32 ms | 31.8% bf16 MFU | 125347 tok/s step 14882/19560 | loss 3.373058 (+1.58z)| norm 0.2375 (-0.63z)| lr 1.19e-04 | 4168.25 ms | 32.4% bf16 MFU | 125369 tok/s step 14883/19560 | loss 3.296234 (-0.25z)| norm 0.2411 (-0.32z)| lr 1.19e-04 | 4173.07 ms | 32.4% bf16 MFU | 125382 tok/s step 14884/19560 | loss 3.247511 (-1.40z)| norm 0.2419 (-0.25z)| lr 1.19e-04 | 4179.56 ms | 32.3% bf16 MFU | 125385 tok/s step 14885/19560 | loss 3.227104 (-1.85z)| norm 0.2466 (+0.15z)| lr 1.19e-04 | 4169.60 ms | 32.4% bf16 MFU | 125403 tok/s step 14886/19560 | loss 3.258215 (-1.10z)| norm 0.2244 (-1.70z)| lr 1.19e-04 | 4172.83 ms | 32.4% bf16 MFU | 125415 tok/s step 14887/19560 | loss 3.275443 (-0.69z)| norm 0.2494 (+0.39z)| lr 1.19e-04 | 4161.63 ms | 32.4% bf16 MFU | 125443 tok/s step 14888/19560 | loss 3.263087 (-0.97z)| norm 0.2278 (-1.40z)| lr 1.19e-04 | 4166.94 ms | 32.4% bf16 MFU | 125462 tok/s step 14889/19560 | loss 3.232807 (-1.64z)| norm 0.2339 (-0.89z)| lr 1.19e-04 | 4169.19 ms | 32.4% bf16 MFU | 125477 tok/s step 14890/19560 | loss 3.291158 (-0.30z)| norm 0.2201 (-1.98z)| lr 1.19e-04 | 4167.00 ms | 32.4% bf16 MFU | 125494 tok/s step 14891/19560 | loss 3.292413 (-0.26z)| norm 0.2287 (-1.26z)| lr 1.19e-04 | 4166.28 ms | 32.4% bf16 MFU | 125511 tok/s step 14892/19560 | loss 3.276343 (-0.63z)| norm 0.2422 (-0.16z)| lr 1.19e-04 | 4176.42 ms | 32.3% bf16 MFU | 125512 tok/s step 14893/19560 | loss 3.306537 (+0.08z)| norm 0.2256 (-1.50z)| lr 1.19e-04 | 4175.33 ms | 32.3% bf16 MFU | 125515 tok/s step 14894/19560 
| loss 3.370222 (+1.56z)| norm 0.2407 (-0.27z)| lr 1.18e-04 | 4168.41 ms | 32.4% bf16 MFU | 125528 tok/s step 14895/19560 | loss 3.282967 (-0.46z)| norm 0.2257 (-1.47z)| lr 1.18e-04 | 4176.72 ms | 32.3% bf16 MFU | 125528 tok/s step 14896/19560 | loss 3.243738 (-1.36z)| norm 0.2403 (-0.30z)| lr 1.18e-04 | 4165.15 ms | 32.4% bf16 MFU | 125545 tok/s step 14897/19560 | loss 3.250316 (-1.19z)| norm 0.2367 (-0.59z)| lr 1.18e-04 | 4179.53 ms | 32.3% bf16 MFU | 125540 tok/s step 14898/19560 | loss 3.256127 (-1.04z)| norm 0.2325 (-0.91z)| lr 1.18e-04 | 4167.45 ms | 32.4% bf16 MFU | 125554 tok/s step 14899/19560 | loss 3.301485 (-0.00z)| norm 0.2396 (-0.33z)| lr 1.18e-04 | 4162.27 ms | 32.4% bf16 MFU | 125574 tok/s step 14900/19560 | loss 3.340405 (+0.90z)| norm 0.2416 (-0.17z)| lr 1.18e-04 | 4175.91 ms | 32.3% bf16 MFU | 125573 tok/s step 14901/19560 | loss 3.297522 (-0.07z)| norm 0.2573 (+1.10z)| lr 1.18e-04 | 4166.86 ms | 32.4% bf16 MFU | 125585 tok/s step 14902/19560 | loss 3.292489 (-0.18z)| norm 0.2444 (+0.06z)| lr 1.18e-04 | 4169.69 ms | 32.4% bf16 MFU | 125593 tok/s step 14903/19560 | loss 3.277376 (-0.54z)| norm 0.2478 (+0.33z)| lr 1.18e-04 | 4172.53 ms | 32.4% bf16 MFU | 125596 tok/s step 14904/19560 | loss 3.337900 (+0.88z)| norm 0.2494 (+0.45z)| lr 1.18e-04 | 4178.96 ms | 32.3% bf16 MFU | 125589 tok/s step 14905/19560 | loss 3.260939 (-0.92z)| norm 0.2520 (+0.66z)| lr 1.18e-04 | 4233.80 ms | 31.9% bf16 MFU | 125501 tok/s step 14906/19560 | loss 3.348524 (+1.18z)| norm 0.2511 (+0.58z)| lr 1.18e-04 | 4158.46 ms | 32.5% bf16 MFU | 125530 tok/s step 14907/19560 | loss 3.235323 (-1.51z)| norm 0.2356 (-0.67z)| lr 1.18e-04 | 4164.67 ms | 32.4% bf16 MFU | 125548 tok/s step 14908/19560 | loss 3.317618 (+0.44z)| norm 0.2641 (+1.61z)| lr 1.18e-04 | 4170.65 ms | 32.4% bf16 MFU | 125556 tok/s step 14909/19560 | loss 3.267937 (-0.73z)| norm 0.2310 (-1.03z)| lr 1.18e-04 | 4165.47 ms | 32.4% bf16 MFU | 125572 tok/s step 14910/19560 | loss 3.347136 (+1.15z)| norm 0.2561 (+0.98z)| 
lr 1.18e-04 | 4163.94 ms | 32.4% bf16 MFU | 125589 tok/s step 14911/19560 | loss 3.328083 (+0.69z)| norm 0.2389 (-0.39z)| lr 1.18e-04 | 4178.64 ms | 32.3% bf16 MFU | 125583 tok/s step 14912/19560 | loss 3.260880 (-0.89z)| norm 0.2452 (+0.12z)| lr 1.18e-04 | 4160.52 ms | 32.5% bf16 MFU | 125604 tok/s step 14913/19560 | loss 3.306584 (+0.19z)| norm 0.2235 (-1.59z)| lr 1.18e-04 | 4162.58 ms | 32.4% bf16 MFU | 125622 tok/s step 14914/19560 | loss 3.294097 (-0.11z)| norm 0.2403 (-0.24z)| lr 1.17e-04 | 4180.80 ms | 32.3% bf16 MFU | 125611 tok/s step 14915/19560 | loss 3.303410 (+0.10z)| norm 0.2217 (-1.71z)| lr 1.17e-04 | 4163.73 ms | 32.4% bf16 MFU | 125626 tok/s step 14916/19560 | loss 3.300585 (+0.05z)| norm 0.2244 (-1.46z)| lr 1.17e-04 | 4165.39 ms | 32.4% bf16 MFU | 125638 tok/s step 14917/19560 | loss 3.262351 (-0.87z)| norm 0.2311 (-0.92z)| lr 1.17e-04 | 4171.81 ms | 32.4% bf16 MFU | 125640 tok/s step 14918/19560 | loss 3.238739 (-1.44z)| norm 0.2280 (-1.15z)| lr 1.17e-04 | 4172.82 ms | 32.4% bf16 MFU | 125640 tok/s step 14919/19560 | loss 3.280668 (-0.39z)| norm 0.2478 (+0.44z)| lr 1.17e-04 | 4177.62 ms | 32.3% bf16 MFU | 125633 tok/s step 14920/19560 | loss 3.282567 (-0.33z)| norm 0.2569 (+1.20z)| lr 1.17e-04 | 4161.50 ms | 32.4% bf16 MFU | 125651 tok/s step 14921/19560 | loss 3.361713 (+1.61z)| norm 0.2530 (+0.88z)| lr 1.17e-04 | 4162.18 ms | 32.4% bf16 MFU | 125666 tok/s step 14922/19560 | loss 3.215392 (-1.96z)| norm 0.2375 (-0.38z)| lr 1.17e-04 | 4158.03 ms | 32.5% bf16 MFU | 125688 tok/s step 14923/19560 | loss 3.319209 (+0.60z)| norm 0.2436 (+0.14z)| lr 1.17e-04 | 4177.86 ms | 32.3% bf16 MFU | 125678 tok/s step 14924/19560 | loss 3.273125 (-0.53z)| norm 0.2447 (+0.25z)| lr 1.17e-04 | 4164.43 ms | 32.4% bf16 MFU | 125689 tok/s step 14925/19560 | loss 3.281927 (-0.31z)| norm 0.2474 (+0.48z)| lr 1.17e-04 | 4169.09 ms | 32.4% bf16 MFU | 125692 tok/s step 14926/19560 | loss 3.252350 (-1.03z)| norm 0.2339 (-0.66z)| lr 1.17e-04 | 4160.61 ms | 32.5% bf16 MFU | 
125708 tok/s step 14927/19560 | loss 3.289014 (-0.13z)| norm 0.2336 (-0.68z)| lr 1.17e-04 | 4174.55 ms | 32.3% bf16 MFU | 125702 tok/s step 14928/19560 | loss 3.281012 (-0.32z)| norm 0.2410 (-0.04z)| lr 1.17e-04 | 4160.25 ms | 32.5% bf16 MFU | 125718 tok/s step 14929/19560 | loss 3.309385 (+0.38z)| norm 0.2359 (-0.47z)| lr 1.17e-04 | 4160.29 ms | 32.5% bf16 MFU | 125733 tok/s step 14930/19560 | loss 3.230912 (-1.57z)| norm 0.2369 (-0.39z)| lr 1.17e-04 | 4170.63 ms | 32.4% bf16 MFU | 125732 tok/s step 14931/19560 | loss 3.291623 (-0.07z)| norm 0.2385 (-0.26z)| lr 1.17e-04 | 4167.86 ms | 32.4% bf16 MFU | 125735 tok/s step 14932/19560 | loss 3.253718 (-1.01z)| norm 0.2256 (-1.34z)| lr 1.17e-04 | 4165.75 ms | 32.4% bf16 MFU | 125741 tok/s step 14933/19560 | loss 3.302857 (+0.22z)| norm 0.2362 (-0.42z)| lr 1.17e-04 | 4161.79 ms | 32.4% bf16 MFU | 125753 tok/s step 14934/19560 | loss 3.230108 (-1.56z)| norm 0.2398 (-0.11z)| lr 1.17e-04 | 4168.67 ms | 32.4% bf16 MFU | 125754 tok/s step 14935/19560 | loss 3.295160 (+0.04z)| norm 0.2200 (-1.81z)| lr 1.16e-04 | 4170.13 ms | 32.4% bf16 MFU | 125752 tok/s step 14936/19560 | loss 3.343908 (+1.23z)| norm 0.2546 (+1.18z)| lr 1.16e-04 | 4169.79 ms | 32.4% bf16 MFU | 125752 tok/s step 14937/19560 | loss 3.324744 (+0.75z)| norm 0.2527 (+1.02z)| lr 1.16e-04 | 4164.84 ms | 32.4% bf16 MFU | 125758 tok/s step 14938/19560 | loss 3.349584 (+1.34z)| norm 0.2473 (+0.55z)| lr 1.16e-04 | 4173.56 ms | 32.4% bf16 MFU | 125751 tok/s step 14939/19560 | loss 3.325808 (+0.77z)| norm 0.2436 (+0.24z)| lr 1.16e-04 | 4156.83 ms | 32.5% bf16 MFU | 125770 tok/s step 14940/19560 | loss 3.298608 (+0.08z)| norm 0.2734 (+2.71z)| lr 1.16e-04 | 4174.99 ms | 32.3% bf16 MFU | 125761 tok/s step 14941/19560 | loss 3.342067 (+1.16z)| norm 0.2818 (+3.25z)| lr 1.16e-04 | 4164.27 ms | 32.4% bf16 MFU | 125768 tok/s step 14942/19560 | loss 3.255974 (-0.99z)| norm 0.2537 (+0.97z)| lr 1.16e-04 | 4167.48 ms | 32.4% bf16 MFU | 125769 tok/s step 14943/19560 | loss 3.289814 
(-0.14z)| norm 0.2508 (+0.72z)| lr 1.16e-04 | 4165.14 ms | 32.4% bf16 MFU | 125775 tok/s step 14944/19560 | loss 3.447361 (+3.60z)| norm 0.2798 (+2.94z)| lr 1.16e-04 | 4166.61 ms | 32.4% bf16 MFU | 125778 tok/s step 14945/19560 | loss 3.306246 (+0.27z)| norm 0.2552 (+1.04z)| lr 1.16e-04 | 4165.10 ms | 32.4% bf16 MFU | 125782 tok/s step 14946/19560 | loss 3.304013 (+0.22z)| norm 0.2400 (-0.16z)| lr 1.16e-04 | 4165.45 ms | 32.4% bf16 MFU | 125787 tok/s step 14947/19560 | loss 3.263302 (-0.80z)| norm 0.2569 (+1.17z)| lr 1.16e-04 | 4166.65 ms | 32.4% bf16 MFU | 125789 tok/s step 14948/19560 | loss 3.316457 (+0.53z)| norm 0.2522 (+0.78z)| lr 1.16e-04 | 4172.93 ms | 32.4% bf16 MFU | 125781 tok/s step 14949/19560 | loss 3.304407 (+0.23z)| norm 0.2346 (-0.59z)| lr 1.16e-04 | 4165.26 ms | 32.4% bf16 MFU | 125786 tok/s step 14950/19560 | loss 3.304151 (+0.22z)| norm 0.2461 (+0.31z)| lr 1.16e-04 | 4173.98 ms | 32.3% bf16 MFU | 125777 tok/s step 14951/19560 | loss 3.259151 (-0.91z)| norm 0.2365 (-0.45z)| lr 1.16e-04 | 4173.48 ms | 32.4% bf16 MFU | 125769 tok/s step 14952/19560 | loss 3.310029 (+0.36z)| norm 0.2343 (-0.62z)| lr 1.16e-04 | 4169.57 ms | 32.4% bf16 MFU | 125768 tok/s step 14953/19560 | loss 3.304909 (+0.23z)| norm 0.2366 (-0.43z)| lr 1.16e-04 | 4162.64 ms | 32.4% bf16 MFU | 125777 tok/s step 14954/19560 | loss 3.291549 (-0.10z)| norm 0.2540 (+0.95z)| lr 1.16e-04 | 4171.52 ms | 32.4% bf16 MFU | 125772 tok/s step 14955/19560 | loss 3.323551 (+0.71z)| norm 0.2281 (-1.10z)| lr 1.16e-04 | 4169.28 ms | 32.4% bf16 MFU | 125771 tok/s step 14956/19560 | loss 3.281435 (-0.34z)| norm 0.2468 (+0.38z)| lr 1.15e-04 | 4169.79 ms | 32.4% bf16 MFU | 125769 tok/s step 14957/19560 | loss 3.235151 (-1.49z)| norm 0.2274 (-1.14z)| lr 1.15e-04 | 4166.56 ms | 32.4% bf16 MFU | 125773 tok/s step 14958/19560 | loss 3.290785 (-0.10z)| norm 0.2457 (+0.29z)| lr 1.15e-04 | 4221.03 ms | 32.0% bf16 MFU | 125694 tok/s step 14959/19560 | loss 3.314458 (+0.51z)| norm 0.2349 (-0.55z)| lr 1.15e-04 | 
4296.45 ms | 31.4% bf16 MFU | 125511 tok/s step 14960/19560 | loss 3.301357 (+0.19z)| norm 0.2323 (-0.76z)| lr 1.15e-04 | 4247.10 ms | 31.8% bf16 MFU | 125408 tok/s step 14961/19560 | loss 3.267580 (-0.66z)| norm 0.2491 (+0.64z)| lr 1.15e-04 | 4282.83 ms | 31.5% bf16 MFU | 125258 tok/s step 14962/19560 | loss 3.262394 (-0.80z)| norm 0.2398 (-0.12z)| lr 1.15e-04 | 4185.49 ms | 32.3% bf16 MFU | 125258 tok/s step 14963/19560 | loss 3.296126 (+0.06z)| norm 0.2456 (+0.37z)| lr 1.15e-04 | 4260.03 ms | 31.7% bf16 MFU | 125149 tok/s step 14964/19560 | loss 3.208805 (-2.12z)| norm 0.2414 (+0.02z)| lr 1.15e-04 | 4159.71 ms | 32.5% bf16 MFU | 125194 tok/s step 14965/19560 | loss 3.272192 (-0.53z)| norm 0.2360 (-0.45z)| lr 1.15e-04 | 4162.54 ms | 32.4% bf16 MFU | 125232 tok/s step 14966/19560 | loss 3.295148 (+0.03z)| norm 0.2486 (+0.62z)| lr 1.15e-04 | 4166.80 ms | 32.4% bf16 MFU | 125261 tok/s step 14967/19560 | loss 3.225574 (-1.72z)| norm 0.2326 (-0.73z)| lr 1.15e-04 | 4154.36 ms | 32.5% bf16 MFU | 125308 tok/s step 14968/19560 | loss 3.270022 (-0.60z)| norm 0.2428 (+0.13z)| lr 1.15e-04 | 4156.38 ms | 32.5% bf16 MFU | 125350 tok/s step 14969/19560 | loss 3.277212 (-0.41z)| norm 0.2329 (-0.71z)| lr 1.15e-04 | 4150.56 ms | 32.5% bf16 MFU | 125398 tok/s step 14970/19560 | loss 3.306395 (+0.31z)| norm 0.2248 (-1.38z)| lr 1.15e-04 | 4161.85 ms | 32.4% bf16 MFU | 125427 tok/s step 14971/19560 | loss 3.321013 (+0.67z)| norm 0.2317 (-0.79z)| lr 1.15e-04 | 4160.81 ms | 32.4% bf16 MFU | 125456 tok/s step 14972/19560 | loss 3.301945 (+0.19z)| norm 0.2433 (+0.19z)| lr 1.15e-04 | 4155.20 ms | 32.5% bf16 MFU | 125492 tok/s step 14973/19560 | loss 3.268596 (-0.65z)| norm 0.2203 (-1.74z)| lr 1.15e-04 | 4153.21 ms | 32.5% bf16 MFU | 125529 tok/s step 14974/19560 | loss 3.296160 (+0.06z)| norm 0.2365 (-0.37z)| lr 1.15e-04 | 4158.63 ms | 32.5% bf16 MFU | 125556 tok/s step 14975/19560 | loss 3.313615 (+0.49z)| norm 0.2514 (+0.88z)| lr 1.15e-04 | 4161.82 ms | 32.4% bf16 MFU | 125577 tok/s step 
14976/19560 | loss 3.385207 (+2.26z)| norm 0.2394 (-0.14z)| lr 1.15e-04 | 4166.62 ms | 32.4% bf16 MFU | 125590 tok/s step 14977/19560 | loss 3.279402 (-0.37z)| norm 0.2331 (-0.66z)| lr 1.14e-04 | 4157.60 ms | 32.5% bf16 MFU | 125616 tok/s step 14978/19560 | loss 3.327006 (+0.83z)| norm 0.2453 (+0.37z)| lr 1.14e-04 | 4173.09 ms | 32.4% bf16 MFU | 125617 tok/s step 14979/19560 | loss 3.313569 (+0.48z)| norm 0.2280 (-1.08z)| lr 1.14e-04 | 4160.77 ms | 32.5% bf16 MFU | 125636 tok/s step 14980/19560 | loss 3.318613 (+0.60z)| norm 0.2549 (+1.17z)| lr 1.14e-04 | 5209.57 ms | 25.9% bf16 MFU | 124386 tok/s step 14981/19560 | loss 3.270776 (-0.59z)| norm 0.2218 (-1.61z)| lr 1.14e-04 | 4576.38 ms | 29.5% bf16 MFU | 123895 tok/s step 14982/19560 | loss 3.282663 (-0.29z)| norm 0.2343 (-0.56z)| lr 1.14e-04 | 4470.71 ms | 30.2% bf16 MFU | 123564 tok/s step 14983/19560 | loss 3.298535 (+0.11z)| norm 0.2458 (+0.41z)| lr 1.14e-04 | 4377.07 ms | 30.8% bf16 MFU | 123375 tok/s step 14984/19560 | loss 3.284195 (-0.25z)| norm 0.2360 (-0.43z)| lr 1.14e-04 | 4410.59 ms | 30.6% bf16 MFU | 123150 tok/s step 14985/19560 | loss 3.260100 (-0.85z)| norm 0.2446 (+0.29z)| lr 1.14e-04 | 4317.08 ms | 31.3% bf16 MFU | 123065 tok/s step 14986/19560 | loss 3.267323 (-0.65z)| norm 0.2418 (+0.06z)| lr 1.14e-04 | 4295.48 ms | 31.4% bf16 MFU | 123014 tok/s step 14987/19560 | loss 3.326469 (+0.83z)| norm 0.2734 (+2.66z)| lr 1.14e-04 | 4223.30 ms | 32.0% bf16 MFU | 123070 tok/s step 14988/19560 | loss 3.283553 (-0.25z)| norm 0.2341 (-0.61z)| lr 1.14e-04 | 4262.85 ms | 31.7% bf16 MFU | 123066 tok/s step 14989/19560 | loss 3.318972 (+0.63z)| norm 0.2306 (-0.90z)| lr 1.14e-04 | 4203.69 ms | 32.1% bf16 MFU | 123149 tok/s step 14990/19560 | loss 3.274051 (-0.50z)| norm 0.2362 (-0.43z)| lr 1.14e-04 | 4157.08 ms | 32.5% bf16 MFU | 123298 tok/s step 14991/19560 | loss 3.329816 (+0.90z)| norm 0.2377 (-0.30z)| lr 1.14e-04 | 4167.85 ms | 32.4% bf16 MFU | 123422 tok/s step 14992/19560 | loss 3.246136 (-1.21z)| norm 
0.2291 (-1.01z)| lr 1.14e-04 | 4335.02 ms | 31.1% bf16 MFU | 123298 tok/s step 14993/19560 | loss 3.317998 (+0.59z)| norm 0.2359 (-0.44z)| lr 1.14e-04 | 4161.08 ms | 32.4% bf16 MFU | 123433 tok/s step 14994/19560 | loss 3.259545 (-0.88z)| norm 0.2307 (-0.87z)| lr 1.14e-04 | 4226.85 ms | 31.9% bf16 MFU | 123464 tok/s step 14995/19560 | loss 3.308157 (+0.34z)| norm 0.2281 (-1.07z)| lr 1.14e-04 | 4479.16 ms | 30.1% bf16 MFU | 123143 tok/s step 14996/19560 | loss 3.327495 (+0.83z)| norm 0.2401 (-0.07z)| lr 1.14e-04 | 4154.90 ms | 32.5% bf16 MFU | 123295 tok/s step 14997/19560 | loss 3.267059 (-0.68z)| norm 0.2275 (-1.12z)| lr 1.14e-04 | 4199.21 ms | 32.2% bf16 MFU | 123373 tok/s step 14998/19560 | loss 3.329692 (+0.94z)| norm 0.2321 (-0.71z)| lr 1.13e-04 | 4213.08 ms | 32.0% bf16 MFU | 123427 tok/s step 14999/19560 | loss 3.357960 (+1.65z)| norm 0.2357 (-0.39z)| lr 1.13e-04 | 4158.31 ms | 32.5% bf16 MFU | 123559 tok/s step 15000/19560 | loss 3.228058 (-1.66z)| norm 0.2222 (-1.52z)| lr 1.13e-04 | 4254.91 ms | 31.7% bf16 MFU | 123542 tok/s val loss 3.283647 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating 
HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 
940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3056/10042 = 0.304322 Writing checkpoint at step 15000 Writing model to log124M/model_00015000.bin Writing state to log124M/state_00015000_00000.bin step 15001/19560 | loss 3.239987 (-1.33z)| norm 0.2184 (-1.82z)| lr 1.13e-04 | 4232.13 ms | 31.9% bf16 MFU | 123559 tok/s step 15002/19560 | loss 3.417430 (+3.05z)| norm 0.2319 (-0.68z)| lr 1.13e-04 | 4222.46 ms | 32.0% bf16 MFU | 123590 tok/s step 15003/19560 | loss 3.333754 (+0.98z)| norm 0.3135 (+5.40z)| lr 1.13e-04 | 4158.11 ms | 32.5% bf16 MFU | 123715 tok/s step 15004/19560 | loss 3.271824 (-0.53z)| norm 0.2563 (+1.15z)| lr 1.13e-04 | 4193.82 ms | 32.2% bf16 MFU | 123780 tok/s step 15005/19560 | loss 3.333576 (+0.96z)| norm 0.2393 (-0.11z)| lr 1.13e-04 | 4186.46 ms | 32.3% bf16 MFU | 123852 tok/s step 15006/19560 | loss 3.333263 (+0.94z)| norm 0.2329 (-0.57z)| lr 1.13e-04 | 4208.75 ms | 32.1% bf16 MFU | 123888 tok/s step 15007/19560 | loss 3.291679 (-0.07z)| norm 0.2546 (+1.01z)| lr 1.13e-04 | 4269.96 ms | 31.6% bf16 MFU | 
123833 tok/s step 15008/19560 | loss 3.276356 (-0.43z)| norm 0.2358 (-0.37z)| lr 1.13e-04 | 4375.41 ms | 30.9% bf16 MFU | 123633 tok/s step 15009/19560 | loss 3.240110 (-1.34z)| norm 0.2373 (-0.25z)| lr 1.13e-04 | 4154.51 ms | 32.5% bf16 MFU | 123761 tok/s step 15010/19560 | loss 3.261700 (-0.78z)| norm 0.2299 (-0.80z)| lr 1.13e-04 | 4160.33 ms | 32.5% bf16 MFU | 123874 tok/s step 15011/19560 | loss 3.285405 (-0.17z)| norm 0.2354 (-0.38z)| lr 1.13e-04 | 4208.33 ms | 32.1% bf16 MFU | 123909 tok/s step 15012/19560 | loss 3.447264 (+3.76z)| norm 0.2686 (+2.04z)| lr 1.13e-04 | 4167.56 ms | 32.4% bf16 MFU | 124004 tok/s step 15013/19560 | loss 3.350726 (+1.38z)| norm 0.2542 (+0.98z)| lr 1.13e-04 | 4263.93 ms | 31.7% bf16 MFU | 123952 tok/s step 15014/19560 | loss 3.315432 (+0.51z)| norm 0.2389 (-0.15z)| lr 1.13e-04 | 4163.32 ms | 32.4% bf16 MFU | 124051 tok/s step 15015/19560 | loss 3.280775 (-0.35z)| norm 0.2528 (+0.87z)| lr 1.13e-04 | 4191.40 ms | 32.2% bf16 MFU | 124103 tok/s step 15016/19560 | loss 3.367869 (+1.76z)| norm 0.2525 (+0.83z)| lr 1.13e-04 | 4217.08 ms | 32.0% bf16 MFU | 124114 tok/s step 15017/19560 | loss 3.277787 (-0.45z)| norm 0.2392 (-0.14z)| lr 1.13e-04 | 4204.26 ms | 32.1% bf16 MFU | 124143 tok/s step 15018/19560 | loss 3.297152 (+0.03z)| norm 0.2489 (+0.56z)| lr 1.13e-04 | 4169.69 ms | 32.4% bf16 MFU | 124223 tok/s step 15019/19560 | loss 3.209583 (-2.07z)| norm 0.2367 (-0.36z)| lr 1.13e-04 | 4210.10 ms | 32.1% bf16 MFU | 124238 tok/s step 15020/19560 | loss 3.308855 (+0.32z)| norm 0.2506 (+0.67z)| lr 1.12e-04 | 4303.15 ms | 31.4% bf16 MFU | 124118 tok/s step 15021/19560 | loss 3.257174 (-0.92z)| norm 0.2575 (+1.16z)| lr 1.12e-04 | 4167.70 ms | 32.4% bf16 MFU | 124202 tok/s step 15022/19560 | loss 3.247401 (-1.14z)| norm 0.2260 (-1.16z)| lr 1.12e-04 | 4165.18 ms | 32.4% bf16 MFU | 124286 tok/s step 15023/19560 | loss 3.286824 (-0.18z)| norm 0.2460 (+0.31z)| lr 1.12e-04 | 4202.63 ms | 32.1% bf16 MFU | 124309 tok/s step 15024/19560 | loss 3.252099 
(-1.03z)| norm 0.2360 (-0.43z)| lr 1.12e-04 | 4170.40 ms | 32.4% bf16 MFU | 124380 tok/s step 15025/19560 | loss 3.329439 (+0.84z)| norm 0.2260 (-1.16z)| lr 1.12e-04 | 4158.91 ms | 32.5% bf16 MFU | 124464 tok/s step 15026/19560 | loss 3.286556 (-0.21z)| norm 0.2391 (-0.20z)| lr 1.12e-04 | 4207.28 ms | 32.1% bf16 MFU | 124471 tok/s step 15027/19560 | loss 3.267802 (-0.67z)| norm 0.2364 (-0.40z)| lr 1.12e-04 | 4193.73 ms | 32.2% bf16 MFU | 124499 tok/s step 15028/19560 | loss 3.328696 (+0.82z)| norm 0.2283 (-0.98z)| lr 1.12e-04 | 4159.57 ms | 32.5% bf16 MFU | 124576 tok/s step 15029/19560 | loss 3.379494 (+2.02z)| norm 0.2491 (+0.55z)| lr 1.12e-04 | 4163.55 ms | 32.4% bf16 MFU | 124643 tok/s step 15030/19560 | loss 3.298366 (+0.06z)| norm 0.2322 (-0.69z)| lr 1.12e-04 | 4158.61 ms | 32.5% bf16 MFU | 124715 tok/s step 15031/19560 | loss 3.239203 (-1.34z)| norm 0.2415 (+0.01z)| lr 1.12e-04 | 4170.37 ms | 32.4% bf16 MFU | 124765 tok/s step 15032/19560 | loss 3.292057 (-0.07z)| norm 0.2331 (-0.61z)| lr 1.12e-04 | 4159.25 ms | 32.5% bf16 MFU | 124829 tok/s step 15033/19560 | loss 3.295565 (+0.01z)| norm 0.2517 (+0.77z)| lr 1.12e-04 | 4163.62 ms | 32.4% bf16 MFU | 124884 tok/s step 15034/19560 | loss 3.341138 (+1.11z)| norm 0.2395 (-0.13z)| lr 1.12e-04 | 4166.18 ms | 32.4% bf16 MFU | 124932 tok/s step 15035/19560 | loss 3.294969 (-0.02z)| norm 0.2391 (-0.16z)| lr 1.12e-04 | 4156.17 ms | 32.5% bf16 MFU | 124993 tok/s step 15036/19560 | loss 3.297639 (+0.05z)| norm 0.2442 (+0.23z)| lr 1.12e-04 | 4164.12 ms | 32.4% bf16 MFU | 125038 tok/s step 15037/19560 | loss 3.239986 (-1.34z)| norm 0.2379 (-0.24z)| lr 1.12e-04 | 4164.94 ms | 32.4% bf16 MFU | 125080 tok/s step 15038/19560 | loss 3.321627 (+0.64z)| norm 0.2333 (-0.57z)| lr 1.12e-04 | 4168.28 ms | 32.4% bf16 MFU | 125115 tok/s step 15039/19560 | loss 3.237219 (-1.38z)| norm 0.2486 (+0.57z)| lr 1.12e-04 | 4161.53 ms | 32.4% bf16 MFU | 125159 tok/s step 15040/19560 | loss 3.257103 (-0.90z)| norm 0.2540 (+0.97z)| lr 1.12e-04 | 
4166.75 ms | 32.4% bf16 MFU | 125192 tok/s step 15041/19560 | loss 3.287901 (-0.15z)| norm 0.2364 (-0.36z)| lr 1.11e-04 | 4245.00 ms | 31.8% bf16 MFU | 125108 tok/s step 15042/19560 | loss 3.245578 (-1.16z)| norm 0.2281 (-0.98z)| lr 1.11e-04 | 4174.31 ms | 32.3% bf16 MFU | 125133 tok/s step 15043/19560 | loss 3.223273 (-1.67z)| norm 0.2397 (-0.12z)| lr 1.11e-04 | 4171.04 ms | 32.4% bf16 MFU | 125161 tok/s step 15044/19560 | loss 3.255468 (-0.89z)| norm 0.2437 (+0.18z)| lr 1.11e-04 | 4159.62 ms | 32.5% bf16 MFU | 125205 tok/s step 15045/19560 | loss 3.255372 (-0.89z)| norm 0.2246 (-1.27z)| lr 1.11e-04 | 4158.01 ms | 32.5% bf16 MFU | 125249 tok/s step 15046/19560 | loss 3.290322 (-0.07z)| norm 0.2546 (+0.99z)| lr 1.11e-04 | 4162.18 ms | 32.4% bf16 MFU | 125285 tok/s step 15047/19560 | loss 3.258278 (-0.83z)| norm 0.2594 (+1.34z)| lr 1.11e-04 | 4160.01 ms | 32.5% bf16 MFU | 125322 tok/s step 15048/19560 | loss 3.306478 (+0.32z)| norm 0.2415 (-0.01z)| lr 1.11e-04 | 4167.78 ms | 32.4% bf16 MFU | 125346 tok/s step 15049/19560 | loss 3.227073 (-1.55z)| norm 0.2266 (-1.12z)| lr 1.11e-04 | 4172.51 ms | 32.4% bf16 MFU | 125361 tok/s step 15050/19560 | loss 3.290129 (-0.06z)| norm 0.2400 (-0.10z)| lr 1.11e-04 | 4201.12 ms | 32.1% bf16 MFU | 125333 tok/s step 15051/19560 | loss 3.229474 (-1.50z)| norm 0.2533 (+0.90z)| lr 1.11e-04 | 4161.67 ms | 32.4% bf16 MFU | 125365 tok/s step 15052/19560 | loss 3.355211 (+1.49z)| norm 0.2321 (-0.70z)| lr 1.11e-04 | 4165.54 ms | 32.4% bf16 MFU | 125390 tok/s step 15053/19560 | loss 3.239056 (-1.26z)| norm 0.2551 (+1.03z)| lr 1.11e-04 | 4153.63 ms | 32.5% bf16 MFU | 125432 tok/s step 15054/19560 | loss 3.275956 (-0.39z)| norm 0.2478 (+0.47z)| lr 1.11e-04 | 4170.69 ms | 32.4% bf16 MFU | 125446 tok/s step 15055/19560 | loss 3.245355 (-1.11z)| norm 0.2402 (-0.10z)| lr 1.11e-04 | 4158.49 ms | 32.5% bf16 MFU | 125477 tok/s step 15056/19560 | loss 3.328753 (+0.85z)| norm 0.2546 (+0.97z)| lr 1.11e-04 | 4165.99 ms | 32.4% bf16 MFU | 125496 tok/s step 
15057/19560 | loss 3.294679 (+0.05z)| norm 0.2561 (+1.07z)| lr 1.11e-04 | 4159.50 ms | 32.5% bf16 MFU | 125523 tok/s step 15058/19560 | loss 3.302153 (+0.22z)| norm 0.2652 (+1.71z)| lr 1.11e-04 | 4160.46 ms | 32.5% bf16 MFU | 125548 tok/s step 15059/19560 | loss 3.245942 (-1.10z)| norm 0.2511 (+0.66z)| lr 1.11e-04 | 4216.09 ms | 32.0% bf16 MFU | 125488 tok/s step 15060/19560 | loss 3.253633 (-0.92z)| norm 0.2480 (+0.42z)| lr 1.11e-04 | 4202.39 ms | 32.1% bf16 MFU | 125452 tok/s step 15061/19560 | loss 3.267710 (-0.58z)| norm 0.2539 (+0.85z)| lr 1.11e-04 | 4168.38 ms | 32.4% bf16 MFU | 125468 tok/s step 15062/19560 | loss 3.302075 (+0.22z)| norm 0.2480 (+0.40z)| lr 1.10e-04 | 4165.37 ms | 32.4% bf16 MFU | 125488 tok/s step 15063/19560 | loss 3.245750 (-1.11z)| norm 0.2481 (+0.40z)| lr 1.10e-04 | 4158.89 ms | 32.5% bf16 MFU | 125517 tok/s step 15064/19560 | loss 3.262173 (-0.71z)| norm 0.2391 (-0.26z)| lr 1.10e-04 | 4171.08 ms | 32.4% bf16 MFU | 125526 tok/s step 15065/19560 | loss 3.274518 (-0.41z)| norm 0.2359 (-0.49z)| lr 1.10e-04 | 4169.35 ms | 32.4% bf16 MFU | 125537 tok/s step 15066/19560 | loss 3.329184 (+0.90z)| norm 0.2490 (+0.49z)| lr 1.10e-04 | 4195.38 ms | 32.2% bf16 MFU | 125509 tok/s step 15067/19560 | loss 3.384084 (+2.17z)| norm 0.2356 (-0.51z)| lr 1.10e-04 | 4262.55 ms | 31.7% bf16 MFU | 125383 tok/s step 15068/19560 | loss 3.298694 (+0.16z)| norm 0.2308 (-0.86z)| lr 1.10e-04 | 4164.53 ms | 32.4% bf16 MFU | 125409 tok/s step 15069/19560 | loss 3.320079 (+0.67z)| norm 0.2493 (+0.59z)| lr 1.10e-04 | 4160.00 ms | 32.5% bf16 MFU | 125440 tok/s step 15070/19560 | loss 3.304427 (+0.29z)| norm 0.2594 (+1.38z)| lr 1.10e-04 | 4158.47 ms | 32.5% bf16 MFU | 125472 tok/s step 15071/19560 | loss 3.351231 (+1.38z)| norm 0.2560 (+1.10z)| lr 1.10e-04 | 4182.30 ms | 32.3% bf16 MFU | 125466 tok/s step 15072/19560 | loss 3.270033 (-0.52z)| norm 0.2546 (+1.04z)| lr 1.10e-04 | 4159.48 ms | 32.5% bf16 MFU | 125495 tok/s step 15073/19560 | loss 3.340641 (+1.21z)| norm 
0.2602 (+1.49z)| lr 1.10e-04 | 4167.86 ms | 32.4% bf16 MFU | 125510 tok/s step 15074/19560 | loss 3.411922 (+2.86z)| norm 1.1238 (+11.13z)| lr 1.10e-04 | 4163.84 ms | 32.4% bf16 MFU | 125530 tok/s step 15075/19560 | loss 3.259496 (-0.78z)| norm 0.2799 (+0.39z)| lr 1.10e-04 | 4208.50 ms | 32.1% bf16 MFU | 125483 tok/s step 15076/19560 | loss 3.259559 (-0.77z)| norm 0.2822 (+0.42z)| lr 1.10e-04 | 4162.81 ms | 32.4% bf16 MFU | 125506 tok/s step 15077/19560 | loss 3.284754 (-0.16z)| norm 0.2676 (+0.23z)| lr 1.10e-04 | 4162.97 ms | 32.4% bf16 MFU | 125527 tok/s step 15078/19560 | loss 3.300795 (+0.22z)| norm 0.2917 (+0.53z)| lr 1.10e-04 | 4165.35 ms | 32.4% bf16 MFU | 125545 tok/s step 15079/19560 | loss 3.317874 (+0.62z)| norm 0.2624 (+0.16z)| lr 1.10e-04 | 4165.52 ms | 32.4% bf16 MFU | 125560 tok/s step 15080/19560 | loss 3.289481 (-0.06z)| norm 0.2677 (+0.22z)| lr 1.10e-04 | 4161.53 ms | 32.4% bf16 MFU | 125582 tok/s step 15081/19560 | loss 3.224763 (-1.57z)| norm 0.2667 (+0.21z)| lr 1.10e-04 | 4160.48 ms | 32.5% bf16 MFU | 125603 tok/s step 15082/19560 | loss 3.266882 (-0.57z)| norm 0.2435 (-0.09z)| lr 1.10e-04 | 4150.72 ms | 32.5% bf16 MFU | 125639 tok/s step 15083/19560 | loss 3.263343 (-0.64z)| norm 0.2679 (+0.22z)| lr 1.10e-04 | 4165.41 ms | 32.4% bf16 MFU | 125650 tok/s step 15084/19560 | loss 3.287401 (-0.08z)| norm 0.2624 (+0.15z)| lr 1.09e-04 | 4157.93 ms | 32.5% bf16 MFU | 125672 tok/s step 15085/19560 | loss 3.281537 (-0.23z)| norm 0.2605 (+0.12z)| lr 1.09e-04 | 4162.88 ms | 32.4% bf16 MFU | 125686 tok/s step 15086/19560 | loss 3.309161 (+0.43z)| norm 0.2631 (+0.15z)| lr 1.09e-04 | 4163.82 ms | 32.4% bf16 MFU | 125697 tok/s step 15087/19560 | loss 3.308237 (+0.41z)| norm 0.2464 (-0.06z)| lr 1.09e-04 | 4159.90 ms | 32.5% bf16 MFU | 125714 tok/s step 15088/19560 | loss 3.261229 (-0.70z)| norm 0.2603 (+0.11z)| lr 1.09e-04 | 4158.84 ms | 32.5% bf16 MFU | 125732 tok/s step 15089/19560 | loss 3.273582 (-0.41z)| norm 0.2447 (-0.09z)| lr 1.09e-04 | 4162.61 ms | 
32.4% bf16 MFU | 125743 tok/s step 15090/19560 | loss 3.292146 (+0.03z)| norm 0.2405 (-0.14z)| lr 1.09e-04 | 4163.26 ms | 32.4% bf16 MFU | 125752 tok/s step 15091/19560 | loss 3.251896 (-0.92z)| norm 0.2434 (-0.10z)| lr 1.09e-04 | 4156.90 ms | 32.5% bf16 MFU | 125771 tok/s step 15092/19560 | loss 3.322170 (+0.73z)| norm 0.2472 (-0.05z)| lr 1.09e-04 | 4165.17 ms | 32.4% bf16 MFU | 125776 tok/s step 15093/19560 | loss 3.253796 (-0.90z)| norm 0.2278 (-0.30z)| lr 1.09e-04 | 4158.90 ms | 32.5% bf16 MFU | 125790 tok/s step 15094/19560 | loss 3.332806 (+0.98z)| norm 0.2538 (+0.03z)| lr 1.09e-04 | 4165.26 ms | 32.4% bf16 MFU | 125795 tok/s step 15095/19560 | loss 3.221946 (-1.67z)| norm 0.2345 (-0.22z)| lr 1.09e-04 | 4161.90 ms | 32.4% bf16 MFU | 125803 tok/s step 15096/19560 | loss 3.238393 (-1.26z)| norm 0.2511 (-0.01z)| lr 1.09e-04 | 4176.57 ms | 32.3% bf16 MFU | 125790 tok/s step 15097/19560 | loss 3.253807 (-0.89z)| norm 0.2413 (-0.13z)| lr 1.09e-04 | 4165.55 ms | 32.4% bf16 MFU | 125793 tok/s step 15098/19560 | loss 3.306302 (+0.36z)| norm 0.2394 (-0.16z)| lr 1.09e-04 | 4154.41 ms | 32.5% bf16 MFU | 125814 tok/s step 15099/19560 | loss 3.239921 (-1.20z)| norm 0.2327 (-0.24z)| lr 1.09e-04 | 4180.16 ms | 32.3% bf16 MFU | 125794 tok/s step 15100/19560 | loss 3.242321 (-1.13z)| norm 0.2707 (+0.24z)| lr 1.09e-04 | 4158.37 ms | 32.5% bf16 MFU | 125809 tok/s step 15101/19560 | loss 3.350942 (+1.40z)| norm 0.2539 (+0.02z)| lr 1.09e-04 | 4154.78 ms | 32.5% bf16 MFU | 125828 tok/s step 15102/19560 | loss 3.231576 (-1.36z)| norm 0.2651 (+0.16z)| lr 1.09e-04 | 4157.85 ms | 32.5% bf16 MFU | 125841 tok/s step 15103/19560 | loss 3.322071 (+0.73z)| norm 0.2429 (-0.12z)| lr 1.09e-04 | 4152.12 ms | 32.5% bf16 MFU | 125862 tok/s step 15104/19560 | loss 3.289283 (-0.01z)| norm 0.2524 (-0.00z)| lr 1.09e-04 | 4162.01 ms | 32.4% bf16 MFU | 125868 tok/s step 15105/19560 | loss 3.311915 (+0.52z)| norm 0.2562 (+0.05z)| lr 1.08e-04 | 4170.85 ms | 32.4% bf16 MFU | 125860 tok/s step 15106/19560 
| loss 3.309664 (+0.47z)| norm 0.2354 (-0.22z)| lr 1.08e-04 | 4159.38 ms | 32.5% bf16 MFU | 125869 tok/s step 15107/19560 | loss 3.288933 (-0.02z)| norm 0.2375 (-0.19z)| lr 1.08e-04 | 4159.61 ms | 32.5% bf16 MFU | 125878 tok/s step 15108/19560 | loss 3.330545 (+0.96z)| norm 0.2457 (-0.09z)| lr 1.08e-04 | 4162.62 ms | 32.4% bf16 MFU | 125881 tok/s step 15109/19560 | loss 3.237373 (-1.22z)| norm 0.2481 (-0.06z)| lr 1.08e-04 | 4166.42 ms | 32.4% bf16 MFU | 125879 tok/s step 15110/19560 | loss 3.268896 (-0.48z)| norm 0.2459 (-0.09z)| lr 1.08e-04 | 4166.97 ms | 32.4% bf16 MFU | 125876 tok/s step 15111/19560 | loss 3.230464 (-1.36z)| norm 0.2430 (-0.12z)| lr 1.08e-04 | 4186.51 ms | 32.3% bf16 MFU | 125844 tok/s step 15112/19560 | loss 3.346105 (+1.32z)| norm 0.2398 (-0.17z)| lr 1.08e-04 | 4157.20 ms | 32.5% bf16 MFU | 125858 tok/s step 15113/19560 | loss 3.251422 (-0.87z)| norm 0.2435 (-0.12z)| lr 1.08e-04 | 4158.50 ms | 32.5% bf16 MFU | 125869 tok/s step 15114/19560 | loss 3.330540 (+0.94z)| norm 0.2450 (-0.10z)| lr 1.08e-04 | 4156.08 ms | 32.5% bf16 MFU | 125883 tok/s step 15115/19560 | loss 3.278040 (-0.26z)| norm 0.2331 (-0.25z)| lr 1.08e-04 | 4161.40 ms | 32.4% bf16 MFU | 125888 tok/s step 15116/19560 | loss 3.233879 (-1.26z)| norm 0.2423 (-0.13z)| lr 1.08e-04 | 4155.97 ms | 32.5% bf16 MFU | 125901 tok/s step 15117/19560 | loss 3.298477 (+0.22z)| norm 0.2489 (-0.05z)| lr 1.08e-04 | 4163.94 ms | 32.4% bf16 MFU | 125902 tok/s step 15118/19560 | loss 3.248276 (-0.92z)| norm 0.2269 (-0.33z)| lr 1.08e-04 | 4155.95 ms | 32.5% bf16 MFU | 125914 tok/s step 15119/19560 | loss 3.220814 (-1.53z)| norm 0.2306 (-0.28z)| lr 1.08e-04 | 4166.80 ms | 32.4% bf16 MFU | 125910 tok/s step 15120/19560 | loss 3.261607 (-0.60z)| norm 0.2518 (-0.01z)| lr 1.08e-04 | 4206.83 ms | 32.1% bf16 MFU | 125846 tok/s step 15121/19560 | loss 3.335039 (+1.07z)| norm 0.2451 (-0.10z)| lr 1.08e-04 | 4163.79 ms | 32.4% bf16 MFU | 125849 tok/s step 15122/19560 | loss 3.304029 (+0.36z)| norm 0.2231 (-0.38z)| 
lr 1.08e-04 | 4167.52 ms | 32.4% bf16 MFU | 125847 tok/s step 15123/19560 | loss 3.256657 (-0.71z)| norm 0.2375 (-0.20z)| lr 1.08e-04 | 4233.20 ms | 31.9% bf16 MFU | 125747 tok/s step 15124/19560 | loss 3.285337 (-0.05z)| norm 0.2389 (-0.18z)| lr 1.08e-04 | 4164.08 ms | 32.4% bf16 MFU | 125755 tok/s step 15125/19560 | loss 3.249543 (-0.87z)| norm 0.2347 (-0.23z)| lr 1.08e-04 | 4173.49 ms | 32.4% bf16 MFU | 125749 tok/s step 15126/19560 | loss 3.304194 (+0.39z)| norm 0.2411 (-0.15z)| lr 1.08e-04 | 4160.72 ms | 32.5% bf16 MFU | 125762 tok/s step 15127/19560 | loss 3.258089 (-0.66z)| norm 0.2431 (-0.13z)| lr 1.07e-04 | 4166.15 ms | 32.4% bf16 MFU | 125766 tok/s step 15128/19560 | loss 3.240897 (-1.06z)| norm 0.2312 (-0.28z)| lr 1.07e-04 | 4167.15 ms | 32.4% bf16 MFU | 125768 tok/s step 15129/19560 | loss 3.300150 (+0.30z)| norm 0.2450 (-0.11z)| lr 1.07e-04 | 4176.00 ms | 32.3% bf16 MFU | 125757 tok/s step 15130/19560 | loss 3.335634 (+1.18z)| norm 0.2611 (+0.10z)| lr 1.07e-04 | 4161.04 ms | 32.4% bf16 MFU | 125769 tok/s step 15131/19560 | loss 3.276542 (-0.23z)| norm 0.2498 (-0.04z)| lr 1.07e-04 | 4163.53 ms | 32.4% bf16 MFU | 125777 tok/s step 15132/19560 | loss 3.254061 (-0.77z)| norm 0.2542 (+0.02z)| lr 1.07e-04 | 4162.46 ms | 32.4% bf16 MFU | 125786 tok/s step 15133/19560 | loss 3.267969 (-0.42z)| norm 0.2440 (-0.12z)| lr 1.07e-04 | 4165.80 ms | 32.4% bf16 MFU | 125789 tok/s step 15134/19560 | loss 3.310143 (+0.61z)| norm 0.2378 (-0.20z)| lr 1.07e-04 | 4150.60 ms | 32.5% bf16 MFU | 125816 tok/s step 15135/19560 | loss 3.362187 (+1.83z)| norm 0.2508 (-0.03z)| lr 1.07e-04 | 4159.63 ms | 32.5% bf16 MFU | 125827 tok/s step 15136/19560 | loss 3.246071 (-0.94z)| norm 0.2454 (-0.10z)| lr 1.07e-04 | 4180.18 ms | 32.3% bf16 MFU | 125807 tok/s step 15137/19560 | loss 3.359558 (+1.74z)| norm 0.2371 (-0.21z)| lr 1.07e-04 | 4159.26 ms | 32.5% bf16 MFU | 125819 tok/s step 15138/19560 | loss 3.320650 (+0.80z)| norm 0.2560 (+0.03z)| lr 1.07e-04 | 4160.02 ms | 32.5% bf16 MFU | 
125830 tok/s step 15139/19560 | loss 3.266596 (-0.48z)| norm 0.2408 (-0.16z)| lr 1.07e-04 | 4241.49 ms | 31.8% bf16 MFU | 125719 tok/s step 15140/19560 | loss 3.226249 (-1.46z)| norm 0.2311 (-0.28z)| lr 1.07e-04 | 4171.17 ms | 32.4% bf16 MFU | 125717 tok/s step 15141/19560 | loss 3.284009 (-0.01z)| norm 0.2373 (-0.20z)| lr 1.07e-04 | 4160.78 ms | 32.5% bf16 MFU | 125732 tok/s step 15142/19560 | loss 3.239041 (-1.13z)| norm 0.2336 (-0.25z)| lr 1.07e-04 | 4170.08 ms | 32.4% bf16 MFU | 125732 tok/s step 15143/19560 | loss 3.290894 (+0.18z)| norm 0.2443 (-0.11z)| lr 1.07e-04 | 4154.36 ms | 32.5% bf16 MFU | 125755 tok/s step 15144/19560 | loss 3.295706 (+0.32z)| norm 0.2407 (-0.15z)| lr 1.07e-04 | 4157.45 ms | 32.5% bf16 MFU | 125773 tok/s step 15145/19560 | loss 3.275279 (-0.21z)| norm 0.2340 (-0.24z)| lr 1.07e-04 | 4153.43 ms | 32.5% bf16 MFU | 125796 tok/s step 15146/19560 | loss 3.247767 (-0.90z)| norm 0.2324 (-0.26z)| lr 1.07e-04 | 4158.24 ms | 32.5% bf16 MFU | 125810 tok/s step 15147/19560 | loss 3.282266 (-0.03z)| norm 0.2396 (-0.17z)| lr 1.07e-04 | 4165.37 ms | 32.4% bf16 MFU | 125813 tok/s step 15148/19560 | loss 3.244993 (-0.98z)| norm 0.2200 (-0.41z)| lr 1.07e-04 | 4151.32 ms | 32.5% bf16 MFU | 125837 tok/s step 15149/19560 | loss 3.280105 (-0.08z)| norm 0.2405 (-0.15z)| lr 1.06e-04 | 4155.61 ms | 32.5% bf16 MFU | 125853 tok/s step 15150/19560 | loss 3.289847 (+0.16z)| norm 0.2338 (-0.24z)| lr 1.06e-04 | 4153.74 ms | 32.5% bf16 MFU | 125872 tok/s step 15151/19560 | loss 3.218455 (-1.66z)| norm 0.2324 (-0.25z)| lr 1.06e-04 | 4206.79 ms | 32.1% bf16 MFU | 125810 tok/s step 15152/19560 | loss 3.270886 (-0.32z)| norm 0.2415 (-0.14z)| lr 1.06e-04 | 4169.36 ms | 32.4% bf16 MFU | 125807 tok/s step 15153/19560 | loss 3.307963 (+0.64z)| norm 0.2224 (-0.38z)| lr 1.06e-04 | 4162.93 ms | 32.4% bf16 MFU | 125813 tok/s step 15154/19560 | loss 3.317914 (+0.89z)| norm 0.2362 (-0.20z)| lr 1.06e-04 | 4159.22 ms | 32.5% bf16 MFU | 125825 tok/s step 15155/19560 | loss 3.292344 
(+0.23z)| norm 0.2590 (+0.08z)| lr 1.06e-04 | 4158.25 ms | 32.5% bf16 MFU | 125838 tok/s step 15156/19560 | loss 3.244243 (-0.99z)| norm 0.2336 (-0.24z)| lr 1.06e-04 | 4157.52 ms | 32.5% bf16 MFU | 125852 tok/s step 15157/19560 | loss 3.281087 (-0.02z)| norm 0.2606 (+0.10z)| lr 1.06e-04 | 4156.11 ms | 32.5% bf16 MFU | 125866 tok/s step 15158/19560 | loss 3.314967 (+0.87z)| norm 0.2533 (+0.01z)| lr 1.06e-04 | 4159.11 ms | 32.5% bf16 MFU | 125876 tok/s step 15159/19560 | loss 3.291911 (+0.25z)| norm 0.2382 (-0.19z)| lr 1.06e-04 | 4171.18 ms | 32.4% bf16 MFU | 125867 tok/s step 15160/19560 | loss 3.265088 (-0.46z)| norm 0.2440 (-0.11z)| lr 1.06e-04 | 4152.07 ms | 32.5% bf16 MFU | 125887 tok/s step 15161/19560 | loss 3.266937 (-0.40z)| norm 0.2478 (-0.06z)| lr 1.06e-04 | 4157.27 ms | 32.5% bf16 MFU | 125898 tok/s step 15162/19560 | loss 3.237299 (-1.17z)| norm 0.2353 (-0.22z)| lr 1.06e-04 | 4164.88 ms | 32.4% bf16 MFU | 125898 tok/s step 15163/19560 | loss 3.271148 (-0.26z)| norm 0.2602 (+0.09z)| lr 1.06e-04 | 4156.96 ms | 32.5% bf16 MFU | 125909 tok/s step 15164/19560 | loss 3.252728 (-0.74z)| norm 0.2498 (-0.04z)| lr 1.06e-04 | 4158.90 ms | 32.5% bf16 MFU | 125917 tok/s step 15165/19560 | loss 3.237232 (-1.15z)| norm 0.2491 (-0.05z)| lr 1.06e-04 | 4168.23 ms | 32.4% bf16 MFU | 125910 tok/s step 15166/19560 | loss 3.281838 (+0.04z)| norm 0.2415 (-0.15z)| lr 1.06e-04 | 4149.49 ms | 32.5% bf16 MFU | 125932 tok/s step 15167/19560 | loss 3.259339 (-0.57z)| norm 0.2570 (+0.05z)| lr 1.06e-04 | 4161.30 ms | 32.4% bf16 MFU | 125935 tok/s step 15168/19560 | loss 3.281955 (+0.03z)| norm 0.2602 (+0.09z)| lr 1.06e-04 | 4162.20 ms | 32.4% bf16 MFU | 125936 tok/s step 15169/19560 | loss 3.298886 (+0.48z)| norm 0.2346 (-0.24z)| lr 1.06e-04 | 4161.56 ms | 32.4% bf16 MFU | 125939 tok/s step 15170/19560 | loss 3.304260 (+0.62z)| norm 0.2608 (+0.09z)| lr 1.05e-04 | 4351.26 ms | 31.0% bf16 MFU | 125666 tok/s step 15171/19560 | loss 3.293290 (+0.31z)| norm 0.2375 (-0.20z)| lr 1.05e-04 | 
5099.49 ms | 26.5% bf16 MFU | 124524 tok/s step 15172/19560 | loss 3.327723 (+1.22z)| norm 0.2833 (+0.38z)| lr 1.05e-04 | 4606.21 ms | 29.3% bf16 MFU | 123989 tok/s step 15173/19560 | loss 3.235263 (-1.26z)| norm 0.2397 (-0.18z)| lr 1.05e-04 | 4547.11 ms | 29.7% bf16 MFU | 123554 tok/s step 15174/19560 | loss 3.292462 (+0.27z)| norm 0.2513 (-0.03z)| lr 1.05e-04 | 4391.60 ms | 30.7% bf16 MFU | 123346 tok/s step 15175/19560 | loss 3.308247 (+0.69z)| norm 0.2444 (-0.12z)| lr 1.05e-04 | 4262.88 ms | 31.7% bf16 MFU | 123328 tok/s step 15176/19560 | loss 3.235180 (-1.26z)| norm 0.2434 (-0.13z)| lr 1.05e-04 | 4433.66 ms | 30.5% bf16 MFU | 123074 tok/s step 15177/19560 | loss 3.333255 (+1.35z)| norm 0.2474 (-0.08z)| lr 1.05e-04 | 4258.83 ms | 31.7% bf16 MFU | 123076 tok/s step 15178/19560 | loss 3.305159 (+0.59z)| norm 0.2401 (-0.18z)| lr 1.05e-04 | 4314.23 ms | 31.3% bf16 MFU | 122998 tok/s step 15179/19560 | loss 3.280060 (-0.09z)| norm 0.2350 (-0.24z)| lr 1.05e-04 | 4398.82 ms | 30.7% bf16 MFU | 122808 tok/s step 15180/19560 | loss 3.303754 (+0.57z)| norm 0.2390 (-0.19z)| lr 1.05e-04 | 4193.94 ms | 32.2% bf16 MFU | 122918 tok/s step 15181/19560 | loss 3.328109 (+1.21z)| norm 0.2354 (-0.23z)| lr 1.05e-04 | 4169.16 ms | 32.4% bf16 MFU | 123060 tok/s step 15182/19560 | loss 3.266350 (-0.47z)| norm 0.2231 (-0.39z)| lr 1.05e-04 | 4200.50 ms | 32.1% bf16 MFU | 123147 tok/s step 15183/19560 | loss 3.226148 (-1.56z)| norm 0.2384 (-0.19z)| lr 1.05e-04 | 4290.68 ms | 31.5% bf16 MFU | 123100 tok/s step 15184/19560 | loss 3.250040 (-0.90z)| norm 0.2305 (-0.29z)| lr 1.05e-04 | 4332.11 ms | 31.2% bf16 MFU | 122996 tok/s step 15185/19560 | loss 3.252484 (-0.82z)| norm 0.2617 (+0.11z)| lr 1.05e-04 | 4292.04 ms | 31.5% bf16 MFU | 122954 tok/s step 15186/19560 | loss 3.338495 (+1.50z)| norm 0.2394 (-0.17z)| lr 1.05e-04 | 4169.66 ms | 32.4% bf16 MFU | 123093 tok/s step 15187/19560 | loss 3.294940 (+0.32z)| norm 0.2390 (-0.18z)| lr 1.05e-04 | 4155.42 ms | 32.5% bf16 MFU | 123247 tok/s step 
15188/19560 | loss 3.262123 (-0.57z)| norm 0.2404 (-0.16z)| lr 1.05e-04 | 4169.40 ms | 32.4% bf16 MFU | 123372 tok/s step 15189/19560 | loss 3.287188 (+0.10z)| norm 0.2385 (-0.18z)| lr 1.05e-04 | 4167.29 ms | 32.4% bf16 MFU | 123494 tok/s step 15190/19560 | loss 3.341989 (+1.57z)| norm 0.2558 (+0.04z)| lr 1.05e-04 | 4255.21 ms | 31.7% bf16 MFU | 123480 tok/s step 15191/19560 | loss 3.303981 (+0.53z)| norm 0.2376 (-0.19z)| lr 1.05e-04 | 4305.19 ms | 31.4% bf16 MFU | 123395 tok/s step 15192/19560 | loss 3.304858 (+0.55z)| norm 0.2441 (-0.11z)| lr 1.04e-04 | 4593.01 ms | 29.4% bf16 MFU | 122932 tok/s step 15193/19560 | loss 3.335397 (+1.35z)| norm 0.2232 (-0.38z)| lr 1.04e-04 | 4157.95 ms | 32.5% bf16 MFU | 123090 tok/s step 15194/19560 | loss 3.310599 (+0.69z)| norm 0.2527 (-0.00z)| lr 1.04e-04 | 4166.77 ms | 32.4% bf16 MFU | 123227 tok/s step 15195/19560 | loss 3.256127 (-0.77z)| norm 0.2391 (-0.17z)| lr 1.04e-04 | 4172.23 ms | 32.4% bf16 MFU | 123349 tok/s step 15196/19560 | loss 3.283142 (-0.02z)| norm 0.2363 (-0.21z)| lr 1.04e-04 | 4209.04 ms | 32.1% bf16 MFU | 123410 tok/s step 15197/19560 | loss 3.244648 (-1.06z)| norm 0.2333 (-0.25z)| lr 1.04e-04 | 4238.75 ms | 31.9% bf16 MFU | 123424 tok/s step 15198/19560 | loss 3.281089 (-0.05z)| norm 0.2367 (-0.20z)| lr 1.04e-04 | 4188.38 ms | 32.2% bf16 MFU | 123511 tok/s step 15199/19560 | loss 3.337424 (+1.52z)| norm 0.2508 (-0.02z)| lr 1.04e-04 | 4168.66 ms | 32.4% bf16 MFU | 123624 tok/s step 15200/19560 | loss 3.330146 (+1.29z)| norm 0.2422 (-0.13z)| lr 1.04e-04 | 4156.11 ms | 32.5% bf16 MFU | 123750 tok/s step 15201/19560 | loss 3.258632 (-0.67z)| norm 0.2447 (-0.10z)| lr 1.04e-04 | 4161.59 ms | 32.4% bf16 MFU | 123862 tok/s step 15202/19560 | loss 3.308019 (+0.77z)| norm 0.2334 (-0.93z)| lr 1.04e-04 | 4167.84 ms | 32.4% bf16 MFU | 123958 tok/s step 15203/19560 | loss 3.327175 (+1.31z)| norm 0.2570 (+0.95z)| lr 1.04e-04 | 4164.02 ms | 32.4% bf16 MFU | 124056 tok/s step 15204/19560 | loss 3.300686 (+0.52z)| norm 
0.2379 (-0.57z)| lr 1.04e-04 | 4156.08 ms | 32.5% bf16 MFU | 124161 tok/s step 15205/19560 | loss 3.248165 (-1.00z)| norm 0.2491 (+0.38z)| lr 1.04e-04 | 4174.62 ms | 32.3% bf16 MFU | 124232 tok/s step 15206/19560 | loss 3.272177 (-0.29z)| norm 0.2288 (-1.37z)| lr 1.04e-04 | 4215.09 ms | 32.0% bf16 MFU | 124240 tok/s step 15207/19560 | loss 3.296105 (+0.41z)| norm 0.2367 (-0.65z)| lr 1.04e-04 | 4200.18 ms | 32.1% bf16 MFU | 124269 tok/s step 15208/19560 | loss 3.330406 (+1.39z)| norm 0.2620 (+1.65z)| lr 1.04e-04 | 4189.51 ms | 32.2% bf16 MFU | 124313 tok/s step 15209/19560 | loss 3.277847 (-0.15z)| norm 0.2385 (-0.48z)| lr 1.04e-04 | 4173.56 ms | 32.4% bf16 MFU | 124378 tok/s step 15210/19560 | loss 3.342406 (+1.71z)| norm 0.2422 (-0.13z)| lr 1.04e-04 | 4176.80 ms | 32.3% bf16 MFU | 124435 tok/s step 15211/19560 | loss 3.255400 (-0.81z)| norm 0.2596 (+1.49z)| lr 1.04e-04 | 4171.09 ms | 32.4% bf16 MFU | 124498 tok/s step 15212/19560 | loss 3.374083 (+2.54z)| norm 0.2503 (+0.64z)| lr 1.04e-04 | 4170.31 ms | 32.4% bf16 MFU | 124559 tok/s step 15213/19560 | loss 3.225730 (-1.62z)| norm 0.2688 (+2.35z)| lr 1.04e-04 | 4167.21 ms | 32.4% bf16 MFU | 124622 tok/s step 15214/19560 | loss 3.233224 (-1.38z)| norm 0.2374 (-0.56z)| lr 1.03e-04 | 4176.42 ms | 32.3% bf16 MFU | 124668 tok/s step 15215/19560 | loss 3.254154 (-0.79z)| norm 0.2476 (+0.40z)| lr 1.03e-04 | 4173.21 ms | 32.4% bf16 MFU | 124716 tok/s step 15216/19560 | loss 3.338291 (+1.52z)| norm 0.2499 (+0.62z)| lr 1.03e-04 | 4167.24 ms | 32.4% bf16 MFU | 124771 tok/s step 15217/19560 | loss 3.396178 (+2.98z)| norm 0.2427 (-0.05z)| lr 1.03e-04 | 4177.17 ms | 32.3% bf16 MFU | 124808 tok/s step 15218/19560 | loss 3.352355 (+1.78z)| norm 0.2527 (+0.89z)| lr 1.03e-04 | 4180.12 ms | 32.3% bf16 MFU | 124839 tok/s step 15219/19560 | loss 3.284439 (-0.01z)| norm 0.2354 (-0.75z)| lr 1.03e-04 | 4176.54 ms | 32.3% bf16 MFU | 124873 tok/s step 15220/19560 | loss 3.364892 (+2.07z)| norm 0.2541 (+1.01z)| lr 1.03e-04 | 4162.12 ms | 
32.4% bf16 MFU | 124928 tok/s step 15221/19560 | loss 3.266523 (-0.49z)| norm 0.2519 (+0.79z)| lr 1.03e-04 | 4178.14 ms | 32.3% bf16 MFU | 124956 tok/s step 15222/19560 | loss 3.370342 (+2.18z)| norm 0.2335 (-0.94z)| lr 1.03e-04 | 4168.31 ms | 32.4% bf16 MFU | 124997 tok/s step 15223/19560 | loss 3.313416 (+0.70z)| norm 0.2326 (-1.02z)| lr 1.03e-04 | 4174.26 ms | 32.3% bf16 MFU | 125027 tok/s step 15224/19560 | loss 3.353400 (+1.71z)| norm 0.2543 (+1.03z)| lr 1.03e-04 | 4174.24 ms | 32.3% bf16 MFU | 125056 tok/s step 15225/19560 | loss 3.302298 (+0.38z)| norm 0.2475 (+0.39z)| lr 1.03e-04 | 4173.53 ms | 32.4% bf16 MFU | 125084 tok/s step 15226/19560 | loss 3.301428 (+0.36z)| norm 0.2593 (+1.47z)| lr 1.03e-04 | 4172.94 ms | 32.4% bf16 MFU | 125112 tok/s step 15227/19560 | loss 3.293023 (+0.13z)| norm 0.2387 (-0.47z)| lr 1.03e-04 | 4176.42 ms | 32.3% bf16 MFU | 125133 tok/s step 15228/19560 | loss 3.297125 (+0.23z)| norm 0.2557 (+1.17z)| lr 1.03e-04 | 4169.35 ms | 32.4% bf16 MFU | 125164 tok/s step 15229/19560 | loss 3.292430 (+0.12z)| norm 0.2585 (+1.43z)| lr 1.03e-04 | 4172.94 ms | 32.4% bf16 MFU | 125188 tok/s step 15230/19560 | loss 3.236856 (-1.35z)| norm 0.2413 (-0.20z)| lr 1.03e-04 | 4217.99 ms | 32.0% bf16 MFU | 125143 tok/s step 15231/19560 | loss 3.320325 (+0.86z)| norm 0.2611 (+1.69z)| lr 1.03e-04 | 4160.59 ms | 32.5% bf16 MFU | 125187 tok/s step 15232/19560 | loss 3.282096 (-0.15z)| norm 0.2424 (-0.10z)| lr 1.03e-04 | 4169.58 ms | 32.4% bf16 MFU | 125214 tok/s step 15233/19560 | loss 3.335475 (+1.25z)| norm 0.2658 (+2.12z)| lr 1.03e-04 | 4159.78 ms | 32.5% bf16 MFU | 125255 tok/s step 15234/19560 | loss 3.349364 (+1.59z)| norm 0.2576 (+1.32z)| lr 1.03e-04 | 4174.26 ms | 32.3% bf16 MFU | 125273 tok/s step 15235/19560 | loss 3.312156 (+0.61z)| norm 0.2297 (-1.31z)| lr 1.03e-04 | 4169.88 ms | 32.4% bf16 MFU | 125296 tok/s step 15236/19560 | loss 3.321688 (+0.87z)| norm 0.2358 (-0.73z)| lr 1.02e-04 | 4182.79 ms | 32.3% bf16 MFU | 125298 tok/s step 15237/19560 
| loss 3.340157 (+1.33z)| norm 0.2381 (-0.50z)| lr 1.02e-04 | 4175.18 ms | 32.3% bf16 MFU | 125312 tok/s step 15238/19560 | loss 3.294701 (+0.14z)| norm 0.2505 (+0.65z)| lr 1.02e-04 | 4174.86 ms | 32.3% bf16 MFU | 125325 tok/s step 15239/19560 | loss 3.312860 (+0.60z)| norm 0.2404 (-0.29z)| lr 1.02e-04 | 4165.76 ms | 32.4% bf16 MFU | 125352 tok/s step 15240/19560 | loss 3.306103 (+0.43z)| norm 0.2339 (-0.89z)| lr 1.02e-04 | 4189.03 ms | 32.2% bf16 MFU | 125342 tok/s step 15241/19560 | loss 3.281046 (-0.24z)| norm 0.2362 (-0.67z)| lr 1.02e-04 | 4185.67 ms | 32.3% bf16 MFU | 125338 tok/s step 15242/19560 | loss 3.317954 (+0.75z)| norm 0.2614 (+1.66z)| lr 1.02e-04 | 4170.94 ms | 32.4% bf16 MFU | 125356 tok/s step 15243/19560 | loss 3.284665 (-0.14z)| norm 0.2335 (-0.93z)| lr 1.02e-04 | 4174.51 ms | 32.3% bf16 MFU | 125368 tok/s step 15244/19560 | loss 3.384900 (+2.47z)| norm 0.2471 (+0.33z)| lr 1.02e-04 | 4172.02 ms | 32.4% bf16 MFU | 125383 tok/s step 15245/19560 | loss 3.255449 (-0.93z)| norm 0.2345 (-0.82z)| lr 1.02e-04 | 4169.96 ms | 32.4% bf16 MFU | 125400 tok/s step 15246/19560 | loss 3.285670 (-0.15z)| norm 0.2438 (+0.02z)| lr 1.02e-04 | 4182.62 ms | 32.3% bf16 MFU | 125398 tok/s step 15247/19560 | loss 3.333383 (+1.10z)| norm 0.2339 (-0.91z)| lr 1.02e-04 | 4165.97 ms | 32.4% bf16 MFU | 125420 tok/s step 15248/19560 | loss 3.265553 (-0.71z)| norm 0.2372 (-0.59z)| lr 1.02e-04 | 4165.01 ms | 32.4% bf16 MFU | 125443 tok/s step 15249/19560 | loss 3.280524 (-0.30z)| norm 0.2289 (-1.34z)| lr 1.02e-04 | 4166.66 ms | 32.4% bf16 MFU | 125463 tok/s step 15250/19560 | loss 3.366246 (+1.95z)| norm 0.2434 (-0.01z)| lr 1.02e-04 | 4169.62 ms | 32.4% bf16 MFU | 125476 tok/s val loss 3.282277 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 
90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 
evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3019/10042 = 0.300637 step 15251/19560 | loss 3.325408 (+0.86z)| norm 0.2444 (+0.08z)| lr 1.02e-04 | 4161.72 ms | 32.4% bf16 MFU | 125502 tok/s step 15252/19560 | loss 3.327163 (+0.90z)| norm 0.2379 (-0.53z)| lr 1.02e-04 | 4183.15 ms | 32.3% bf16 MFU | 125493 tok/s step 15253/19560 | loss 3.319395 (+0.68z)| norm 0.2388 (-0.45z)| lr 1.02e-04 | 
4165.29 ms | 32.4% bf16 MFU | 125512 tok/s step 15254/19560 | loss 3.352928 (+1.54z)| norm 0.2433 (-0.03z)| lr 1.02e-04 | 4164.46 ms | 32.4% bf16 MFU | 125531 tok/s step 15255/19560 | loss 3.314633 (+0.53z)| norm 0.2328 (-1.01z)| lr 1.02e-04 | 4168.68 ms | 32.4% bf16 MFU | 125543 tok/s step 15256/19560 | loss 3.290400 (-0.12z)| norm 0.2462 (+0.24z)| lr 1.02e-04 | 4168.64 ms | 32.4% bf16 MFU | 125554 tok/s step 15257/19560 | loss 3.290554 (-0.11z)| norm 0.2397 (-0.37z)| lr 1.02e-04 | 4227.78 ms | 31.9% bf16 MFU | 125477 tok/s step 15258/19560 | loss 3.264528 (-0.78z)| norm 0.2660 (+2.10z)| lr 1.01e-04 | 4168.79 ms | 32.4% bf16 MFU | 125492 tok/s step 15259/19560 | loss 3.291212 (-0.08z)| norm 0.2288 (-1.37z)| lr 1.01e-04 | 4166.20 ms | 32.4% bf16 MFU | 125509 tok/s step 15260/19560 | loss 3.305171 (+0.28z)| norm 0.2367 (-0.62z)| lr 1.01e-04 | 4184.26 ms | 32.3% bf16 MFU | 125499 tok/s step 15261/19560 | loss 3.336040 (+1.08z)| norm 0.2515 (+0.75z)| lr 1.01e-04 | 4172.46 ms | 32.4% bf16 MFU | 125506 tok/s step 15262/19560 | loss 3.324179 (+0.77z)| norm 0.2280 (-1.42z)| lr 1.01e-04 | 4265.52 ms | 31.7% bf16 MFU | 125377 tok/s step 15263/19560 | loss 3.288721 (-0.16z)| norm 0.2359 (-0.68z)| lr 1.01e-04 | 4162.76 ms | 32.4% bf16 MFU | 125405 tok/s step 15264/19560 | loss 3.324561 (+0.79z)| norm 0.2295 (-1.25z)| lr 1.01e-04 | 4246.34 ms | 31.8% bf16 MFU | 125308 tok/s step 15265/19560 | loss 3.287582 (-0.20z)| norm 0.2513 (+0.75z)| lr 1.01e-04 | 4185.83 ms | 32.3% bf16 MFU | 125306 tok/s step 15266/19560 | loss 3.273159 (-0.58z)| norm 0.2378 (-0.48z)| lr 1.01e-04 | 4158.68 ms | 32.5% bf16 MFU | 125344 tok/s step 15267/19560 | loss 3.272480 (-0.60z)| norm 0.2358 (-0.67z)| lr 1.01e-04 | 4197.72 ms | 32.2% bf16 MFU | 125322 tok/s step 15268/19560 | loss 3.290757 (-0.12z)| norm 0.2442 (+0.10z)| lr 1.01e-04 | 4169.22 ms | 32.4% bf16 MFU | 125343 tok/s step 15269/19560 | loss 3.364752 (+1.89z)| norm 0.2640 (+1.89z)| lr 1.01e-04 | 4172.68 ms | 32.4% bf16 MFU | 125358 tok/s step 
15270/19560 | loss 3.275285 (-0.57z)| norm 0.2298 (-1.23z)| lr 1.01e-04 | 4276.04 ms | 31.6% bf16 MFU | 125221 tok/s step 15271/19560 | loss 3.314899 (+0.52z)| norm 0.2430 (-0.02z)| lr 1.01e-04 | 4226.27 ms | 31.9% bf16 MFU | 125163 tok/s step 15272/19560 | loss 3.260004 (-0.98z)| norm 0.2439 (+0.05z)| lr 1.01e-04 | 4172.52 ms | 32.4% bf16 MFU | 125187 tok/s step 15273/19560 | loss 3.310263 (+0.39z)| norm 0.2372 (-0.56z)| lr 1.01e-04 | 4412.03 ms | 30.6% bf16 MFU | 124869 tok/s step 15274/19560 | loss 3.313419 (+0.46z)| norm 0.2321 (-1.03z)| lr 1.01e-04 | 4173.99 ms | 32.3% bf16 MFU | 124906 tok/s step 15275/19560 | loss 3.372783 (+2.05z)| norm 0.2475 (+0.38z)| lr 1.01e-04 | 4164.92 ms | 32.4% bf16 MFU | 124955 tok/s step 15276/19560 | loss 3.456883 (+4.03z)| norm 0.2632 (+1.79z)| lr 1.01e-04 | 4173.14 ms | 32.4% bf16 MFU | 124989 tok/s step 15277/19560 | loss 3.279317 (-0.50z)| norm 0.2511 (+0.67z)| lr 1.01e-04 | 4165.24 ms | 32.4% bf16 MFU | 125033 tok/s step 15278/19560 | loss 3.281674 (-0.44z)| norm 0.2582 (+1.29z)| lr 1.01e-04 | 4179.58 ms | 32.3% bf16 MFU | 125054 tok/s step 15279/19560 | loss 3.357050 (+1.47z)| norm 0.2481 (+0.36z)| lr 1.01e-04 | 4174.93 ms | 32.3% bf16 MFU | 125080 tok/s step 15280/19560 | loss 3.305695 (+0.14z)| norm 0.2348 (-0.85z)| lr 1.01e-04 | 4158.21 ms | 32.5% bf16 MFU | 125130 tok/s step 15281/19560 | loss 3.349127 (+1.25z)| norm 0.2528 (+0.78z)| lr 1.00e-04 | 4178.36 ms | 32.3% bf16 MFU | 125148 tok/s step 15282/19560 | loss 3.332728 (+0.82z)| norm 0.2318 (-1.16z)| lr 1.00e-04 | 4168.97 ms | 32.4% bf16 MFU | 125178 tok/s step 15283/19560 | loss 3.332382 (+0.80z)| norm 0.2670 (+2.08z)| lr 1.00e-04 | 4181.43 ms | 32.3% bf16 MFU | 125188 tok/s step 15284/19560 | loss 3.315070 (+0.35z)| norm 0.2317 (-1.15z)| lr 1.00e-04 | 4163.49 ms | 32.4% bf16 MFU | 125225 tok/s step 15285/19560 | loss 3.282791 (-0.48z)| norm 0.2499 (+0.52z)| lr 1.00e-04 | 4175.23 ms | 32.3% bf16 MFU | 125243 tok/s step 15286/19560 | loss 3.323784 (+0.57z)| norm 
0.2534 (+0.84z)| lr 1.00e-04 | 4160.77 ms | 32.5% bf16 MFU | 125281 tok/s step 15287/19560 | loss 3.252174 (-1.25z)| norm 0.2443 (+0.00z)| lr 1.00e-04 | 4168.69 ms | 32.4% bf16 MFU | 125305 tok/s step 15288/19560 | loss 3.257674 (-1.11z)| norm 0.2516 (+0.67z)| lr 1.00e-04 | 4455.02 ms | 30.3% bf16 MFU | 124924 tok/s step 15289/19560 | loss 3.217108 (-2.10z)| norm 0.2483 (+0.36z)| lr 1.00e-04 | 4166.86 ms | 32.4% bf16 MFU | 124969 tok/s step 15290/19560 | loss 3.290578 (-0.27z)| norm 0.2368 (-0.70z)| lr 1.00e-04 | 4166.16 ms | 32.4% bf16 MFU | 125013 tok/s step 15291/19560 | loss 3.411512 (+2.70z)| norm 0.2634 (+1.75z)| lr 1.00e-04 | 4164.28 ms | 32.4% bf16 MFU | 125057 tok/s step 15292/19560 | loss 3.238084 (-1.58z)| norm 0.2569 (+1.14z)| lr 1.00e-04 | 4178.16 ms | 32.3% bf16 MFU | 125079 tok/s step 15293/19560 | loss 3.305695 (+0.07z)| norm 0.2608 (+1.48z)| lr 9.99e-05 | 4157.93 ms | 32.5% bf16 MFU | 125129 tok/s step 15294/19560 | loss 3.361113 (+1.42z)| norm 0.2498 (+0.47z)| lr 9.99e-05 | 4179.47 ms | 32.3% bf16 MFU | 125145 tok/s step 15295/19560 | loss 3.311172 (+0.18z)| norm 0.2532 (+0.78z)| lr 9.98e-05 | 4163.66 ms | 32.4% bf16 MFU | 125184 tok/s step 15296/19560 | loss 3.237838 (-1.61z)| norm 0.2374 (-0.64z)| lr 9.98e-05 | 4179.27 ms | 32.3% bf16 MFU | 125197 tok/s step 15297/19560 | loss 3.316458 (+0.32z)| norm 0.2366 (-0.71z)| lr 9.97e-05 | 4158.03 ms | 32.5% bf16 MFU | 125242 tok/s step 15298/19560 | loss 3.241294 (-1.50z)| norm 0.2530 (+0.80z)| lr 9.97e-05 | 4175.08 ms | 32.3% bf16 MFU | 125258 tok/s step 15299/19560 | loss 3.347513 (+1.07z)| norm 0.2546 (+0.93z)| lr 9.97e-05 | 4203.92 ms | 32.1% bf16 MFU | 125231 tok/s step 15300/19560 | loss 3.245207 (-1.39z)| norm 0.2356 (-0.82z)| lr 9.96e-05 | 4222.48 ms | 32.0% bf16 MFU | 125178 tok/s step 15301/19560 | loss 3.371980 (+1.64z)| norm 0.2361 (-0.77z)| lr 9.96e-05 | 4155.63 ms | 32.5% bf16 MFU | 125227 tok/s step 15302/19560 | loss 3.249522 (-1.29z)| norm 0.2450 (+0.10z)| lr 9.95e-05 | 4162.77 ms | 
32.4% bf16 MFU | 125263 tok/s step 15303/19560 | loss 3.340495 (+0.87z)| norm 0.2409 (-0.30z)| lr 9.95e-05 | 4170.79 ms | 32.4% bf16 MFU | 125285 tok/s step 15304/19560 | loss 3.311917 (+0.18z)| norm 0.2510 (+0.67z)| lr 9.94e-05 | 4168.87 ms | 32.4% bf16 MFU | 125309 tok/s step 15305/19560 | loss 3.257567 (-1.11z)| norm 0.2322 (-1.13z)| lr 9.94e-05 | 4162.81 ms | 32.4% bf16 MFU | 125341 tok/s step 15306/19560 | loss 3.271338 (-0.77z)| norm 0.2561 (+1.15z)| lr 9.93e-05 | 4169.45 ms | 32.4% bf16 MFU | 125361 tok/s step 15307/19560 | loss 3.258322 (-1.08z)| norm 0.2335 (-1.01z)| lr 9.93e-05 | 4177.10 ms | 32.3% bf16 MFU | 125369 tok/s step 15308/19560 | loss 3.382429 (+1.85z)| norm 0.2646 (+1.92z)| lr 9.92e-05 | 4172.18 ms | 32.4% bf16 MFU | 125384 tok/s step 15309/19560 | loss 3.241962 (-1.44z)| norm 0.2224 (-2.03z)| lr 9.92e-05 | 4174.11 ms | 32.3% bf16 MFU | 125395 tok/s step 15310/19560 | loss 3.281089 (-0.53z)| norm 0.2446 (+0.03z)| lr 9.92e-05 | 4179.40 ms | 32.3% bf16 MFU | 125397 tok/s step 15311/19560 | loss 3.333675 (+0.70z)| norm 0.2611 (+1.56z)| lr 9.91e-05 | 4170.66 ms | 32.4% bf16 MFU | 125413 tok/s step 15312/19560 | loss 3.304348 (-0.01z)| norm 0.2514 (+0.63z)| lr 9.91e-05 | 4184.35 ms | 32.3% bf16 MFU | 125407 tok/s step 15313/19560 | loss 3.576159 (+5.60z)| norm 0.2791 (+3.13z)| lr 9.90e-05 | 4171.90 ms | 32.4% bf16 MFU | 125420 tok/s step 15314/19560 | loss 3.299588 (-0.15z)| norm 0.2508 (+0.54z)| lr 9.90e-05 | 4172.34 ms | 32.4% bf16 MFU | 125432 tok/s step 15315/19560 | loss 3.317065 (+0.21z)| norm 0.2490 (+0.37z)| lr 9.89e-05 | 4161.99 ms | 32.4% bf16 MFU | 125459 tok/s step 15316/19560 | loss 3.235016 (-1.49z)| norm 0.2521 (+0.64z)| lr 9.89e-05 | 4173.50 ms | 32.4% bf16 MFU | 125467 tok/s step 15317/19560 | loss 3.370068 (+1.29z)| norm 0.2663 (+1.90z)| lr 9.88e-05 | 4169.75 ms | 32.4% bf16 MFU | 125481 tok/s step 15318/19560 | loss 3.308120 (+0.02z)| norm 0.2370 (-0.74z)| lr 9.88e-05 | 4167.63 ms | 32.4% bf16 MFU | 125497 tok/s step 15319/19560 
| loss 3.320331 (+0.27z)| norm 0.2412 (-0.36z)| lr 9.88e-05 | 4180.81 ms | 32.3% bf16 MFU | 125492 tok/s step 15320/19560 | loss 3.375329 (+1.38z)| norm 0.2715 (+2.31z)| lr 9.87e-05 | 4172.42 ms | 32.4% bf16 MFU | 125500 tok/s step 15321/19560 | loss 3.282070 (-0.52z)| norm 0.2319 (-1.21z)| lr 9.87e-05 | 4167.98 ms | 32.4% bf16 MFU | 125515 tok/s step 15322/19560 | loss 3.328263 (+0.42z)| norm 0.2574 (+1.06z)| lr 9.86e-05 | 4176.89 ms | 32.3% bf16 MFU | 125515 tok/s step 15323/19560 | loss 3.238460 (-1.41z)| norm 0.2481 (+0.23z)| lr 9.86e-05 | 4184.19 ms | 32.3% bf16 MFU | 125504 tok/s step 15324/19560 | loss 3.292430 (-0.31z)| norm 0.2461 (+0.04z)| lr 9.85e-05 | 4162.63 ms | 32.4% bf16 MFU | 125527 tok/s step 15325/19560 | loss 3.413727 (+2.12z)| norm 0.2763 (+2.65z)| lr 9.85e-05 | 4161.27 ms | 32.4% bf16 MFU | 125550 tok/s step 15326/19560 | loss 3.294755 (-0.29z)| norm 0.2498 (+0.32z)| lr 9.84e-05 | 4165.73 ms | 32.4% bf16 MFU | 125565 tok/s step 15327/19560 | loss 3.242612 (-1.32z)| norm 0.2404 (-0.49z)| lr 9.84e-05 | 4275.47 ms | 31.6% bf16 MFU | 125418 tok/s step 15328/19560 | loss 3.292439 (-0.31z)| norm 0.2355 (-0.91z)| lr 9.84e-05 | 4168.70 ms | 32.4% bf16 MFU | 125436 tok/s step 15329/19560 | loss 3.352433 (+0.88z)| norm 0.2488 (+0.25z)| lr 9.83e-05 | 4167.01 ms | 32.4% bf16 MFU | 125455 tok/s step 15330/19560 | loss 3.279002 (-0.59z)| norm 0.2264 (-1.69z)| lr 9.83e-05 | 4173.51 ms | 32.4% bf16 MFU | 125463 tok/s step 15331/19560 | loss 3.288812 (-0.39z)| norm 0.2351 (-0.92z)| lr 9.82e-05 | 4163.18 ms | 32.4% bf16 MFU | 125487 tok/s step 15332/19560 | loss 3.307216 (-0.02z)| norm 0.2329 (-1.11z)| lr 9.82e-05 | 4162.91 ms | 32.4% bf16 MFU | 125510 tok/s step 15333/19560 | loss 3.324969 (+0.33z)| norm 0.2471 (+0.12z)| lr 9.81e-05 | 4165.41 ms | 32.4% bf16 MFU | 125528 tok/s step 15334/19560 | loss 3.381375 (+1.44z)| norm 0.2432 (-0.23z)| lr 9.81e-05 | 4174.85 ms | 32.3% bf16 MFU | 125530 tok/s step 15335/19560 | loss 3.339753 (+0.60z)| norm 0.2398 (-0.52z)| 
lr 9.80e-05 | 4164.32 ms | 32.4% bf16 MFU | 125549 tok/s step 15336/19560 | loss 3.290175 (-0.39z)| norm 0.2386 (-0.62z)| lr 9.80e-05 | 4165.48 ms | 32.4% bf16 MFU | 125565 tok/s step 15337/19560 | loss 3.268615 (-0.82z)| norm 0.2438 (-0.17z)| lr 9.80e-05 | 4169.80 ms | 32.4% bf16 MFU | 125573 tok/s step 15338/19560 | loss 3.330677 (+0.42z)| norm 0.2342 (-1.00z)| lr 9.79e-05 | 4171.40 ms | 32.4% bf16 MFU | 125579 tok/s step 15339/19560 | loss 3.337851 (+0.56z)| norm 0.2410 (-0.40z)| lr 9.79e-05 | 4173.03 ms | 32.4% bf16 MFU | 125582 tok/s step 15340/19560 | loss 3.299822 (-0.20z)| norm 0.2432 (-0.20z)| lr 9.78e-05 | 4176.56 ms | 32.3% bf16 MFU | 125579 tok/s step 15341/19560 | loss 3.320674 (+0.21z)| norm 0.2410 (-0.38z)| lr 9.78e-05 | 4182.34 ms | 32.3% bf16 MFU | 125568 tok/s step 15342/19560 | loss 3.325955 (+0.31z)| norm 0.2442 (-0.09z)| lr 9.77e-05 | 4169.97 ms | 32.4% bf16 MFU | 125576 tok/s step 15343/19560 | loss 3.247216 (-1.32z)| norm 0.2601 (+1.32z)| lr 9.77e-05 | 4180.69 ms | 32.3% bf16 MFU | 125568 tok/s step 15344/19560 | loss 3.358877 (+0.98z)| norm 0.2408 (-0.40z)| lr 9.76e-05 | 4164.31 ms | 32.4% bf16 MFU | 125584 tok/s step 15345/19560 | loss 3.311205 (+0.01z)| norm 0.2635 (+1.60z)| lr 9.76e-05 | 4160.73 ms | 32.5% bf16 MFU | 125606 tok/s step 15346/19560 | loss 3.347484 (+0.77z)| norm 0.2638 (+1.60z)| lr 9.76e-05 | 4162.26 ms | 32.4% bf16 MFU | 125623 tok/s step 15347/19560 | loss 3.297402 (-0.28z)| norm 0.2556 (+0.87z)| lr 9.75e-05 | 4171.31 ms | 32.4% bf16 MFU | 125627 tok/s step 15348/19560 | loss 3.267739 (-0.88z)| norm 0.2447 (-0.08z)| lr 9.75e-05 | 4178.07 ms | 32.3% bf16 MFU | 125620 tok/s step 15349/19560 | loss 3.355599 (+0.94z)| norm 0.2549 (+0.81z)| lr 9.74e-05 | 4172.66 ms | 32.4% bf16 MFU | 125621 tok/s step 15350/19560 | loss 3.381407 (+1.48z)| norm 0.2594 (+1.19z)| lr 9.74e-05 | 4169.05 ms | 32.4% bf16 MFU | 125628 tok/s step 15351/19560 | loss 3.349650 (+0.81z)| norm 0.2534 (+0.65z)| lr 9.73e-05 | 4167.46 ms | 32.4% bf16 MFU | 
125637 tok/s step 15352/19560 | loss 3.270350 (-0.83z)| norm 0.2470 (+0.09z)| lr 9.73e-05 | 4183.92 ms | 32.3% bf16 MFU | 125620 tok/s step 15353/19560 | loss 3.230696 (-1.63z)| norm 0.2485 (+0.22z)| lr 9.72e-05 | 4190.75 ms | 32.2% bf16 MFU | 125595 tok/s step 15354/19560 | loss 3.276245 (-0.69z)| norm 0.2665 (+1.79z)| lr 9.72e-05 | 4166.46 ms | 32.4% bf16 MFU | 125607 tok/s step 15355/19560 | loss 3.292105 (-0.36z)| norm 0.2879 (+3.46z)| lr 9.72e-05 | 4172.70 ms | 32.4% bf16 MFU | 125609 tok/s step 15356/19560 | loss 3.355683 (+0.94z)| norm 0.2290 (-1.42z)| lr 9.71e-05 | 4173.98 ms | 32.3% bf16 MFU | 125609 tok/s step 15357/19560 | loss 3.297028 (-0.27z)| norm 0.2509 (+0.40z)| lr 9.71e-05 | 4165.65 ms | 32.4% bf16 MFU | 125621 tok/s step 15358/19560 | loss 3.222961 (-1.78z)| norm 0.2340 (-1.00z)| lr 9.70e-05 | 4172.00 ms | 32.4% bf16 MFU | 125624 tok/s step 15359/19560 | loss 3.311262 (+0.03z)| norm 0.2336 (-1.02z)| lr 9.70e-05 | 4167.08 ms | 32.4% bf16 MFU | 125633 tok/s step 15360/19560 | loss 3.342078 (+0.65z)| norm 0.2487 (+0.23z)| lr 9.69e-05 | 4164.76 ms | 32.4% bf16 MFU | 125646 tok/s step 15361/19560 | loss 3.298513 (-0.24z)| norm 0.2481 (+0.20z)| lr 9.69e-05 | 4587.99 ms | 29.4% bf16 MFU | 125077 tok/s step 15362/19560 | loss 3.274276 (-0.72z)| norm 0.2408 (-0.41z)| lr 9.69e-05 | 4597.33 ms | 29.4% bf16 MFU | 124526 tok/s step 15363/19560 | loss 3.260776 (-0.99z)| norm 0.2412 (-0.38z)| lr 9.68e-05 | 4426.36 ms | 30.5% bf16 MFU | 124222 tok/s step 15364/19560 | loss 3.237818 (-1.43z)| norm 0.2360 (-0.82z)| lr 9.68e-05 | 4274.08 ms | 31.6% bf16 MFU | 124144 tok/s step 15365/19560 | loss 3.246907 (-1.23z)| norm 0.2375 (-0.69z)| lr 9.67e-05 | 4313.78 ms | 31.3% bf16 MFU | 124014 tok/s step 15366/19560 | loss 3.280370 (-0.55z)| norm 0.2646 (+1.58z)| lr 9.67e-05 | 4169.08 ms | 32.4% bf16 MFU | 124101 tok/s step 15367/19560 | loss 3.330917 (+0.47z)| norm 0.2301 (-1.31z)| lr 9.66e-05 | 4294.91 ms | 31.4% bf16 MFU | 123999 tok/s step 15368/19560 | loss 3.374694 
(+1.33z)| norm 0.2547 (+0.73z)| lr 9.66e-05 | 4226.91 ms | 31.9% bf16 MFU | 124001 tok/s step 15369/19560 | loss 3.309772 (+0.03z)| norm 0.2385 (-0.62z)| lr 9.65e-05 | 4167.03 ms | 32.4% bf16 MFU | 124092 tok/s step 15370/19560 | loss 3.258353 (-0.99z)| norm 0.2372 (-0.71z)| lr 9.65e-05 | 4241.76 ms | 31.8% bf16 MFU | 124067 tok/s step 15371/19560 | loss 3.340596 (+0.64z)| norm 0.2711 (+2.08z)| lr 9.65e-05 | 4171.31 ms | 32.4% bf16 MFU | 124149 tok/s step 15372/19560 | loss 3.301776 (-0.12z)| norm 0.2329 (-1.07z)| lr 9.64e-05 | 4224.36 ms | 32.0% bf16 MFU | 124147 tok/s step 15373/19560 | loss 3.267580 (-0.81z)| norm 0.2355 (-0.86z)| lr 9.64e-05 | 4189.96 ms | 32.2% bf16 MFU | 124196 tok/s step 15374/19560 | loss 3.369981 (+1.23z)| norm 0.2558 (+0.81z)| lr 9.63e-05 | 4214.72 ms | 32.0% bf16 MFU | 124206 tok/s step 15375/19560 | loss 3.323456 (+0.30z)| norm 0.2417 (-0.36z)| lr 9.63e-05 | 4166.57 ms | 32.4% bf16 MFU | 124287 tok/s step 15376/19560 | loss 3.292687 (-0.32z)| norm 0.2398 (-0.52z)| lr 9.62e-05 | 4165.94 ms | 32.4% bf16 MFU | 124365 tok/s step 15377/19560 | loss 3.314483 (+0.11z)| norm 0.2567 (+0.87z)| lr 9.62e-05 | 4161.19 ms | 32.4% bf16 MFU | 124447 tok/s step 15378/19560 | loss 3.304294 (-0.08z)| norm 0.2433 (-0.25z)| lr 9.61e-05 | 4177.92 ms | 32.3% bf16 MFU | 124499 tok/s step 15379/19560 | loss 3.241117 (-1.34z)| norm 0.2435 (-0.24z)| lr 9.61e-05 | 4171.08 ms | 32.4% bf16 MFU | 124559 tok/s step 15380/19560 | loss 3.318890 (+0.22z)| norm 0.2474 (+0.08z)| lr 9.61e-05 | 4217.77 ms | 32.0% bf16 MFU | 124546 tok/s step 15381/19560 | loss 3.260339 (-0.94z)| norm 0.2452 (-0.11z)| lr 9.60e-05 | 4154.04 ms | 32.5% bf16 MFU | 124629 tok/s step 15382/19560 | loss 3.282226 (-0.49z)| norm 0.2258 (-1.69z)| lr 9.60e-05 | 4215.05 ms | 32.0% bf16 MFU | 124617 tok/s step 15383/19560 | loss 3.263570 (-0.86z)| norm 0.2322 (-1.17z)| lr 9.59e-05 | 4158.00 ms | 32.5% bf16 MFU | 124691 tok/s step 15384/19560 | loss 3.284693 (-0.43z)| norm 0.2448 (-0.12z)| lr 9.59e-05 | 
4162.69 ms | 32.4% bf16 MFU | 124754 tok/s step 15385/19560 | loss 3.252057 (-1.08z)| norm 0.2275 (-1.53z)| lr 9.58e-05 | 4187.25 ms | 32.2% bf16 MFU | 124777 tok/s step 15386/19560 | loss 3.255646 (-1.00z)| norm 0.2210 (-2.02z)| lr 9.58e-05 | 4202.28 ms | 32.1% bf16 MFU | 124776 tok/s step 15387/19560 | loss 3.258479 (-0.94z)| norm 0.2300 (-1.29z)| lr 9.57e-05 | 4162.15 ms | 32.4% bf16 MFU | 124835 tok/s step 15388/19560 | loss 3.288845 (-0.33z)| norm 0.2295 (-1.32z)| lr 9.57e-05 | 4158.93 ms | 32.5% bf16 MFU | 124897 tok/s step 15389/19560 | loss 3.310964 (+0.11z)| norm 0.2404 (-0.43z)| lr 9.57e-05 | 4205.12 ms | 32.1% bf16 MFU | 124886 tok/s step 15390/19560 | loss 3.272422 (-0.64z)| norm 0.2228 (-1.84z)| lr 9.56e-05 | 4159.16 ms | 32.5% bf16 MFU | 124944 tok/s step 15391/19560 | loss 3.249807 (-1.08z)| norm 0.2503 (+0.36z)| lr 9.56e-05 | 4170.36 ms | 32.4% bf16 MFU | 124983 tok/s step 15392/19560 | loss 3.270418 (-0.67z)| norm 0.2330 (-1.04z)| lr 9.55e-05 | 4172.95 ms | 32.4% bf16 MFU | 125016 tok/s step 15393/19560 | loss 3.280447 (-0.47z)| norm 0.2253 (-1.63z)| lr 9.55e-05 | 4164.12 ms | 32.4% bf16 MFU | 125060 tok/s step 15394/19560 | loss 3.266797 (-0.74z)| norm 0.2351 (-0.84z)| lr 9.54e-05 | 4171.86 ms | 32.4% bf16 MFU | 125091 tok/s step 15395/19560 | loss 3.272329 (-0.63z)| norm 0.2264 (-1.52z)| lr 9.54e-05 | 4166.67 ms | 32.4% bf16 MFU | 125128 tok/s step 15396/19560 | loss 3.302568 (-0.03z)| norm 0.2356 (-0.78z)| lr 9.54e-05 | 4169.95 ms | 32.4% bf16 MFU | 125158 tok/s step 15397/19560 | loss 3.296122 (-0.15z)| norm 0.2288 (-1.30z)| lr 9.53e-05 | 4160.52 ms | 32.5% bf16 MFU | 125201 tok/s step 15398/19560 | loss 3.323156 (+0.38z)| norm 0.2492 (+0.31z)| lr 9.53e-05 | 4162.23 ms | 32.4% bf16 MFU | 125239 tok/s step 15399/19560 | loss 3.298616 (-0.10z)| norm 0.2388 (-0.52z)| lr 9.52e-05 | 4171.86 ms | 32.4% bf16 MFU | 125261 tok/s step 15400/19560 | loss 3.285313 (-0.37z)| norm 0.2297 (-1.23z)| lr 9.52e-05 | 4164.12 ms | 32.4% bf16 MFU | 125293 tok/s step 
15401/19560 | loss 3.285121 (-0.37z)| norm 0.2311 (-1.11z)| lr 9.51e-05 | 4165.81 ms | 32.4% bf16 MFU | 125321 tok/s step 15402/19560 | loss 3.316988 (+0.26z)| norm 0.2387 (-0.52z)| lr 9.51e-05 | 4161.32 ms | 32.4% bf16 MFU | 125354 tok/s step 15403/19560 | loss 3.326440 (+0.46z)| norm 0.2458 (+0.05z)| lr 9.50e-05 | 4174.19 ms | 32.3% bf16 MFU | 125367 tok/s step 15404/19560 | loss 3.312056 (+0.20z)| norm 0.2350 (-0.80z)| lr 9.50e-05 | 4166.78 ms | 32.4% bf16 MFU | 125390 tok/s step 15405/19560 | loss 3.283324 (-0.40z)| norm 0.2508 (+0.47z)| lr 9.50e-05 | 4157.76 ms | 32.5% bf16 MFU | 125425 tok/s step 15406/19560 | loss 3.331923 (+0.60z)| norm 0.2420 (-0.22z)| lr 9.49e-05 | 4171.76 ms | 32.4% bf16 MFU | 125438 tok/s step 15407/19560 | loss 3.306665 (+0.09z)| norm 0.2253 (-1.54z)| lr 9.49e-05 | 4169.77 ms | 32.4% bf16 MFU | 125453 tok/s step 15408/19560 | loss 3.274070 (-0.59z)| norm 0.2456 (+0.07z)| lr 9.48e-05 | 4161.25 ms | 32.4% bf16 MFU | 125480 tok/s step 15409/19560 | loss 3.199845 (-2.08z)| norm 0.2350 (-0.76z)| lr 9.48e-05 | 4165.75 ms | 32.4% bf16 MFU | 125499 tok/s step 15410/19560 | loss 3.243017 (-1.18z)| norm 0.2346 (-0.79z)| lr 9.47e-05 | 4174.85 ms | 32.3% bf16 MFU | 125503 tok/s step 15411/19560 | loss 3.291387 (-0.18z)| norm 0.2304 (-1.12z)| lr 9.47e-05 | 4158.46 ms | 32.5% bf16 MFU | 125531 tok/s step 15412/19560 | loss 3.318969 (+0.39z)| norm 0.2319 (-1.00z)| lr 9.47e-05 | 4170.12 ms | 32.4% bf16 MFU | 125541 tok/s step 15413/19560 | loss 3.210245 (-1.81z)| norm 0.2420 (-0.19z)| lr 9.46e-05 | 4168.85 ms | 32.4% bf16 MFU | 125552 tok/s step 15414/19560 | loss 3.316709 (+0.35z)| norm 0.2417 (-0.20z)| lr 9.46e-05 | 4163.13 ms | 32.4% bf16 MFU | 125571 tok/s step 15415/19560 | loss 3.266861 (-0.66z)| norm 0.2286 (-1.24z)| lr 9.45e-05 | 4165.24 ms | 32.4% bf16 MFU | 125586 tok/s step 15416/19560 | loss 3.331607 (+0.64z)| norm 0.2451 (+0.09z)| lr 9.45e-05 | 4254.68 ms | 31.7% bf16 MFU | 125468 tok/s step 15417/19560 | loss 3.289263 (-0.23z)| norm 
0.2492 (+0.42z)| lr 9.44e-05 | 4169.59 ms | 32.4% bf16 MFU | 125482 tok/s step 15418/19560 | loss 3.326788 (+0.53z)| norm 0.2540 (+0.79z)| lr 9.44e-05 | 4168.21 ms | 32.4% bf16 MFU | 125497 tok/s step 15419/19560 | loss 3.269616 (-0.63z)| norm 0.2285 (-1.24z)| lr 9.43e-05 | 4164.38 ms | 32.4% bf16 MFU | 125517 tok/s step 15420/19560 | loss 3.242849 (-1.20z)| norm 0.2600 (+1.29z)| lr 9.43e-05 | 4162.64 ms | 32.4% bf16 MFU | 125539 tok/s step 15421/19560 | loss 3.275599 (-0.50z)| norm 0.2448 (+0.08z)| lr 9.43e-05 | 4166.61 ms | 32.4% bf16 MFU | 125553 tok/s step 15422/19560 | loss 3.280855 (-0.38z)| norm 0.2360 (-0.62z)| lr 9.42e-05 | 4159.58 ms | 32.5% bf16 MFU | 125578 tok/s step 15423/19560 | loss 3.266473 (-0.68z)| norm 0.2714 (+2.19z)| lr 9.42e-05 | 4202.67 ms | 32.1% bf16 MFU | 125537 tok/s step 15424/19560 | loss 3.272571 (-0.56z)| norm 0.2317 (-0.95z)| lr 9.41e-05 | 4173.64 ms | 32.4% bf16 MFU | 125541 tok/s step 15425/19560 | loss 3.242692 (-1.17z)| norm 0.2771 (+2.55z)| lr 9.41e-05 | 4164.84 ms | 32.4% bf16 MFU | 125558 tok/s step 15426/19560 | loss 3.269405 (-0.62z)| norm 0.2442 (+0.01z)| lr 9.40e-05 | 4159.24 ms | 32.5% bf16 MFU | 125583 tok/s step 15427/19560 | loss 3.240004 (-1.22z)| norm 0.2435 (-0.03z)| lr 9.40e-05 | 4201.78 ms | 32.1% bf16 MFU | 125542 tok/s step 15428/19560 | loss 3.294393 (-0.08z)| norm 0.2432 (-0.06z)| lr 9.40e-05 | 4165.33 ms | 32.4% bf16 MFU | 125559 tok/s step 15429/19560 | loss 3.307705 (+0.22z)| norm 0.2331 (-0.84z)| lr 9.39e-05 | 4158.80 ms | 32.5% bf16 MFU | 125584 tok/s step 15430/19560 | loss 3.240442 (-1.22z)| norm 0.2358 (-0.63z)| lr 9.39e-05 | 4171.37 ms | 32.4% bf16 MFU | 125589 tok/s step 15431/19560 | loss 3.281597 (-0.33z)| norm 0.2567 (+0.98z)| lr 9.38e-05 | 4167.38 ms | 32.4% bf16 MFU | 125600 tok/s step 15432/19560 | loss 3.343717 (+0.99z)| norm 0.2360 (-0.61z)| lr 9.38e-05 | 4204.50 ms | 32.1% bf16 MFU | 125555 tok/s step 15433/19560 | loss 3.301991 (+0.09z)| norm 0.2550 (+0.85z)| lr 9.37e-05 | 4178.08 ms | 
32.3% bf16 MFU | 125552 tok/s step 15434/19560 | loss 3.284820 (-0.28z)| norm 0.2622 (+1.39z)| lr 9.37e-05 | 4157.51 ms | 32.5% bf16 MFU | 125579 tok/s step 15435/19560 | loss 3.283545 (-0.31z)| norm 0.2519 (+0.59z)| lr 9.36e-05 | 4172.05 ms | 32.4% bf16 MFU | 125584 tok/s step 15436/19560 | loss 3.318748 (+0.46z)| norm 0.2565 (+0.96z)| lr 9.36e-05 | 4161.50 ms | 32.4% bf16 MFU | 125604 tok/s step 15437/19560 | loss 3.295224 (-0.06z)| norm 0.2600 (+1.21z)| lr 9.36e-05 | 4166.60 ms | 32.4% bf16 MFU | 125615 tok/s step 15438/19560 | loss 3.261898 (-0.78z)| norm 0.2131 (-2.37z)| lr 9.35e-05 | 4174.06 ms | 32.3% bf16 MFU | 125615 tok/s step 15439/19560 | loss 3.328873 (+0.68z)| norm 0.2491 (+0.38z)| lr 9.35e-05 | 4170.30 ms | 32.4% bf16 MFU | 125620 tok/s step 15440/19560 | loss 3.296201 (-0.03z)| norm 0.2417 (-0.18z)| lr 9.34e-05 | 4166.74 ms | 32.4% bf16 MFU | 125630 tok/s step 15441/19560 | loss 3.252943 (-1.09z)| norm 0.2353 (-0.66z)| lr 9.34e-05 | 4168.33 ms | 32.4% bf16 MFU | 125638 tok/s step 15442/19560 | loss 3.306359 (+0.29z)| norm 0.2387 (-0.39z)| lr 9.33e-05 | 4163.98 ms | 32.4% bf16 MFU | 125651 tok/s step 15443/19560 | loss 3.366385 (+1.81z)| norm 0.2685 (+1.92z)| lr 9.33e-05 | 4162.75 ms | 32.4% bf16 MFU | 125666 tok/s step 15444/19560 | loss 3.338717 (+1.09z)| norm 0.2419 (-0.14z)| lr 9.33e-05 | 4170.32 ms | 32.4% bf16 MFU | 125669 tok/s step 15445/19560 | loss 3.341013 (+1.16z)| norm 0.2448 (+0.10z)| lr 9.32e-05 | 4163.82 ms | 32.4% bf16 MFU | 125681 tok/s step 15446/19560 | loss 3.253476 (-1.09z)| norm 0.2449 (+0.10z)| lr 9.32e-05 | 4181.54 ms | 32.3% bf16 MFU | 125666 tok/s step 15447/19560 | loss 3.321908 (+0.68z)| norm 0.2532 (+0.75z)| lr 9.31e-05 | 4172.81 ms | 32.4% bf16 MFU | 125665 tok/s step 15448/19560 | loss 3.262519 (-0.85z)| norm 0.2427 (-0.06z)| lr 9.31e-05 | 4172.64 ms | 32.4% bf16 MFU | 125664 tok/s step 15449/19560 | loss 3.294708 (-0.01z)| norm 0.2472 (+0.29z)| lr 9.30e-05 | 4166.02 ms | 32.4% bf16 MFU | 125673 tok/s step 15450/19560 
| loss 3.262965 (-0.83z)| norm 0.2327 (-0.86z)| lr 9.30e-05 | 4169.20 ms | 32.4% bf16 MFU | 125677 tok/s step 15451/19560 | loss 3.403063 (+2.75z)| norm 0.2453 (+0.16z)| lr 9.30e-05 | 4163.74 ms | 32.4% bf16 MFU | 125689 tok/s step 15452/19560 | loss 3.264979 (-0.78z)| norm 0.2507 (+0.59z)| lr 9.29e-05 | 4160.38 ms | 32.5% bf16 MFU | 125706 tok/s step 15453/19560 | loss 3.344151 (+1.30z)| norm 0.2537 (+0.87z)| lr 9.29e-05 | 4179.19 ms | 32.3% bf16 MFU | 125693 tok/s step 15454/19560 | loss 3.314302 (+0.50z)| norm 0.2464 (+0.26z)| lr 9.28e-05 | 4166.12 ms | 32.4% bf16 MFU | 125701 tok/s step 15455/19560 | loss 3.297899 (+0.06z)| norm 0.2409 (-0.19z)| lr 9.28e-05 | 4168.64 ms | 32.4% bf16 MFU | 125704 tok/s step 15456/19560 | loss 3.271819 (-0.63z)| norm 0.2549 (+0.95z)| lr 9.27e-05 | 4163.22 ms | 32.4% bf16 MFU | 125716 tok/s step 15457/19560 | loss 3.275393 (-0.52z)| norm 0.2488 (+0.45z)| lr 9.27e-05 | 4230.61 ms | 31.9% bf16 MFU | 125626 tok/s step 15458/19560 | loss 3.293271 (-0.04z)| norm 0.2472 (+0.31z)| lr 9.27e-05 | 4163.63 ms | 32.4% bf16 MFU | 125641 tok/s step 15459/19560 | loss 3.274023 (-0.55z)| norm 0.2379 (-0.47z)| lr 9.26e-05 | 4168.48 ms | 32.4% bf16 MFU | 125648 tok/s step 15460/19560 | loss 3.342550 (+1.26z)| norm 0.2355 (-0.68z)| lr 9.26e-05 | 4157.57 ms | 32.5% bf16 MFU | 125671 tok/s step 15461/19560 | loss 3.253088 (-1.10z)| norm 0.2491 (+0.46z)| lr 9.25e-05 | 4163.17 ms | 32.4% bf16 MFU | 125684 tok/s step 15462/19560 | loss 3.305876 (+0.32z)| norm 0.2380 (-0.46z)| lr 9.25e-05 | 4178.00 ms | 32.3% bf16 MFU | 125674 tok/s step 15463/19560 | loss 3.281239 (-0.33z)| norm 0.2469 (+0.28z)| lr 9.24e-05 | 4167.47 ms | 32.4% bf16 MFU | 125680 tok/s step 15464/19560 | loss 3.318799 (+0.68z)| norm 0.2551 (+0.95z)| lr 9.24e-05 | 4162.22 ms | 32.4% bf16 MFU | 125695 tok/s step 15465/19560 | loss 3.324374 (+0.82z)| norm 0.2266 (-1.40z)| lr 9.23e-05 | 4159.43 ms | 32.5% bf16 MFU | 125712 tok/s step 15466/19560 | loss 3.309249 (+0.42z)| norm 0.2481 (+0.37z)| 
lr 9.23e-05 | 4167.15 ms | 32.4% bf16 MFU | 125717 tok/s step 15467/19560 | loss 3.278722 (-0.40z)| norm 0.2572 (+1.10z)| lr 9.23e-05 | 4164.75 ms | 32.4% bf16 MFU | 125726 tok/s step 15468/19560 | loss 3.239992 (-1.44z)| norm 0.2470 (+0.26z)| lr 9.22e-05 | 4164.91 ms | 32.4% bf16 MFU | 125734 tok/s step 15469/19560 | loss 3.262446 (-0.82z)| norm 0.2321 (-0.95z)| lr 9.22e-05 | 4163.53 ms | 32.4% bf16 MFU | 125743 tok/s step 15470/19560 | loss 3.296373 (+0.11z)| norm 0.2559 (+0.98z)| lr 9.21e-05 | 4164.37 ms | 32.4% bf16 MFU | 125751 tok/s step 15471/19560 | loss 3.279349 (-0.36z)| norm 0.2415 (-0.19z)| lr 9.21e-05 | 4164.64 ms | 32.4% bf16 MFU | 125758 tok/s step 15472/19560 | loss 3.314930 (+0.63z)| norm 0.2943 (+3.88z)| lr 9.20e-05 | 4156.60 ms | 32.5% bf16 MFU | 125777 tok/s step 15473/19560 | loss 3.278603 (-0.37z)| norm 0.2408 (-0.25z)| lr 9.20e-05 | 4158.23 ms | 32.5% bf16 MFU | 125792 tok/s step 15474/19560 | loss 3.222020 (-1.90z)| norm 0.2511 (+0.57z)| lr 9.20e-05 | 4173.80 ms | 32.3% bf16 MFU | 125783 tok/s step 15475/19560 | loss 3.288950 (-0.06z)| norm 0.2447 (+0.07z)| lr 9.19e-05 | 4164.80 ms | 32.4% bf16 MFU | 125788 tok/s step 15476/19560 | loss 3.253082 (-1.04z)| norm 0.2392 (-0.36z)| lr 9.19e-05 | 4161.71 ms | 32.4% bf16 MFU | 125798 tok/s step 15477/19560 | loss 3.325542 (+0.97z)| norm 0.2521 (+0.66z)| lr 9.18e-05 | 4260.69 ms | 31.7% bf16 MFU | 125661 tok/s step 15478/19560 | loss 3.208553 (-2.25z)| norm 0.2495 (+0.46z)| lr 9.18e-05 | 4173.73 ms | 32.3% bf16 MFU | 125658 tok/s step 15479/19560 | loss 3.268722 (-0.56z)| norm 0.2365 (-0.55z)| lr 9.17e-05 | 4165.16 ms | 32.4% bf16 MFU | 125669 tok/s step 15480/19560 | loss 3.287924 (-0.02z)| norm 0.2537 (+0.80z)| lr 9.17e-05 | 4163.20 ms | 32.4% bf16 MFU | 125682 tok/s step 15481/19560 | loss 3.277253 (-0.34z)| norm 0.2405 (-0.24z)| lr 9.17e-05 | 4169.64 ms | 32.4% bf16 MFU | 125685 tok/s step 15482/19560 | loss 3.398743 (+2.98z)| norm 0.2393 (-0.32z)| lr 9.16e-05 | 4167.65 ms | 32.4% bf16 MFU | 
125691 tok/s step 15483/19560 | loss 3.274772 (-0.42z)| norm 0.2343 (-0.72z)| lr 9.16e-05 | 4168.01 ms | 32.4% bf16 MFU | 125696 tok/s step 15484/19560 | loss 3.286431 (-0.08z)| norm 0.2372 (-0.49z)| lr 9.15e-05 | 4160.53 ms | 32.5% bf16 MFU | 125712 tok/s step 15485/19560 | loss 3.206334 (-2.24z)| norm 0.2498 (+0.59z)| lr 9.15e-05 | 4165.97 ms | 32.4% bf16 MFU | 125719 tok/s step 15486/19560 | loss 3.298696 (+0.26z)| norm 0.2301 (-1.08z)| lr 9.14e-05 | 4177.50 ms | 32.3% bf16 MFU | 125708 tok/s step 15487/19560 | loss 3.282130 (-0.19z)| norm 0.2429 (-0.01z)| lr 9.14e-05 | 4163.02 ms | 32.4% bf16 MFU | 125719 tok/s step 15488/19560 | loss 3.235071 (-1.47z)| norm 0.2380 (-0.41z)| lr 9.14e-05 | 4160.71 ms | 32.5% bf16 MFU | 125734 tok/s step 15489/19560 | loss 3.284903 (-0.09z)| norm 0.2408 (-0.17z)| lr 9.13e-05 | 4167.02 ms | 32.4% bf16 MFU | 125738 tok/s step 15490/19560 | loss 3.320776 (+0.89z)| norm 0.2479 (+0.43z)| lr 9.13e-05 | 4158.89 ms | 32.5% bf16 MFU | 125754 tok/s step 15491/19560 | loss 3.305763 (+0.47z)| norm 0.2468 (+0.33z)| lr 9.12e-05 | 4167.79 ms | 32.4% bf16 MFU | 125757 tok/s step 15492/19560 | loss 3.253039 (-1.00z)| norm 0.2519 (+0.75z)| lr 9.12e-05 | 4161.33 ms | 32.4% bf16 MFU | 125768 tok/s step 15493/19560 | loss 3.332183 (+1.18z)| norm 0.2497 (+0.56z)| lr 9.11e-05 | 4158.03 ms | 32.5% bf16 MFU | 125784 tok/s step 15494/19560 | loss 3.308282 (+0.51z)| norm 0.2532 (+0.87z)| lr 9.11e-05 | 4179.82 ms | 32.3% bf16 MFU | 125767 tok/s step 15495/19560 | loss 3.288167 (-0.04z)| norm 0.2341 (-0.77z)| lr 9.11e-05 | 4162.80 ms | 32.4% bf16 MFU | 125776 tok/s step 15496/19560 | loss 3.300702 (+0.34z)| norm 0.2400 (-0.26z)| lr 9.10e-05 | 4237.54 ms | 31.9% bf16 MFU | 125673 tok/s step 15497/19560 | loss 3.242003 (-1.31z)| norm 0.2588 (+1.34z)| lr 9.10e-05 | 4167.57 ms | 32.4% bf16 MFU | 125680 tok/s step 15498/19560 | loss 3.327175 (+1.08z)| norm 0.2312 (-1.02z)| lr 9.09e-05 | 4173.48 ms | 32.4% bf16 MFU | 125677 tok/s step 15499/19560 | loss 3.268470 
(-0.57z)| norm 0.2452 (+0.20z)| lr 9.09e-05 | 4170.90 ms | 32.4% bf16 MFU | 125678 tok/s step 15500/19560 | loss 3.307872 (+0.55z)| norm 0.2292 (-1.19z)| lr 9.08e-05 | 4183.09 ms | 32.3% bf16 MFU | 125661 tok/s val loss 3.277862 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating 
HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 
evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3030/10042 = 0.301733 step 15501/19560 | loss 3.309370 (+0.59z)| norm 0.2572 (+1.23z)| lr 9.08e-05 | 4159.68 ms | 32.5% bf16 MFU | 125680 tok/s step 15502/19560 | loss 3.234327 (-1.54z)| norm 0.2304 (-1.08z)| lr 9.08e-05 | 4156.80 ms | 32.5% bf16 MFU | 125702 tok/s step 15503/19560 | loss 3.239623 (-1.36z)| norm 0.2350 (-0.67z)| lr 9.07e-05 | 4162.68 ms | 32.4% bf16 MFU | 125715 tok/s step 15504/19560 | loss 3.312358 (+0.72z)| norm 0.2395 (-0.29z)| lr 9.07e-05 | 4208.65 ms | 32.1% bf16 MFU | 125658 tok/s step 15505/19560 | loss 3.299043 (+0.34z)| norm 0.2411 (-0.13z)| lr 9.06e-05 | 4223.60 ms | 32.0% bf16 MFU | 125581 tok/s step 15506/19560 | loss 3.279001 (-0.23z)| norm 0.2456 (+0.26z)| lr 9.06e-05 | 4232.67 ms | 31.9% bf16 MFU | 125496 tok/s step 15507/19560 | loss 3.250495 (-1.05z)| norm 0.2617 (+1.63z)| lr 9.05e-05 | 4200.04 ms | 32.1% bf16 MFU | 125462 tok/s step 15508/19560 | loss 3.294469 (+0.22z)| norm 0.2428 (+0.01z)| lr 9.05e-05 | 4163.90 ms | 32.4% bf16 MFU | 125485 tok/s step 15509/19560 | loss 3.237435 (-1.41z)| norm 0.2581 (+1.31z)| lr 9.05e-05 | 4156.72 ms | 32.5% bf16 MFU | 125517 tok/s step 15510/19560 | loss 3.335131 (+1.37z)| norm 0.2596 (+1.41z)| lr 9.04e-05 | 4160.76 ms | 32.5% bf16 MFU | 125542 tok/s step 15511/19560 | loss 3.287688 (+0.01z)| norm 0.2276 (-1.33z)| lr 9.04e-05 | 4161.55 ms | 32.4% bf16 MFU | 125564 tok/s step 15512/19560 | loss 3.286129 (-0.03z)| norm 0.2523 (+0.78z)| lr 9.03e-05 | 4160.44 ms | 32.5% bf16 MFU | 125586 tok/s step 15513/19560 | loss 3.287859 (+0.01z)| norm 0.2429 (-0.03z)| lr 9.03e-05 | 4324.72 ms | 31.2% bf16 MFU | 125369 tok/s step 15514/19560 | loss 3.304580 (+0.48z)| norm 0.2341 (-0.81z)| lr 9.02e-05 | 4234.15 ms | 31.9% bf16 MFU | 125291 tok/s step 15515/19560 | loss 3.250447 (-1.07z)| norm 0.2413 (-0.18z)| lr 9.02e-05 | 4161.08 ms | 32.4% 
bf16 MFU | 125327 tok/s step 15516/19560 | loss 3.270374 (-0.49z)| norm 0.2464 (+0.25z)| lr 9.02e-05 | 4160.26 ms | 32.5% bf16 MFU | 125362 tok/s step 15517/19560 | loss 3.301063 (+0.39z)| norm 0.2294 (-1.23z)| lr 9.01e-05 | 4161.07 ms | 32.4% bf16 MFU | 125393 tok/s step 15518/19560 | loss 3.375715 (+2.45z)| norm 0.2930 (+4.04z)| lr 9.01e-05 | 4161.34 ms | 32.4% bf16 MFU | 125423 tok/s step 15519/19560 | loss 3.335941 (+1.31z)| norm 0.2555 (+0.94z)| lr 9.00e-05 | 4191.81 ms | 32.2% bf16 MFU | 125406 tok/s step 15520/19560 | loss 3.308833 (+0.55z)| norm 0.2494 (+0.43z)| lr 9.00e-05 | 4161.13 ms | 32.4% bf16 MFU | 125435 tok/s step 15521/19560 | loss 3.308956 (+0.54z)| norm 0.2502 (+0.48z)| lr 8.99e-05 | 4170.36 ms | 32.4% bf16 MFU | 125449 tok/s step 15522/19560 | loss 3.300120 (+0.29z)| norm 0.2442 (-0.02z)| lr 8.99e-05 | 4191.57 ms | 32.2% bf16 MFU | 125431 tok/s step 15523/19560 | loss 3.266906 (-0.64z)| norm 0.2316 (-1.09z)| lr 8.99e-05 | 4161.24 ms | 32.4% bf16 MFU | 125459 tok/s step 15524/19560 | loss 3.314731 (+0.69z)| norm 0.2433 (-0.11z)| lr 8.98e-05 | 4202.38 ms | 32.1% bf16 MFU | 125424 tok/s step 15525/19560 | loss 3.300705 (+0.30z)| norm 0.2550 (+0.86z)| lr 8.98e-05 | 4164.63 ms | 32.4% bf16 MFU | 125447 tok/s step 15526/19560 | loss 3.350646 (+1.68z)| norm 0.2372 (-0.64z)| lr 8.97e-05 | 4166.14 ms | 32.4% bf16 MFU | 125467 tok/s step 15527/19560 | loss 3.236768 (-1.45z)| norm 0.2489 (+0.35z)| lr 8.97e-05 | 4157.01 ms | 32.5% bf16 MFU | 125500 tok/s step 15528/19560 | loss 3.258183 (-0.86z)| norm 0.2384 (-0.55z)| lr 8.96e-05 | 4161.09 ms | 32.4% bf16 MFU | 125525 tok/s step 15529/19560 | loss 3.282606 (-0.19z)| norm 0.2362 (-0.74z)| lr 8.96e-05 | 4158.26 ms | 32.5% bf16 MFU | 125553 tok/s step 15530/19560 | loss 3.245823 (-1.17z)| norm 0.2307 (-1.19z)| lr 8.96e-05 | 4181.93 ms | 32.3% bf16 MFU | 125544 tok/s step 15531/19560 | loss 3.258320 (-0.82z)| norm 0.2475 (+0.23z)| lr 8.95e-05 | 4191.90 ms | 32.2% bf16 MFU | 125520 tok/s step 15532/19560 | loss 
3.349489 (+1.65z)| norm 0.2590 (+1.18z)| lr 8.95e-05 | 4164.36 ms | 32.4% bf16 MFU | 125539 tok/s step 15533/19560 | loss 3.260972 (-0.74z)| norm 0.2509 (+0.49z)| lr 8.94e-05 | 4161.82 ms | 32.4% bf16 MFU | 125561 tok/s step 15534/19560 | loss 3.279285 (-0.24z)| norm 0.2572 (+1.01z)| lr 8.94e-05 | 4168.50 ms | 32.4% bf16 MFU | 125572 tok/s step 15535/19560 | loss 3.282432 (-0.15z)| norm 0.2579 (+1.06z)| lr 8.93e-05 | 4172.62 ms | 32.4% bf16 MFU | 125575 tok/s step 15536/19560 | loss 3.252881 (-0.94z)| norm 0.2278 (-1.47z)| lr 8.93e-05 | 4160.94 ms | 32.4% bf16 MFU | 125597 tok/s step 15537/19560 | loss 3.286891 (-0.04z)| norm 0.2396 (-0.48z)| lr 8.93e-05 | 4173.63 ms | 32.4% bf16 MFU | 125598 tok/s step 15538/19560 | loss 3.302175 (+0.37z)| norm 0.2545 (+0.76z)| lr 8.92e-05 | 4258.17 ms | 31.7% bf16 MFU | 125474 tok/s step 15539/19560 | loss 3.246824 (-1.16z)| norm 0.2437 (-0.15z)| lr 8.92e-05 | 4163.33 ms | 32.4% bf16 MFU | 125497 tok/s step 15540/19560 | loss 3.280563 (-0.21z)| norm 0.2374 (-0.70z)| lr 8.91e-05 | 4214.04 ms | 32.0% bf16 MFU | 125443 tok/s step 15541/19560 | loss 3.328771 (+1.12z)| norm 0.2412 (-0.37z)| lr 8.91e-05 | 4168.35 ms | 32.4% bf16 MFU | 125460 tok/s step 15542/19560 | loss 3.297779 (+0.25z)| norm 0.2529 (+0.61z)| lr 8.90e-05 | 4165.60 ms | 32.4% bf16 MFU | 125480 tok/s step 15543/19560 | loss 3.294341 (+0.15z)| norm 0.2524 (+0.56z)| lr 8.90e-05 | 4166.70 ms | 32.4% bf16 MFU | 125497 tok/s step 15544/19560 | loss 3.276207 (-0.36z)| norm 0.2304 (-1.31z)| lr 8.90e-05 | 4161.15 ms | 32.4% bf16 MFU | 125522 tok/s step 15545/19560 | loss 3.318739 (+0.85z)| norm 0.2428 (-0.25z)| lr 8.89e-05 | 4169.80 ms | 32.4% bf16 MFU | 125533 tok/s step 15546/19560 | loss 3.300545 (+0.34z)| norm 0.2336 (-1.02z)| lr 8.89e-05 | 4176.42 ms | 32.3% bf16 MFU | 125533 tok/s step 15547/19560 | loss 3.210723 (-2.17z)| norm 0.2371 (-0.73z)| lr 8.88e-05 | 4155.71 ms | 32.5% bf16 MFU | 125564 tok/s step 15548/19560 | loss 3.332830 (+1.23z)| norm 0.2237 (-1.83z)| lr 
8.88e-05 | 4173.93 ms | 32.3% bf16 MFU | 125567 tok/s step 15549/19560 | loss 3.317671 (+0.79z)| norm 0.2753 (+2.46z)| lr 8.87e-05 | 4164.78 ms | 32.4% bf16 MFU | 125582 tok/s step 15550/19560 | loss 3.295073 (+0.16z)| norm 0.2698 (+1.95z)| lr 8.87e-05 | 4172.93 ms | 32.4% bf16 MFU | 125585 tok/s step 15551/19560 | loss 3.287656 (-0.05z)| norm 0.2279 (-1.45z)| lr 8.87e-05 | 4241.87 ms | 31.8% bf16 MFU | 125486 tok/s step 15552/19560 | loss 3.312682 (+0.64z)| norm 0.2698 (+1.96z)| lr 8.86e-05 | 4674.37 ms | 28.9% bf16 MFU | 124820 tok/s step 15553/19560 | loss 3.345936 (+1.54z)| norm 0.2495 (+0.33z)| lr 8.86e-05 | 4542.33 ms | 29.7% bf16 MFU | 124350 tok/s step 15554/19560 | loss 3.345644 (+1.51z)| norm 0.2423 (-0.27z)| lr 8.85e-05 | 4613.32 ms | 29.3% bf16 MFU | 123815 tok/s step 15555/19560 | loss 3.331718 (+1.10z)| norm 0.2492 (+0.30z)| lr 8.85e-05 | 4338.07 ms | 31.1% bf16 MFU | 123667 tok/s step 15556/19560 | loss 3.347004 (+1.50z)| norm 0.2296 (-1.32z)| lr 8.84e-05 | 4488.34 ms | 30.1% bf16 MFU | 123324 tok/s step 15557/19560 | loss 3.232404 (-1.62z)| norm 0.2348 (-0.89z)| lr 8.84e-05 | 4398.92 ms | 30.7% bf16 MFU | 123117 tok/s step 15558/19560 | loss 3.253881 (-1.04z)| norm 0.2439 (-0.14z)| lr 8.84e-05 | 4508.03 ms | 30.0% bf16 MFU | 122776 tok/s step 15559/19560 | loss 3.303620 (+0.31z)| norm 0.2579 (+1.03z)| lr 8.83e-05 | 4280.00 ms | 31.5% bf16 MFU | 122762 tok/s step 15560/19560 | loss 3.335152 (+1.18z)| norm 0.2381 (-0.62z)| lr 8.83e-05 | 4358.73 ms | 31.0% bf16 MFU | 122639 tok/s step 15561/19560 | loss 3.303942 (+0.33z)| norm 0.2189 (-2.17z)| lr 8.82e-05 | 4237.08 ms | 31.9% bf16 MFU | 122694 tok/s step 15562/19560 | loss 3.276791 (-0.42z)| norm 0.2422 (-0.24z)| lr 8.82e-05 | 4222.52 ms | 32.0% bf16 MFU | 122767 tok/s step 15563/19560 | loss 3.314387 (+0.61z)| norm 0.2426 (-0.21z)| lr 8.81e-05 | 4216.34 ms | 32.0% bf16 MFU | 122846 tok/s step 15564/19560 | loss 3.369754 (+2.08z)| norm 0.2472 (+0.18z)| lr 8.81e-05 | 4219.37 ms | 32.0% bf16 MFU | 122917 
tok/s step 15565/19560 | loss 3.303719 (+0.30z)| norm 0.2420 (-0.24z)| lr 8.81e-05 | 4278.13 ms | 31.6% bf16 MFU | 122898 tok/s step 15566/19560 | loss 3.439820 (+3.72z)| norm 0.2392 (-0.51z)| lr 8.80e-05 | 4272.10 ms | 31.6% bf16 MFU | 122890 tok/s step 15567/19560 | loss 3.333844 (+1.01z)| norm 0.2346 (-0.89z)| lr 8.80e-05 | 4212.61 ms | 32.1% bf16 MFU | 122968 tok/s step 15568/19560 | loss 3.312290 (+0.46z)| norm 0.2332 (-0.99z)| lr 8.79e-05 | 4220.44 ms | 32.0% bf16 MFU | 123031 tok/s step 15569/19560 | loss 3.300623 (+0.15z)| norm 0.2546 (+0.81z)| lr 8.79e-05 | 4230.10 ms | 31.9% bf16 MFU | 123076 tok/s step 15570/19560 | loss 3.340824 (+1.17z)| norm 0.2262 (-1.58z)| lr 8.79e-05 | 4202.88 ms | 32.1% bf16 MFU | 123160 tok/s step 15571/19560 | loss 3.302613 (+0.21z)| norm 0.2414 (-0.29z)| lr 8.78e-05 | 4241.47 ms | 31.8% bf16 MFU | 123182 tok/s step 15572/19560 | loss 3.329741 (+0.91z)| norm 0.2377 (-0.60z)| lr 8.78e-05 | 4227.53 ms | 31.9% bf16 MFU | 123224 tok/s step 15573/19560 | loss 3.248993 (-1.15z)| norm 0.2349 (-0.83z)| lr 8.77e-05 | 4229.56 ms | 31.9% bf16 MFU | 123261 tok/s step 15574/19560 | loss 3.297845 (+0.10z)| norm 0.2395 (-0.43z)| lr 8.77e-05 | 4215.01 ms | 32.0% bf16 MFU | 123317 tok/s step 15575/19560 | loss 3.325480 (+0.82z)| norm 0.2337 (-0.92z)| lr 8.76e-05 | 4167.78 ms | 32.4% bf16 MFU | 123441 tok/s step 15576/19560 | loss 3.328941 (+0.89z)| norm 0.2428 (-0.14z)| lr 8.76e-05 | 4165.70 ms | 32.4% bf16 MFU | 123562 tok/s step 15577/19560 | loss 3.383371 (+2.24z)| norm 0.2531 (+0.73z)| lr 8.76e-05 | 4239.59 ms | 31.8% bf16 MFU | 123567 tok/s step 15578/19560 | loss 3.261334 (-0.86z)| norm 0.2359 (-0.74z)| lr 8.75e-05 | 4162.79 ms | 32.4% bf16 MFU | 123686 tok/s step 15579/19560 | loss 3.350277 (+1.44z)| norm 0.2373 (-0.61z)| lr 8.75e-05 | 4324.65 ms | 31.2% bf16 MFU | 123563 tok/s step 15580/19560 | loss 3.272923 (-0.57z)| norm 0.2345 (-0.83z)| lr 8.74e-05 | 4188.63 ms | 32.2% bf16 MFU | 123644 tok/s step 15581/19560 | loss 3.334826 
(+1.04z)| norm 0.2411 (-0.26z)| lr 8.74e-05 | 4173.27 ms | 32.4% bf16 MFU | 123743 tok/s step 15582/19560 | loss 3.309005 (+0.37z)| norm 0.2215 (-1.89z)| lr 8.73e-05 | 4178.93 ms | 32.3% bf16 MFU | 123829 tok/s step 15583/19560 | loss 3.312597 (+0.46z)| norm 0.2423 (-0.15z)| lr 8.73e-05 | 4197.12 ms | 32.2% bf16 MFU | 123883 tok/s step 15584/19560 | loss 3.325591 (+0.79z)| norm 0.2302 (-1.15z)| lr 8.73e-05 | 4182.32 ms | 32.3% bf16 MFU | 123957 tok/s step 15585/19560 | loss 3.317610 (+0.57z)| norm 0.2438 (+0.00z)| lr 8.72e-05 | 4192.63 ms | 32.2% bf16 MFU | 124012 tok/s step 15586/19560 | loss 3.340937 (+1.16z)| norm 0.2357 (-0.67z)| lr 8.72e-05 | 4234.72 ms | 31.9% bf16 MFU | 124001 tok/s step 15587/19560 | loss 3.340351 (+1.13z)| norm 0.2448 (+0.08z)| lr 8.71e-05 | 4180.67 ms | 32.3% bf16 MFU | 124072 tok/s step 15588/19560 | loss 3.299655 (+0.09z)| norm 0.2316 (-1.02z)| lr 8.71e-05 | 4170.02 ms | 32.4% bf16 MFU | 124154 tok/s step 15589/19560 | loss 3.245306 (-1.32z)| norm 0.2282 (-1.28z)| lr 8.71e-05 | 4171.18 ms | 32.4% bf16 MFU | 124231 tok/s step 15590/19560 | loss 3.309834 (+0.35z)| norm 0.2407 (-0.25z)| lr 8.70e-05 | 4175.61 ms | 32.3% bf16 MFU | 124298 tok/s step 15591/19560 | loss 3.325197 (+0.74z)| norm 0.2382 (-0.44z)| lr 8.70e-05 | 4171.20 ms | 32.4% bf16 MFU | 124368 tok/s step 15592/19560 | loss 3.322799 (+0.68z)| norm 0.2451 (+0.14z)| lr 8.69e-05 | 4169.22 ms | 32.4% bf16 MFU | 124437 tok/s step 15593/19560 | loss 3.223358 (-1.85z)| norm 0.2264 (-1.43z)| lr 8.69e-05 | 4169.06 ms | 32.4% bf16 MFU | 124503 tok/s step 15594/19560 | loss 3.303915 (+0.21z)| norm 0.2310 (-1.02z)| lr 8.68e-05 | 4342.81 ms | 31.1% bf16 MFU | 124314 tok/s step 15595/19560 | loss 3.289119 (-0.17z)| norm 0.2439 (+0.06z)| lr 8.68e-05 | 4294.97 ms | 31.4% bf16 MFU | 124202 tok/s step 15596/19560 | loss 3.310271 (+0.36z)| norm 0.2278 (-1.27z)| lr 8.68e-05 | 4163.62 ms | 32.4% bf16 MFU | 124288 tok/s step 15597/19560 | loss 3.223124 (-1.86z)| norm 0.2413 (-0.16z)| lr 8.67e-05 | 
4168.35 ms | 32.4% bf16 MFU | 124362 tok/s step 15598/19560 | loss 3.507672 (+4.83z)| norm 0.2656 (+1.85z)| lr 8.67e-05 | 4274.40 ms | 31.6% bf16 MFU | 124277 tok/s step 15599/19560 | loss 3.314474 (+0.38z)| norm 0.2297 (-1.11z)| lr 8.66e-05 | 4189.63 ms | 32.2% bf16 MFU | 124320 tok/s step 15600/19560 | loss 3.340485 (+0.97z)| norm 0.2369 (-0.51z)| lr 8.66e-05 | 4235.16 ms | 31.9% bf16 MFU | 124294 tok/s step 15601/19560 | loss 3.372810 (+1.68z)| norm 0.2172 (-2.20z)| lr 8.65e-05 | 4175.81 ms | 32.3% bf16 MFU | 124357 tok/s step 15602/19560 | loss 3.289267 (-0.23z)| norm 0.2341 (-0.72z)| lr 8.65e-05 | 4176.68 ms | 32.3% bf16 MFU | 124415 tok/s step 15603/19560 | loss 3.303796 (+0.10z)| norm 0.2408 (-0.13z)| lr 8.65e-05 | 4183.02 ms | 32.3% bf16 MFU | 124461 tok/s step 15604/19560 | loss 3.297750 (-0.05z)| norm 0.2288 (-1.16z)| lr 8.64e-05 | 4168.55 ms | 32.4% bf16 MFU | 124527 tok/s step 15605/19560 | loss 3.305926 (+0.14z)| norm 0.2337 (-0.72z)| lr 8.64e-05 | 4235.68 ms | 31.9% bf16 MFU | 124490 tok/s step 15606/19560 | loss 3.363184 (+1.45z)| norm 0.2452 (+0.27z)| lr 8.63e-05 | 4181.34 ms | 32.3% bf16 MFU | 124534 tok/s step 15607/19560 | loss 3.325172 (+0.56z)| norm 0.2404 (-0.15z)| lr 8.63e-05 | 4183.12 ms | 32.3% bf16 MFU | 124574 tok/s step 15608/19560 | loss 3.306416 (+0.12z)| norm 0.2281 (-1.20z)| lr 8.63e-05 | 4168.51 ms | 32.4% bf16 MFU | 124634 tok/s step 15609/19560 | loss 3.308065 (+0.15z)| norm 0.2606 (+1.59z)| lr 8.62e-05 | 4174.58 ms | 32.3% bf16 MFU | 124682 tok/s step 15610/19560 | loss 3.272065 (-0.68z)| norm 0.2387 (-0.29z)| lr 8.62e-05 | 4183.70 ms | 32.3% bf16 MFU | 124714 tok/s step 15611/19560 | loss 3.292378 (-0.20z)| norm 0.2332 (-0.76z)| lr 8.61e-05 | 4172.34 ms | 32.4% bf16 MFU | 124761 tok/s step 15612/19560 | loss 3.241369 (-1.40z)| norm 0.2424 (+0.03z)| lr 8.61e-05 | 4241.69 ms | 31.8% bf16 MFU | 124703 tok/s step 15613/19560 | loss 3.270703 (-0.73z)| norm 0.2404 (-0.14z)| lr 8.60e-05 | 4174.65 ms | 32.3% bf16 MFU | 124748 tok/s step 
15614/19560 | loss 3.328273 (+0.65z)| norm 0.2200 (-1.86z)| lr 8.60e-05 | 4257.88 ms | 31.7% bf16 MFU | 124667 tok/s step 15615/19560 | loss 3.330554 (+0.70z)| norm 0.2281 (-1.16z)| lr 8.60e-05 | 4173.57 ms | 32.4% bf16 MFU | 124715 tok/s step 15616/19560 | loss 3.384663 (+1.96z)| norm 0.2465 (+0.39z)| lr 8.59e-05 | 4173.70 ms | 32.3% bf16 MFU | 124760 tok/s step 15617/19560 | loss 3.248238 (-1.29z)| norm 0.2451 (+0.27z)| lr 8.59e-05 | 4177.44 ms | 32.3% bf16 MFU | 124797 tok/s step 15618/19560 | loss 3.287612 (-0.35z)| norm 0.2364 (-0.46z)| lr 8.58e-05 | 4177.83 ms | 32.3% bf16 MFU | 124832 tok/s step 15619/19560 | loss 3.338250 (+0.85z)| norm 0.2411 (-0.06z)| lr 8.58e-05 | 4175.91 ms | 32.3% bf16 MFU | 124868 tok/s step 15620/19560 | loss 3.264117 (-0.92z)| norm 0.2512 (+0.80z)| lr 8.57e-05 | 4179.59 ms | 32.3% bf16 MFU | 124896 tok/s step 15621/19560 | loss 3.258422 (-1.03z)| norm 0.2284 (-1.11z)| lr 8.57e-05 | 4175.12 ms | 32.3% bf16 MFU | 124930 tok/s step 15622/19560 | loss 3.343167 (+0.97z)| norm 0.2459 (+0.37z)| lr 8.57e-05 | 4165.04 ms | 32.4% bf16 MFU | 124978 tok/s step 15623/19560 | loss 3.263762 (-0.90z)| norm 0.2424 (+0.07z)| lr 8.56e-05 | 4167.08 ms | 32.4% bf16 MFU | 125020 tok/s step 15624/19560 | loss 3.271919 (-0.71z)| norm 0.2289 (-1.07z)| lr 8.56e-05 | 4275.86 ms | 31.6% bf16 MFU | 124899 tok/s step 15625/19560 | loss 3.296707 (-0.13z)| norm 0.2548 (+1.13z)| lr 8.55e-05 | 4175.34 ms | 32.3% bf16 MFU | 124933 tok/s step 15626/19560 | loss 3.300560 (-0.04z)| norm 0.2503 (+0.74z)| lr 8.55e-05 | 4181.52 ms | 32.3% bf16 MFU | 124955 tok/s step 15627/19560 | loss 3.344445 (+0.99z)| norm 0.2476 (+0.50z)| lr 8.55e-05 | 4190.57 ms | 32.2% bf16 MFU | 124963 tok/s step 15628/19560 | loss 3.320202 (+0.41z)| norm 0.2641 (+1.86z)| lr 8.54e-05 | 4256.65 ms | 31.7% bf16 MFU | 124873 tok/s step 15629/19560 | loss 3.257313 (-1.06z)| norm 0.2475 (+0.47z)| lr 8.54e-05 | 4188.04 ms | 32.2% bf16 MFU | 124889 tok/s step 15630/19560 | loss 3.200910 (-2.36z)| norm 
0.2540 (+1.01z)| lr 8.53e-05 | 4168.86 ms | 32.4% bf16 MFU | 124933 tok/s step 15631/19560 | loss 3.274546 (-0.65z)| norm 0.2436 (+0.13z)| lr 8.53e-05 | 4178.44 ms | 32.3% bf16 MFU | 124960 tok/s step 15632/19560 | loss 3.262390 (-0.93z)| norm 0.2346 (-0.63z)| lr 8.52e-05 | 4174.29 ms | 32.3% bf16 MFU | 124992 tok/s step 15633/19560 | loss 3.289789 (-0.28z)| norm 0.2530 (+0.91z)| lr 8.52e-05 | 4166.92 ms | 32.4% bf16 MFU | 125033 tok/s step 15634/19560 | loss 3.252207 (-1.15z)| norm 0.2362 (-0.50z)| lr 8.52e-05 | 4174.57 ms | 32.3% bf16 MFU | 125061 tok/s step 15635/19560 | loss 3.305851 (+0.09z)| norm 0.2519 (+0.84z)| lr 8.51e-05 | 4167.96 ms | 32.4% bf16 MFU | 125098 tok/s step 15636/19560 | loss 3.293547 (-0.20z)| norm 0.2411 (-0.08z)| lr 8.51e-05 | 4181.97 ms | 32.3% bf16 MFU | 125111 tok/s step 15637/19560 | loss 3.291147 (-0.27z)| norm 0.2360 (-0.50z)| lr 8.50e-05 | 4176.72 ms | 32.3% bf16 MFU | 125132 tok/s step 15638/19560 | loss 3.235278 (-1.56z)| norm 0.2421 (+0.04z)| lr 8.50e-05 | 4170.81 ms | 32.4% bf16 MFU | 125161 tok/s step 15639/19560 | loss 3.323556 (+0.50z)| norm 0.2435 (+0.14z)| lr 8.50e-05 | 4175.71 ms | 32.3% bf16 MFU | 125180 tok/s step 15640/19560 | loss 3.316541 (+0.33z)| norm 0.2432 (+0.13z)| lr 8.49e-05 | 4183.88 ms | 32.3% bf16 MFU | 125187 tok/s step 15641/19560 | loss 3.246846 (-1.29z)| norm 0.2448 (+0.26z)| lr 8.49e-05 | 4192.80 ms | 32.2% bf16 MFU | 125180 tok/s step 15642/19560 | loss 3.311768 (+0.23z)| norm 0.2510 (+0.79z)| lr 8.48e-05 | 4175.96 ms | 32.3% bf16 MFU | 125198 tok/s step 15643/19560 | loss 3.290410 (-0.28z)| norm 0.2276 (-1.22z)| lr 8.48e-05 | 4178.77 ms | 32.3% bf16 MFU | 125212 tok/s step 15644/19560 | loss 3.260159 (-0.99z)| norm 0.2337 (-0.69z)| lr 8.47e-05 | 4170.07 ms | 32.4% bf16 MFU | 125237 tok/s step 15645/19560 | loss 3.281260 (-0.49z)| norm 0.2452 (+0.29z)| lr 8.47e-05 | 4218.03 ms | 32.0% bf16 MFU | 125190 tok/s step 15646/19560 | loss 3.326205 (+0.58z)| norm 0.2441 (+0.25z)| lr 8.47e-05 | 4254.73 ms | 
31.7% bf16 MFU | 125092 tok/s step 15647/19560 | loss 3.380511 (+1.83z)| norm 0.2400 (-0.13z)| lr 8.46e-05 | 4172.32 ms | 32.4% bf16 MFU | 125120 tok/s step 15648/19560 | loss 3.303178 (+0.03z)| norm 0.2444 (+0.29z)| lr 8.46e-05 | 4254.52 ms | 31.7% bf16 MFU | 125026 tok/s step 15649/19560 | loss 3.431086 (+2.89z)| norm 0.2305 (-1.01z)| lr 8.45e-05 | 4196.79 ms | 32.2% bf16 MFU | 125021 tok/s step 15650/19560 | loss 3.321917 (+0.42z)| norm 0.2431 (+0.19z)| lr 8.45e-05 | 4176.81 ms | 32.3% bf16 MFU | 125046 tok/s step 15651/19560 | loss 3.272812 (-0.69z)| norm 0.2445 (+0.31z)| lr 8.45e-05 | 4167.83 ms | 32.4% bf16 MFU | 125083 tok/s step 15652/19560 | loss 3.297078 (-0.14z)| norm 0.2410 (-0.02z)| lr 8.44e-05 | 4172.07 ms | 32.4% bf16 MFU | 125113 tok/s step 15653/19560 | loss 3.345759 (+0.95z)| norm 0.2264 (-1.39z)| lr 8.44e-05 | 4182.22 ms | 32.3% bf16 MFU | 125125 tok/s step 15654/19560 | loss 3.288273 (-0.33z)| norm 0.2384 (-0.24z)| lr 8.43e-05 | 4176.87 ms | 32.3% bf16 MFU | 125145 tok/s step 15655/19560 | loss 3.335989 (+0.73z)| norm 0.2364 (-0.43z)| lr 8.43e-05 | 4181.75 ms | 32.3% bf16 MFU | 125156 tok/s step 15656/19560 | loss 3.237117 (-1.50z)| norm 0.2395 (-0.13z)| lr 8.42e-05 | 4283.31 ms | 31.5% bf16 MFU | 125019 tok/s step 15657/19560 | loss 3.491511 (+3.96z)| norm 0.2461 (+0.49z)| lr 8.42e-05 | 4172.08 ms | 32.4% bf16 MFU | 125051 tok/s step 15658/19560 | loss 3.324964 (+0.41z)| norm 0.2414 (+0.04z)| lr 8.42e-05 | 4181.49 ms | 32.3% bf16 MFU | 125068 tok/s step 15659/19560 | loss 3.425342 (+2.47z)| norm 0.2604 (+1.82z)| lr 8.41e-05 | 4171.47 ms | 32.4% bf16 MFU | 125098 tok/s step 15660/19560 | loss 3.337755 (+0.64z)| norm 0.2338 (-0.69z)| lr 8.41e-05 | 4174.28 ms | 32.3% bf16 MFU | 125123 tok/s step 15661/19560 | loss 3.280821 (-0.55z)| norm 0.2353 (-0.53z)| lr 8.40e-05 | 4178.32 ms | 32.3% bf16 MFU | 125141 tok/s step 15662/19560 | loss 3.273708 (-0.70z)| norm 0.2328 (-0.76z)| lr 8.40e-05 | 4173.54 ms | 32.4% bf16 MFU | 125165 tok/s step 15663/19560 
| loss 3.277646 (-0.62z)| norm 0.2581 (+1.68z)| lr 8.40e-05 | 4173.38 ms | 32.4% bf16 MFU | 125188 tok/s step 15664/19560 | loss 3.294825 (-0.27z)| norm 0.2324 (-0.80z)| lr 8.39e-05 | 4176.73 ms | 32.3% bf16 MFU | 125205 tok/s step 15665/19560 | loss 3.292017 (-0.33z)| norm 0.2374 (-0.32z)| lr 8.39e-05 | 4179.29 ms | 32.3% bf16 MFU | 125217 tok/s step 15666/19560 | loss 3.289208 (-0.38z)| norm 0.2491 (+0.83z)| lr 8.38e-05 | 4179.35 ms | 32.3% bf16 MFU | 125229 tok/s step 15667/19560 | loss 3.227900 (-1.66z)| norm 0.2432 (+0.25z)| lr 8.38e-05 | 4179.73 ms | 32.3% bf16 MFU | 125239 tok/s step 15668/19560 | loss 3.241358 (-1.37z)| norm 0.2394 (-0.12z)| lr 8.37e-05 | 4181.80 ms | 32.3% bf16 MFU | 125246 tok/s step 15669/19560 | loss 3.352710 (+0.95z)| norm 0.2416 (+0.09z)| lr 8.37e-05 | 4249.03 ms | 31.8% bf16 MFU | 125153 tok/s step 15670/19560 | loss 3.224331 (-1.69z)| norm 0.2375 (-0.29z)| lr 8.37e-05 | 4170.36 ms | 32.4% bf16 MFU | 125181 tok/s step 15671/19560 | loss 3.307544 (+0.02z)| norm 0.2232 (-1.66z)| lr 8.36e-05 | 4188.31 ms | 32.2% bf16 MFU | 125181 tok/s step 15672/19560 | loss 3.358123 (+1.04z)| norm 0.2380 (-0.23z)| lr 8.36e-05 | 4372.41 ms | 30.9% bf16 MFU | 124918 tok/s step 15673/19560 | loss 3.382747 (+1.52z)| norm 0.2421 (+0.17z)| lr 8.35e-05 | 4169.75 ms | 32.4% bf16 MFU | 124959 tok/s step 15674/19560 | loss 3.295938 (-0.24z)| norm 0.2229 (-1.68z)| lr 8.35e-05 | 4184.26 ms | 32.3% bf16 MFU | 124976 tok/s step 15675/19560 | loss 3.262680 (-0.93z)| norm 0.2290 (-1.08z)| lr 8.35e-05 | 4175.60 ms | 32.3% bf16 MFU | 125005 tok/s step 15676/19560 | loss 3.297442 (-0.21z)| norm 0.2431 (+0.26z)| lr 8.34e-05 | 4171.75 ms | 32.4% bf16 MFU | 125038 tok/s step 15677/19560 | loss 3.237468 (-1.42z)| norm 0.2342 (-0.59z)| lr 8.34e-05 | 4183.30 ms | 32.3% bf16 MFU | 125053 tok/s step 15678/19560 | loss 3.320270 (+0.26z)| norm 0.2527 (+1.34z)| lr 8.33e-05 | 4184.98 ms | 32.3% bf16 MFU | 125064 tok/s step 15679/19560 | loss 3.287293 (-0.41z)| norm 0.2363 (-0.39z)| 
lr 8.33e-05 | 4187.43 ms | 32.2% bf16 MFU | 125071 tok/s step 15680/19560 | loss 3.280418 (-0.55z)| norm 0.2604 (+2.21z)| lr 8.33e-05 | 4173.51 ms | 32.4% bf16 MFU | 125099 tok/s step 15681/19560 | loss 3.281591 (-0.51z)| norm 0.2357 (-0.44z)| lr 8.32e-05 | 4209.00 ms | 32.1% bf16 MFU | 125072 tok/s step 15682/19560 | loss 3.322759 (+0.33z)| norm 0.2484 (+0.92z)| lr 8.32e-05 | 4177.74 ms | 32.3% bf16 MFU | 125093 tok/s step 15683/19560 | loss 3.348725 (+0.86z)| norm 0.2351 (-0.50z)| lr 8.31e-05 | 4174.86 ms | 32.3% bf16 MFU | 125118 tok/s step 15684/19560 | loss 3.235309 (-1.43z)| norm 0.2600 (+2.13z)| lr 8.31e-05 | 4168.81 ms | 32.4% bf16 MFU | 125150 tok/s step 15685/19560 | loss 3.299025 (-0.15z)| norm 0.2484 (+0.89z)| lr 8.30e-05 | 4178.09 ms | 32.3% bf16 MFU | 125167 tok/s step 15686/19560 | loss 3.285841 (-0.43z)| norm 0.2372 (-0.29z)| lr 8.30e-05 | 4174.76 ms | 32.3% bf16 MFU | 125188 tok/s step 15687/19560 | loss 3.338669 (+0.65z)| norm 0.2401 (+0.02z)| lr 8.30e-05 | 4194.86 ms | 32.2% bf16 MFU | 125177 tok/s step 15688/19560 | loss 3.294050 (-0.26z)| norm 0.2432 (+0.35z)| lr 8.29e-05 | 4183.63 ms | 32.3% bf16 MFU | 125184 tok/s step 15689/19560 | loss 3.332943 (+0.54z)| norm 0.2314 (-0.94z)| lr 8.29e-05 | 4165.04 ms | 32.4% bf16 MFU | 125219 tok/s step 15690/19560 | loss 3.338842 (+0.65z)| norm 0.2479 (+0.86z)| lr 8.28e-05 | 4180.31 ms | 32.3% bf16 MFU | 125229 tok/s step 15691/19560 | loss 3.265965 (-0.84z)| norm 0.2367 (-0.36z)| lr 8.28e-05 | 4174.92 ms | 32.3% bf16 MFU | 125247 tok/s step 15692/19560 | loss 3.320413 (+0.29z)| norm 0.2437 (+0.41z)| lr 8.28e-05 | 4168.59 ms | 32.4% bf16 MFU | 125273 tok/s step 15693/19560 | loss 3.285302 (-0.43z)| norm 0.2360 (-0.42z)| lr 8.27e-05 | 4173.70 ms | 32.3% bf16 MFU | 125290 tok/s step 15694/19560 | loss 3.301191 (-0.08z)| norm 0.2603 (+2.17z)| lr 8.27e-05 | 4168.70 ms | 32.4% bf16 MFU | 125314 tok/s step 15695/19560 | loss 3.264177 (-0.86z)| norm 0.2397 (-0.04z)| lr 8.26e-05 | 4180.15 ms | 32.3% bf16 MFU | 
125319 tok/s step 15696/19560 | loss 3.336476 (+0.67z)| norm 0.2454 (+0.56z)| lr 8.26e-05 | 4177.02 ms | 32.3% bf16 MFU | 125329 tok/s step 15697/19560 | loss 3.228235 (-1.59z)| norm 0.2404 (+0.04z)| lr 8.25e-05 | 4169.98 ms | 32.4% bf16 MFU | 125349 tok/s step 15698/19560 | loss 3.312396 (+0.18z)| norm 0.2478 (+0.82z)| lr 8.25e-05 | 4181.28 ms | 32.3% bf16 MFU | 125351 tok/s step 15699/19560 | loss 3.255915 (-1.00z)| norm 0.2485 (+0.89z)| lr 8.25e-05 | 4174.51 ms | 32.3% bf16 MFU | 125363 tok/s step 15700/19560 | loss 3.333253 (+0.62z)| norm 0.2489 (+0.93z)| lr 8.24e-05 | 4168.39 ms | 32.4% bf16 MFU | 125384 tok/s step 15701/19560 | loss 3.347349 (+0.90z)| norm 0.2552 (+1.58z)| lr 8.24e-05 | 4180.00 ms | 32.3% bf16 MFU | 125386 tok/s step 15702/19560 | loss 3.295036 (-0.20z)| norm 0.2490 (+0.89z)| lr 8.23e-05 | 4174.72 ms | 32.3% bf16 MFU | 125396 tok/s step 15703/19560 | loss 3.274648 (-0.62z)| norm 0.2506 (+1.05z)| lr 8.23e-05 | 4167.74 ms | 32.4% bf16 MFU | 125416 tok/s step 15704/19560 | loss 3.308298 (+0.09z)| norm 0.2474 (+0.71z)| lr 8.23e-05 | 4179.42 ms | 32.3% bf16 MFU | 125418 tok/s step 15705/19560 | loss 3.298896 (-0.09z)| norm 0.2568 (+1.70z)| lr 8.22e-05 | 4175.20 ms | 32.3% bf16 MFU | 125425 tok/s step 15706/19560 | loss 3.228768 (-1.56z)| norm 0.2414 (+0.06z)| lr 8.22e-05 | 4174.66 ms | 32.3% bf16 MFU | 125434 tok/s step 15707/19560 | loss 3.311256 (+0.18z)| norm 0.2235 (-1.82z)| lr 8.21e-05 | 4204.65 ms | 32.1% bf16 MFU | 125396 tok/s step 15708/19560 | loss 3.282115 (-0.44z)| norm 0.2553 (+1.50z)| lr 8.21e-05 | 4185.89 ms | 32.3% bf16 MFU | 125389 tok/s step 15709/19560 | loss 3.215910 (-1.80z)| norm 0.2393 (-0.17z)| lr 8.21e-05 | 4177.76 ms | 32.3% bf16 MFU | 125395 tok/s step 15710/19560 | loss 3.253692 (-0.99z)| norm 0.2235 (-1.83z)| lr 8.20e-05 | 4186.32 ms | 32.3% bf16 MFU | 125387 tok/s step 15711/19560 | loss 3.252007 (-1.02z)| norm 0.2354 (-0.58z)| lr 8.20e-05 | 4177.34 ms | 32.3% bf16 MFU | 125393 tok/s step 15712/19560 | loss 3.253180 
(-0.98z)| norm 0.2368 (-0.43z)| lr 8.19e-05 | 4218.81 ms | 32.0% bf16 MFU | 125337 tok/s step 15713/19560 | loss 3.278216 (-0.45z)| norm 0.2261 (-1.53z)| lr 8.19e-05 | 4179.29 ms | 32.3% bf16 MFU | 125342 tok/s step 15714/19560 | loss 3.285708 (-0.29z)| norm 0.2312 (-1.00z)| lr 8.18e-05 | 4180.25 ms | 32.3% bf16 MFU | 125346 tok/s step 15715/19560 | loss 3.282711 (-0.34z)| norm 0.2355 (-0.54z)| lr 8.18e-05 | 4184.43 ms | 32.3% bf16 MFU | 125344 tok/s step 15716/19560 | loss 3.261978 (-0.77z)| norm 0.2367 (-0.42z)| lr 8.18e-05 | 4169.99 ms | 32.4% bf16 MFU | 125363 tok/s step 15717/19560 | loss 3.294249 (-0.10z)| norm 0.2401 (-0.07z)| lr 8.17e-05 | 4174.49 ms | 32.3% bf16 MFU | 125375 tok/s step 15718/19560 | loss 3.401841 (+2.09z)| norm 0.2355 (-0.56z)| lr 8.17e-05 | 4180.59 ms | 32.3% bf16 MFU | 125376 tok/s step 15719/19560 | loss 3.341347 (+0.84z)| norm 0.2206 (-2.07z)| lr 8.16e-05 | 4178.22 ms | 32.3% bf16 MFU | 125382 tok/s step 15720/19560 | loss 3.308192 (+0.17z)| norm 0.2406 (-0.00z)| lr 8.16e-05 | 4161.99 ms | 32.4% bf16 MFU | 125411 tok/s step 15721/19560 | loss 3.291049 (-0.20z)| norm 0.2303 (-1.07z)| lr 8.16e-05 | 4178.30 ms | 32.3% bf16 MFU | 125414 tok/s step 15722/19560 | loss 3.211822 (-1.79z)| norm 0.2678 (+2.71z)| lr 8.15e-05 | 4168.11 ms | 32.4% bf16 MFU | 125433 tok/s step 15723/19560 | loss 3.295166 (-0.10z)| norm 0.2404 (-0.06z)| lr 8.15e-05 | 4176.81 ms | 32.3% bf16 MFU | 125437 tok/s step 15724/19560 | loss 3.326930 (+0.55z)| norm 0.2311 (-0.99z)| lr 8.14e-05 | 4175.90 ms | 32.3% bf16 MFU | 125443 tok/s step 15725/19560 | loss 3.311896 (+0.23z)| norm 0.2380 (-0.29z)| lr 8.14e-05 | 4174.36 ms | 32.3% bf16 MFU | 125451 tok/s step 15726/19560 | loss 3.288031 (-0.24z)| norm 0.2361 (-0.48z)| lr 8.14e-05 | 4180.15 ms | 32.3% bf16 MFU | 125449 tok/s step 15727/19560 | loss 3.284073 (-0.33z)| norm 0.2324 (-0.86z)| lr 8.13e-05 | 4177.03 ms | 32.3% bf16 MFU | 125453 tok/s step 15728/19560 | loss 3.258248 (-0.89z)| norm 0.2380 (-0.28z)| lr 8.13e-05 | 
4179.70 ms | 32.3% bf16 MFU | 125452 tok/s step 15729/19560 | loss 3.321508 (+0.54z)| norm 0.2447 (+0.40z)| lr 8.12e-05 | 4166.12 ms | 32.4% bf16 MFU | 125472 tok/s step 15730/19560 | loss 3.241640 (-1.24z)| norm 0.2282 (-1.34z)| lr 8.12e-05 | 4177.92 ms | 32.3% bf16 MFU | 125473 tok/s step 15731/19560 | loss 3.267086 (-0.67z)| norm 0.2410 (+0.01z)| lr 8.12e-05 | 4180.97 ms | 32.3% bf16 MFU | 125469 tok/s step 15732/19560 | loss 3.305411 (+0.19z)| norm 0.2394 (-0.17z)| lr 8.11e-05 | 4167.98 ms | 32.4% bf16 MFU | 125485 tok/s step 15733/19560 | loss 3.277039 (-0.44z)| norm 0.2248 (-1.70z)| lr 8.11e-05 | 4185.75 ms | 32.3% bf16 MFU | 125473 tok/s step 15734/19560 | loss 3.323203 (+0.60z)| norm 0.2521 (+1.18z)| lr 8.10e-05 | 4249.54 ms | 31.8% bf16 MFU | 125369 tok/s step 15735/19560 | loss 3.281341 (-0.33z)| norm 0.2326 (-0.87z)| lr 8.10e-05 | 4187.37 ms | 32.2% bf16 MFU | 125360 tok/s step 15736/19560 | loss 3.266556 (-0.66z)| norm 0.2328 (-0.85z)| lr 8.09e-05 | 4175.01 ms | 32.3% bf16 MFU | 125371 tok/s step 15737/19560 | loss 3.335367 (+0.88z)| norm 0.2278 (-1.37z)| lr 8.09e-05 | 4168.22 ms | 32.4% bf16 MFU | 125392 tok/s step 15738/19560 | loss 3.300905 (+0.10z)| norm 0.2218 (-1.96z)| lr 8.09e-05 | 4170.85 ms | 32.4% bf16 MFU | 125407 tok/s step 15739/19560 | loss 3.320262 (+0.53z)| norm 0.2320 (-0.89z)| lr 8.08e-05 | 4170.32 ms | 32.4% bf16 MFU | 125423 tok/s step 15740/19560 | loss 3.319929 (+0.51z)| norm 0.2244 (-1.65z)| lr 8.08e-05 | 4251.27 ms | 31.8% bf16 MFU | 125318 tok/s step 15741/19560 | loss 3.347055 (+1.10z)| norm 0.2140 (-2.63z)| lr 8.07e-05 | 4170.34 ms | 32.4% bf16 MFU | 125338 tok/s step 15742/19560 | loss 3.300266 (+0.06z)| norm 0.2219 (-1.85z)| lr 8.07e-05 | 4376.38 ms | 30.9% bf16 MFU | 125061 tok/s step 15743/19560 | loss 3.304329 (+0.16z)| norm 0.2228 (-1.74z)| lr 8.07e-05 | 4753.55 ms | 28.4% bf16 MFU | 124323 tok/s step 15744/19560 | loss 3.332195 (+0.80z)| norm 0.2307 (-0.93z)| lr 8.06e-05 | 4943.58 ms | 27.3% bf16 MFU | 123409 tok/s step 
15745/19560 | loss 3.270973 (-0.60z)| norm 0.2296 (-1.03z)| lr 8.06e-05 | 4552.76 ms | 29.7% bf16 MFU | 122997 tok/s step 15746/19560 | loss 3.242647 (-1.23z)| norm 0.2377 (-0.22z)| lr 8.05e-05 | 4437.44 ms | 30.4% bf16 MFU | 122755 tok/s step 15747/19560 | loss 3.301165 (+0.11z)| norm 0.2304 (-0.94z)| lr 8.05e-05 | 4501.23 ms | 30.0% bf16 MFU | 122441 tok/s step 15748/19560 | loss 3.279421 (-0.39z)| norm 0.2297 (-0.99z)| lr 8.05e-05 | 4366.58 ms | 30.9% bf16 MFU | 122322 tok/s step 15749/19560 | loss 3.304785 (+0.18z)| norm 0.2244 (-1.51z)| lr 8.04e-05 | 4287.45 ms | 31.5% bf16 MFU | 122320 tok/s step 15750/19560 | loss 3.331999 (+0.81z)| norm 0.2120 (-2.64z)| lr 8.04e-05 | 4158.00 ms | 32.5% bf16 MFU | 122509 tok/s val loss 3.274197 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 1250/1256
HellaSwag: 3026/10042 = 0.301334 step 15751/19560 | loss 3.280101 (-0.39z)| norm 0.2139 (-2.38z)| lr 8.03e-05 | 4299.27 ms | 31.4% bf16 MFU | 122481 tok/s step 15752/19560 | loss 3.293622 (-0.08z)| norm 0.2200 (-1.79z)| lr 8.03e-05 | 4373.85 ms | 30.9% bf16 MFU | 122350 tok/s step 15753/19560 | loss 3.333541 (+0.83z)| norm 0.2134 (-2.34z)| lr 8.03e-05 | 4223.31 ms | 32.0% bf16 MFU | 122440 tok/s step 15754/19560 | loss 3.352566 (+1.25z)| norm 0.2117 (-2.42z)| lr 8.02e-05 | 4312.34 ms | 31.3% bf16 MFU | 122397 tok/s step 15755/19560 | loss 3.283821 (-0.31z)| norm 0.2238 (-1.31z)| lr 8.02e-05 | 4202.03 ms | 32.1% bf16 MFU | 122515 tok/s step 15756/19560 | loss 3.270758 (-0.60z)| norm 0.2242 (-1.26z)| lr 8.01e-05 | 4364.96 ms | 30.9% bf16 MFU | 122395 tok/s step 15757/19560 | loss 3.277982 (-0.44z)| norm 0.2320 (-0.54z)| lr 8.01e-05 | 4229.29 ms | 31.9% bf16 MFU | 122474 tok/s step 15758/19560 | loss 3.323318 (+0.59z)| norm 0.2279 (-0.90z)| lr 8.01e-05 | 4317.08 ms | 31.3% bf16 MFU | 122422 tok/s step 15759/19560 | loss 3.298657 (+0.01z)| norm 0.2237 (-1.27z)| lr 8.00e-05 | 4167.73 ms | 32.4% bf16 MFU | 122591 tok/s step 15760/19560 | loss 3.294183 (-0.10z)| norm 0.2223 (-1.37z)| lr 8.00e-05 | 4230.03 ms | 31.9% bf16 MFU | 122659 tok/s step 15761/19560 | loss 3.272206 (-0.61z)| norm 0.2246 (-1.15z)| lr
7.99e-05 | 4179.03 ms | 32.3% bf16 MFU | 122799 tok/s step 15762/19560 | loss 3.282837 (-0.37z)| norm 0.2274 (-0.89z)| lr 7.99e-05 | 4164.50 ms | 32.4% bf16 MFU | 122953 tok/s step 15763/19560 | loss 3.319377 (+0.49z)| norm 0.2263 (-0.97z)| lr 7.98e-05 | 4320.00 ms | 31.3% bf16 MFU | 122874 tok/s step 15764/19560 | loss 3.273817 (-0.58z)| norm 0.2407 (+0.35z)| lr 7.98e-05 | 4235.99 ms | 31.9% bf16 MFU | 122919 tok/s step 15765/19560 | loss 3.304956 (+0.15z)| norm 0.2254 (-1.04z)| lr 7.98e-05 | 4176.18 ms | 32.3% bf16 MFU | 123050 tok/s step 15766/19560 | loss 3.325428 (+0.62z)| norm 0.2491 (+1.11z)| lr 7.97e-05 | 4181.02 ms | 32.3% bf16 MFU | 123167 tok/s step 15767/19560 | loss 3.290618 (-0.20z)| norm 0.2406 (+0.34z)| lr 7.97e-05 | 4188.81 ms | 32.2% bf16 MFU | 123267 tok/s step 15768/19560 | loss 3.298637 (-0.01z)| norm 0.2336 (-0.29z)| lr 7.96e-05 | 4163.78 ms | 32.4% bf16 MFU | 123400 tok/s step 15769/19560 | loss 3.466305 (+3.73z)| norm 0.2762 (+3.40z)| lr 7.96e-05 | 4211.68 ms | 32.1% bf16 MFU | 123454 tok/s step 15770/19560 | loss 3.311209 (+0.24z)| norm 0.2356 (-0.12z)| lr 7.96e-05 | 4173.06 ms | 32.4% bf16 MFU | 123563 tok/s step 15771/19560 | loss 3.220864 (-1.76z)| norm 0.2437 (+0.58z)| lr 7.95e-05 | 4171.83 ms | 32.4% bf16 MFU | 123668 tok/s step 15772/19560 | loss 3.292653 (-0.17z)| norm 0.2405 (+0.30z)| lr 7.95e-05 | 4178.09 ms | 32.3% bf16 MFU | 123759 tok/s step 15773/19560 | loss 3.376893 (+1.67z)| norm 0.2469 (+0.86z)| lr 7.94e-05 | 4259.70 ms | 31.7% bf16 MFU | 123725 tok/s step 15774/19560 | loss 3.308055 (+0.16z)| norm 0.2364 (-0.06z)| lr 7.94e-05 | 4183.68 ms | 32.3% bf16 MFU | 123805 tok/s step 15775/19560 | loss 3.332364 (+0.71z)| norm 0.2396 (+0.23z)| lr 7.94e-05 | 4173.65 ms | 32.3% bf16 MFU | 123896 tok/s step 15776/19560 | loss 3.315863 (+0.34z)| norm 0.2343 (-0.23z)| lr 7.93e-05 | 4304.41 ms | 31.4% bf16 MFU | 123791 tok/s step 15777/19560 | loss 3.364201 (+1.47z)| norm 0.2531 (+1.39z)| lr 7.93e-05 | 4233.89 ms | 31.9% bf16 MFU | 123793 
tok/s step 15778/19560 | loss 3.316555 (+0.38z)| norm 0.2582 (+1.81z)| lr 7.92e-05 | 4208.97 ms | 32.1% bf16 MFU | 123832 tok/s step 15779/19560 | loss 3.441854 (+3.10z)| norm 0.2459 (+0.74z)| lr 7.92e-05 | 4177.49 ms | 32.3% bf16 MFU | 123915 tok/s step 15780/19560 | loss 3.283089 (-0.40z)| norm 0.2414 (+0.36z)| lr 7.92e-05 | 4226.23 ms | 31.9% bf16 MFU | 123922 tok/s step 15781/19560 | loss 3.324910 (+0.52z)| norm 0.2504 (+1.11z)| lr 7.91e-05 | 4174.82 ms | 32.3% bf16 MFU | 124005 tok/s step 15782/19560 | loss 3.288903 (-0.27z)| norm 0.2421 (+0.40z)| lr 7.91e-05 | 4170.26 ms | 32.4% bf16 MFU | 124091 tok/s step 15783/19560 | loss 3.288996 (-0.26z)| norm 0.2435 (+0.51z)| lr 7.90e-05 | 4189.35 ms | 32.2% bf16 MFU | 124144 tok/s step 15784/19560 | loss 3.280964 (-0.45z)| norm 0.2306 (-0.59z)| lr 7.90e-05 | 4200.19 ms | 32.1% bf16 MFU | 124178 tok/s step 15785/19560 | loss 3.246268 (-1.27z)| norm 0.2436 (+0.53z)| lr 7.90e-05 | 4176.10 ms | 32.3% bf16 MFU | 124246 tok/s step 15786/19560 | loss 3.271901 (-0.65z)| norm 0.2478 (+0.88z)| lr 7.89e-05 | 4171.00 ms | 32.4% bf16 MFU | 124319 tok/s step 15787/19560 | loss 3.231706 (-1.61z)| norm 0.2463 (+0.77z)| lr 7.89e-05 | 4160.43 ms | 32.5% bf16 MFU | 124404 tok/s step 15788/19560 | loss 3.267045 (-0.73z)| norm 0.2494 (+1.03z)| lr 7.88e-05 | 4234.82 ms | 31.9% bf16 MFU | 124374 tok/s step 15789/19560 | loss 3.347627 (+1.23z)| norm 0.2443 (+0.58z)| lr 7.88e-05 | 4173.07 ms | 32.4% bf16 MFU | 124437 tok/s step 15790/19560 | loss 3.353501 (+1.35z)| norm 0.2328 (-0.41z)| lr 7.88e-05 | 4172.48 ms | 32.4% bf16 MFU | 124498 tok/s step 15791/19560 | loss 3.351500 (+1.28z)| norm 0.2362 (-0.10z)| lr 7.87e-05 | 4167.36 ms | 32.4% bf16 MFU | 124563 tok/s step 15792/19560 | loss 3.234845 (-1.52z)| norm 0.2317 (-0.50z)| lr 7.87e-05 | 4166.44 ms | 32.4% bf16 MFU | 124627 tok/s step 15793/19560 | loss 3.292965 (-0.12z)| norm 0.2257 (-1.01z)| lr 7.86e-05 | 4186.74 ms | 32.2% bf16 MFU | 124657 tok/s step 15794/19560 | loss 3.272716 
(-0.60z)| norm 0.2238 (-1.16z)| lr 7.86e-05 | 4161.04 ms | 32.4% bf16 MFU | 124724 tok/s step 15795/19560 | loss 3.315057 (+0.40z)| norm 0.2575 (+1.74z)| lr 7.86e-05 | 4180.10 ms | 32.3% bf16 MFU | 124759 tok/s step 15796/19560 | loss 3.248956 (-1.21z)| norm 0.2436 (+0.55z)| lr 7.85e-05 | 4173.89 ms | 32.3% bf16 MFU | 124802 tok/s step 15797/19560 | loss 3.296039 (-0.05z)| norm 0.2478 (+0.90z)| lr 7.85e-05 | 4179.89 ms | 32.3% bf16 MFU | 124833 tok/s step 15798/19560 | loss 3.254609 (-1.08z)| norm 0.2352 (-0.17z)| lr 7.84e-05 | 4175.21 ms | 32.3% bf16 MFU | 124870 tok/s step 15799/19560 | loss 3.252699 (-1.11z)| norm 0.2403 (+0.25z)| lr 7.84e-05 | 4181.34 ms | 32.3% bf16 MFU | 124896 tok/s step 15800/19560 | loss 3.293319 (-0.10z)| norm 0.2545 (+1.45z)| lr 7.84e-05 | 4180.79 ms | 32.3% bf16 MFU | 124921 tok/s step 15801/19560 | loss 3.275565 (-0.53z)| norm 0.2332 (-0.36z)| lr 7.83e-05 | 4175.84 ms | 32.3% bf16 MFU | 124953 tok/s step 15802/19560 | loss 3.229674 (-1.65z)| norm 0.2557 (+1.53z)| lr 7.83e-05 | 4204.07 ms | 32.1% bf16 MFU | 124941 tok/s step 15803/19560 | loss 3.308977 (+0.31z)| norm 0.2467 (+0.75z)| lr 7.82e-05 | 4175.07 ms | 32.3% bf16 MFU | 124972 tok/s step 15804/19560 | loss 3.292552 (-0.10z)| norm 0.2404 (+0.22z)| lr 7.82e-05 | 4174.42 ms | 32.3% bf16 MFU | 125004 tok/s step 15805/19560 | loss 3.296542 (-0.01z)| norm 0.2510 (+1.10z)| lr 7.82e-05 | 4158.50 ms | 32.5% bf16 MFU | 125057 tok/s step 15806/19560 | loss 3.289590 (-0.18z)| norm 0.2350 (-0.24z)| lr 7.81e-05 | 4181.78 ms | 32.3% bf16 MFU | 125073 tok/s step 15807/19560 | loss 3.240228 (-1.40z)| norm 0.2532 (+1.29z)| lr 7.81e-05 | 4165.78 ms | 32.4% bf16 MFU | 125112 tok/s step 15808/19560 | loss 3.276471 (-0.50z)| norm 0.2367 (-0.09z)| lr 7.80e-05 | 4171.08 ms | 32.4% bf16 MFU | 125141 tok/s step 15809/19560 | loss 3.269695 (-0.66z)| norm 0.2489 (+0.95z)| lr 7.80e-05 | 4163.56 ms | 32.4% bf16 MFU | 125180 tok/s step 15810/19560 | loss 3.294155 (-0.05z)| norm 0.2435 (+0.49z)| lr 7.80e-05 | 
4166.95 ms | 32.4% bf16 MFU | 125212 tok/s step 15811/19560 | loss 3.314947 (+0.48z)| norm 0.2495 (+0.98z)| lr 7.79e-05 | 4172.00 ms | 32.4% bf16 MFU | 125235 tok/s step 15812/19560 | loss 3.321292 (+0.63z)| norm 0.2469 (+0.78z)| lr 7.79e-05 | 4223.11 ms | 32.0% bf16 MFU | 125181 tok/s step 15813/19560 | loss 3.355357 (+1.47z)| norm 0.2563 (+1.58z)| lr 7.78e-05 | 4171.94 ms | 32.4% bf16 MFU | 125205 tok/s step 15814/19560 | loss 3.244221 (-1.31z)| norm 0.2491 (+0.95z)| lr 7.78e-05 | 4169.28 ms | 32.4% bf16 MFU | 125233 tok/s step 15815/19560 | loss 3.306017 (+0.24z)| norm 0.2537 (+1.33z)| lr 7.78e-05 | 4180.55 ms | 32.3% bf16 MFU | 125242 tok/s step 15816/19560 | loss 3.316373 (+0.50z)| norm 0.2403 (+0.18z)| lr 7.77e-05 | 4167.56 ms | 32.4% bf16 MFU | 125270 tok/s step 15817/19560 | loss 3.333731 (+0.93z)| norm 0.2378 (-0.03z)| lr 7.77e-05 | 4183.24 ms | 32.3% bf16 MFU | 125273 tok/s step 15818/19560 | loss 3.302032 (+0.15z)| norm 0.2418 (+0.32z)| lr 7.76e-05 | 4172.86 ms | 32.4% bf16 MFU | 125291 tok/s step 15819/19560 | loss 3.287739 (-0.22z)| norm 0.2265 (-0.98z)| lr 7.76e-05 | 4175.35 ms | 32.3% bf16 MFU | 125305 tok/s step 15820/19560 | loss 3.313897 (+0.44z)| norm 0.2355 (-0.21z)| lr 7.76e-05 | 4175.66 ms | 32.3% bf16 MFU | 125318 tok/s step 15821/19560 | loss 3.269244 (-0.68z)| norm 0.2324 (-0.47z)| lr 7.75e-05 | 4175.74 ms | 32.3% bf16 MFU | 125329 tok/s step 15822/19560 | loss 3.371019 (+1.85z)| norm 0.2554 (+1.50z)| lr 7.75e-05 | 4163.02 ms | 32.4% bf16 MFU | 125360 tok/s step 15823/19560 | loss 3.279055 (-0.44z)| norm 0.2292 (-0.73z)| lr 7.74e-05 | 4175.13 ms | 32.3% bf16 MFU | 125371 tok/s step 15824/19560 | loss 3.308440 (+0.30z)| norm 0.2415 (+0.32z)| lr 7.74e-05 | 4167.30 ms | 32.4% bf16 MFU | 125393 tok/s step 15825/19560 | loss 3.322393 (+0.63z)| norm 0.2369 (-0.07z)| lr 7.74e-05 | 4167.26 ms | 32.4% bf16 MFU | 125414 tok/s step 15826/19560 | loss 3.297801 (+0.01z)| norm 0.2331 (-0.39z)| lr 7.73e-05 | 4178.27 ms | 32.3% bf16 MFU | 125417 tok/s step 
15827/19560 | loss 3.395116 (+2.40z)| norm 0.2495 (+1.01z)| lr 7.73e-05 | 4174.30 ms | 32.3% bf16 MFU | 125426 tok/s step 15828/19560 | loss 3.297892 (-0.00z)| norm 0.2396 (+0.18z)| lr 7.72e-05 | 4169.55 ms | 32.4% bf16 MFU | 125442 tok/s step 15829/19560 | loss 3.261275 (-0.90z)| norm 0.2482 (+0.93z)| lr 7.72e-05 | 4185.77 ms | 32.3% bf16 MFU | 125432 tok/s step 15830/19560 | loss 3.279271 (-0.45z)| norm 0.2320 (-0.46z)| lr 7.72e-05 | 4184.62 ms | 32.3% bf16 MFU | 125425 tok/s step 15831/19560 | loss 3.305387 (+0.20z)| norm 0.2413 (+0.35z)| lr 7.71e-05 | 4210.04 ms | 32.1% bf16 MFU | 125381 tok/s step 15832/19560 | loss 3.286383 (-0.27z)| norm 0.2399 (+0.23z)| lr 7.71e-05 | 4171.64 ms | 32.4% bf16 MFU | 125396 tok/s step 15833/19560 | loss 3.278748 (-0.46z)| norm 0.2563 (+1.66z)| lr 7.70e-05 | 4179.58 ms | 32.3% bf16 MFU | 125398 tok/s step 15834/19560 | loss 3.313247 (+0.39z)| norm 0.2418 (+0.39z)| lr 7.70e-05 | 4176.94 ms | 32.3% bf16 MFU | 125404 tok/s step 15835/19560 | loss 3.258640 (-0.97z)| norm 0.2470 (+0.84z)| lr 7.70e-05 | 4239.09 ms | 31.9% bf16 MFU | 125318 tok/s step 15836/19560 | loss 3.287678 (-0.25z)| norm 0.2400 (+0.24z)| lr 7.69e-05 | 4176.91 ms | 32.3% bf16 MFU | 125328 tok/s step 15837/19560 | loss 3.285537 (-0.32z)| norm 0.2530 (+1.37z)| lr 7.69e-05 | 4180.30 ms | 32.3% bf16 MFU | 125332 tok/s step 15838/19560 | loss 3.296725 (-0.04z)| norm 0.2354 (-0.18z)| lr 7.68e-05 | 4182.52 ms | 32.3% bf16 MFU | 125333 tok/s step 15839/19560 | loss 3.323947 (+0.65z)| norm 0.2405 (+0.26z)| lr 7.68e-05 | 4175.46 ms | 32.3% bf16 MFU | 125345 tok/s step 15840/19560 | loss 3.315469 (+0.42z)| norm 0.2484 (+0.95z)| lr 7.68e-05 | 4177.51 ms | 32.3% bf16 MFU | 125353 tok/s step 15841/19560 | loss 3.323588 (+0.62z)| norm 0.2611 (+2.02z)| lr 7.67e-05 | 4192.90 ms | 32.2% bf16 MFU | 125337 tok/s step 15842/19560 | loss 3.311263 (+0.29z)| norm 0.2434 (+0.47z)| lr 7.67e-05 | 4184.72 ms | 32.3% bf16 MFU | 125335 tok/s step 15843/19560 | loss 3.387652 (+2.21z)| norm 
0.2558 (+1.52z)| lr 7.66e-05 | 4744.24 ms | 28.5% bf16 MFU | 124593 tok/s step 15844/19560 | loss 3.248014 (-1.33z)| norm 0.2498 (+0.99z)| lr 7.66e-05 | 4264.77 ms | 31.7% bf16 MFU | 124511 tok/s step 15845/19560 | loss 3.294974 (-0.14z)| norm 0.2602 (+1.85z)| lr 7.66e-05 | 4293.10 ms | 31.4% bf16 MFU | 124391 tok/s step 15846/19560 | loss 3.297337 (-0.06z)| norm 0.2683 (+2.45z)| lr 7.65e-05 | 4242.67 ms | 31.8% bf16 MFU | 124350 tok/s step 15847/19560 | loss 3.321815 (+0.58z)| norm 0.2615 (+1.86z)| lr 7.65e-05 | 4190.55 ms | 32.2% bf16 MFU | 124388 tok/s step 15848/19560 | loss 3.311697 (+0.31z)| norm 0.2334 (-0.46z)| lr 7.64e-05 | 4170.56 ms | 32.4% bf16 MFU | 124455 tok/s step 15849/19560 | loss 3.295190 (-0.12z)| norm 0.2703 (+2.50z)| lr 7.64e-05 | 4174.98 ms | 32.3% bf16 MFU | 124511 tok/s step 15850/19560 | loss 3.329389 (+0.76z)| norm 0.2682 (+2.33z)| lr 7.64e-05 | 4273.85 ms | 31.6% bf16 MFU | 124419 tok/s step 15851/19560 | loss 3.370554 (+1.82z)| norm 0.2369 (-0.19z)| lr 7.63e-05 | 4200.27 ms | 32.1% bf16 MFU | 124439 tok/s step 15852/19560 | loss 3.316808 (+0.41z)| norm 0.2398 (+0.04z)| lr 7.63e-05 | 4271.86 ms | 31.6% bf16 MFU | 124354 tok/s step 15853/19560 | loss 3.344792 (+1.13z)| norm 0.2575 (+1.44z)| lr 7.62e-05 | 4222.91 ms | 32.0% bf16 MFU | 124344 tok/s step 15854/19560 | loss 3.254422 (-1.21z)| norm 0.2248 (-1.16z)| lr 7.62e-05 | 4168.46 ms | 32.4% bf16 MFU | 124415 tok/s step 15855/19560 | loss 3.266345 (-0.90z)| norm 0.2282 (-0.89z)| lr 7.62e-05 | 4211.22 ms | 32.1% bf16 MFU | 124419 tok/s step 15856/19560 | loss 3.287925 (-0.35z)| norm 0.2401 (+0.06z)| lr 7.61e-05 | 4218.76 ms | 32.0% bf16 MFU | 124412 tok/s step 15857/19560 | loss 3.357617 (+1.45z)| norm 0.2323 (-0.55z)| lr 7.61e-05 | 4210.20 ms | 32.1% bf16 MFU | 124418 tok/s step 15858/19560 | loss 3.289533 (-0.32z)| norm 0.2471 (+0.61z)| lr 7.60e-05 | 4208.17 ms | 32.1% bf16 MFU | 124426 tok/s step 15859/19560 | loss 3.269589 (-0.84z)| norm 0.2307 (-0.68z)| lr 7.60e-05 | 4185.79 ms | 
32.3% bf16 MFU | 124468 tok/s step 15860/19560 | loss 3.207242 (-2.39z)| norm 0.2325 (-0.53z)| lr 7.60e-05 | 4211.25 ms | 32.1% bf16 MFU | 124469 tok/s step 15861/19560 | loss 3.257387 (-1.11z)| norm 0.2478 (+0.67z)| lr 7.59e-05 | 4238.63 ms | 31.9% bf16 MFU | 124430 tok/s step 15862/19560 | loss 3.290472 (-0.26z)| norm 0.2372 (-0.17z)| lr 7.59e-05 | 4179.06 ms | 32.3% bf16 MFU | 124482 tok/s step 15863/19560 | loss 3.280821 (-0.51z)| norm 0.2314 (-0.63z)| lr 7.58e-05 | 4165.89 ms | 32.4% bf16 MFU | 124550 tok/s step 15864/19560 | loss 3.272484 (-0.72z)| norm 0.2315 (-0.62z)| lr 7.58e-05 | 4170.96 ms | 32.4% bf16 MFU | 124608 tok/s step 15865/19560 | loss 3.259814 (-1.03z)| norm 0.2348 (-0.36z)| lr 7.58e-05 | 4222.12 ms | 32.0% bf16 MFU | 124586 tok/s step 15866/19560 | loss 3.355542 (+1.39z)| norm 0.2389 (-0.05z)| lr 7.57e-05 | 4175.25 ms | 32.3% bf16 MFU | 124635 tok/s step 15867/19560 | loss 3.236212 (-1.59z)| norm 0.2368 (-0.22z)| lr 7.57e-05 | 4170.99 ms | 32.4% bf16 MFU | 124689 tok/s step 15868/19560 | loss 3.244797 (-1.36z)| norm 0.2368 (-0.23z)| lr 7.56e-05 | 4208.54 ms | 32.1% bf16 MFU | 124683 tok/s step 15869/19560 | loss 3.350173 (+1.26z)| norm 0.2493 (+0.77z)| lr 7.56e-05 | 4182.87 ms | 32.3% bf16 MFU | 124716 tok/s step 15870/19560 | loss 3.285890 (-0.33z)| norm 0.2311 (-0.73z)| lr 7.56e-05 | 4170.40 ms | 32.4% bf16 MFU | 124766 tok/s step 15871/19560 | loss 3.319149 (+0.49z)| norm 0.2344 (-0.48z)| lr 7.55e-05 | 4187.95 ms | 32.2% bf16 MFU | 124787 tok/s step 15872/19560 | loss 3.263889 (-0.87z)| norm 0.2413 (+0.10z)| lr 7.55e-05 | 4181.04 ms | 32.3% bf16 MFU | 124818 tok/s step 15873/19560 | loss 3.280054 (-0.47z)| norm 0.2427 (+0.21z)| lr 7.54e-05 | 4200.38 ms | 32.1% bf16 MFU | 124818 tok/s step 15874/19560 | loss 3.275612 (-0.59z)| norm 0.2308 (-0.79z)| lr 7.54e-05 | 4179.28 ms | 32.3% bf16 MFU | 124849 tok/s step 15875/19560 | loss 3.262075 (-0.92z)| norm 0.2318 (-0.70z)| lr 7.54e-05 | 4177.34 ms | 32.3% bf16 MFU | 124882 tok/s step 15876/19560 
| loss 3.338912 (+0.98z)| norm 0.2499 (+0.80z)| lr 7.53e-05 | 4181.10 ms | 32.3% bf16 MFU | 124908 tok/s step 15877/19560 | loss 3.287941 (-0.28z)| norm 0.2274 (-1.09z)| lr 7.53e-05 | 4179.57 ms | 32.3% bf16 MFU | 124934 tok/s step 15878/19560 | loss 3.340795 (+1.03z)| norm 0.2217 (-1.60z)| lr 7.52e-05 | 4183.49 ms | 32.3% bf16 MFU | 124954 tok/s step 15879/19560 | loss 3.312706 (+0.33z)| norm 0.2383 (-0.20z)| lr 7.52e-05 | 4173.65 ms | 32.3% bf16 MFU | 124987 tok/s step 15880/19560 | loss 3.250044 (-1.21z)| norm 0.2332 (-0.66z)| lr 7.52e-05 | 4174.95 ms | 32.3% bf16 MFU | 125017 tok/s step 15881/19560 | loss 3.234916 (-1.55z)| norm 0.2168 (-2.12z)| lr 7.51e-05 | 4173.83 ms | 32.3% bf16 MFU | 125047 tok/s step 15882/19560 | loss 3.344059 (+1.12z)| norm 0.2383 (-0.24z)| lr 7.51e-05 | 4208.56 ms | 32.1% bf16 MFU | 125023 tok/s step 15883/19560 | loss 3.258261 (-0.98z)| norm 0.2448 (+0.33z)| lr 7.50e-05 | 4179.29 ms | 32.3% bf16 MFU | 125044 tok/s step 15884/19560 | loss 3.273048 (-0.61z)| norm 0.2276 (-1.25z)| lr 7.50e-05 | 4178.74 ms | 32.3% bf16 MFU | 125065 tok/s step 15885/19560 | loss 3.318000 (+0.48z)| norm 0.2397 (-0.15z)| lr 7.50e-05 | 4177.01 ms | 32.3% bf16 MFU | 125088 tok/s step 15886/19560 | loss 3.282816 (-0.38z)| norm 0.2435 (+0.20z)| lr 7.49e-05 | 4195.70 ms | 32.2% bf16 MFU | 125082 tok/s step 15887/19560 | loss 3.322279 (+0.58z)| norm 0.2304 (-1.03z)| lr 7.49e-05 | 4174.56 ms | 32.3% bf16 MFU | 125107 tok/s step 15888/19560 | loss 3.278445 (-0.48z)| norm 0.2259 (-1.45z)| lr 7.48e-05 | 4191.87 ms | 32.2% bf16 MFU | 125105 tok/s step 15889/19560 | loss 3.316123 (+0.43z)| norm 0.2331 (-0.79z)| lr 7.48e-05 | 4170.04 ms | 32.4% bf16 MFU | 125136 tok/s step 15890/19560 | loss 3.296900 (-0.04z)| norm 0.2336 (-0.75z)| lr 7.48e-05 | 4176.87 ms | 32.3% bf16 MFU | 125156 tok/s step 15891/19560 | loss 3.312548 (+0.34z)| norm 0.2497 (+0.76z)| lr 7.47e-05 | 4182.35 ms | 32.3% bf16 MFU | 125166 tok/s step 15892/19560 | loss 3.350798 (+1.26z)| norm 0.2235 (-1.70z)| 
lr 7.47e-05 | 4223.54 ms | 32.0% bf16 MFU | 125114 tok/s step 15893/19560 | loss 3.241041 (-1.39z)| norm 0.2413 (-0.04z)| lr 7.47e-05 | 4263.04 ms | 31.7% bf16 MFU | 125008 tok/s step 15894/19560 | loss 3.249430 (-1.17z)| norm 0.2323 (-0.88z)| lr 7.46e-05 | 4201.26 ms | 32.1% bf16 MFU | 124997 tok/s step 15895/19560 | loss 3.350945 (+1.25z)| norm 0.2346 (-0.66z)| lr 7.46e-05 | 4172.61 ms | 32.4% bf16 MFU | 125030 tok/s step 15896/19560 | loss 3.248991 (-1.17z)| norm 0.2366 (-0.47z)| lr 7.45e-05 | 4164.81 ms | 32.4% bf16 MFU | 125072 tok/s step 15897/19560 | loss 3.296123 (-0.02z)| norm 0.2439 (+0.26z)| lr 7.45e-05 | 4192.53 ms | 32.2% bf16 MFU | 125071 tok/s step 15898/19560 | loss 3.340779 (+1.11z)| norm 0.2358 (-0.55z)| lr 7.45e-05 | 4167.09 ms | 32.4% bf16 MFU | 125109 tok/s step 15899/19560 | loss 3.271420 (-0.67z)| norm 0.2514 (+0.98z)| lr 7.44e-05 | 4173.47 ms | 32.4% bf16 MFU | 125134 tok/s step 15900/19560 | loss 3.247473 (-1.27z)| norm 0.2464 (+0.49z)| lr 7.44e-05 | 4165.97 ms | 32.4% bf16 MFU | 125170 tok/s step 15901/19560 | loss 3.307633 (+0.28z)| norm 0.2600 (+1.80z)| lr 7.43e-05 | 4182.76 ms | 32.3% bf16 MFU | 125179 tok/s step 15902/19560 | loss 3.253907 (-1.09z)| norm 0.2448 (+0.31z)| lr 7.43e-05 | 4187.93 ms | 32.2% bf16 MFU | 125180 tok/s step 15903/19560 | loss 3.269896 (-0.67z)| norm 0.2564 (+1.42z)| lr 7.43e-05 | 4174.86 ms | 32.3% bf16 MFU | 125200 tok/s step 15904/19560 | loss 3.295539 (-0.00z)| norm 0.2519 (+0.96z)| lr 7.42e-05 | 4170.53 ms | 32.4% bf16 MFU | 125225 tok/s step 15905/19560 | loss 3.301642 (+0.17z)| norm 0.2450 (+0.31z)| lr 7.42e-05 | 4166.57 ms | 32.4% bf16 MFU | 125256 tok/s step 15906/19560 | loss 3.346084 (+1.32z)| norm 0.2603 (+1.79z)| lr 7.41e-05 | 4175.01 ms | 32.3% bf16 MFU | 125272 tok/s step 15907/19560 | loss 3.247916 (-1.26z)| norm 0.2422 (+0.04z)| lr 7.41e-05 | 4189.39 ms | 32.2% bf16 MFU | 125265 tok/s step 15908/19560 | loss 3.250775 (-1.17z)| norm 0.2423 (+0.05z)| lr 7.41e-05 | 4167.44 ms | 32.4% bf16 MFU | 
125292 tok/s step 15909/19560 | loss 3.289913 (-0.09z)| norm 0.2270 (-1.41z)| lr 7.40e-05 | 4165.94 ms | 32.4% bf16 MFU | 125320 tok/s step 15910/19560 | loss 3.288781 (-0.12z)| norm 0.2336 (-0.77z)| lr 7.40e-05 | 4169.24 ms | 32.4% bf16 MFU | 125342 tok/s step 15911/19560 | loss 3.272601 (-0.56z)| norm 0.2418 (+0.02z)| lr 7.39e-05 | 4184.60 ms | 32.3% bf16 MFU | 125339 tok/s step 15912/19560 | loss 3.273674 (-0.53z)| norm 0.2384 (-0.31z)| lr 7.39e-05 | 4178.90 ms | 32.3% bf16 MFU | 125345 tok/s step 15913/19560 | loss 3.225451 (-1.84z)| norm 0.2348 (-0.65z)| lr 7.39e-05 | 4197.84 ms | 32.2% bf16 MFU | 125323 tok/s step 15914/19560 | loss 3.437844 (+3.70z)| norm 0.2290 (-1.19z)| lr 7.38e-05 | 4254.29 ms | 31.7% bf16 MFU | 125219 tok/s step 15915/19560 | loss 3.267980 (-0.69z)| norm 0.3545 (+7.79z)| lr 7.38e-05 | 4174.50 ms | 32.3% bf16 MFU | 125237 tok/s step 15916/19560 | loss 3.258229 (-0.94z)| norm 0.2399 (-0.16z)| lr 7.37e-05 | 4174.92 ms | 32.3% bf16 MFU | 125254 tok/s step 15917/19560 | loss 3.285991 (-0.21z)| norm 0.2596 (+1.20z)| lr 7.37e-05 | 4165.05 ms | 32.4% bf16 MFU | 125286 tok/s step 15918/19560 | loss 3.260171 (-0.87z)| norm 0.2448 (+0.17z)| lr 7.37e-05 | 4166.95 ms | 32.4% bf16 MFU | 125312 tok/s step 15919/19560 | loss 3.234408 (-1.52z)| norm 0.2342 (-0.56z)| lr 7.36e-05 | 4174.69 ms | 32.3% bf16 MFU | 125326 tok/s step 15920/19560 | loss 3.227600 (-1.70z)| norm 0.2480 (+0.38z)| lr 7.36e-05 | 4178.96 ms | 32.3% bf16 MFU | 125333 tok/s step 15921/19560 | loss 3.264470 (-0.72z)| norm 0.2455 (+0.20z)| lr 7.36e-05 | 4181.61 ms | 32.3% bf16 MFU | 125335 tok/s step 15922/19560 | loss 3.503855 (+4.95z)| norm 0.2721 (+2.00z)| lr 7.35e-05 | 4182.81 ms | 32.3% bf16 MFU | 125336 tok/s step 15923/19560 | loss 3.300078 (+0.15z)| norm 0.2350 (-0.54z)| lr 7.35e-05 | 4178.59 ms | 32.3% bf16 MFU | 125342 tok/s step 15924/19560 | loss 3.269592 (-0.57z)| norm 0.2429 (+0.00z)| lr 7.34e-05 | 4176.19 ms | 32.3% bf16 MFU | 125352 tok/s step 15925/19560 | loss 3.378033 
(+1.95z)| norm 0.2372 (-0.38z)| lr 7.34e-05 | 4187.28 ms | 32.2% bf16 MFU | 125345 tok/s step 15926/19560 | loss 3.312557 (+0.41z)| norm 0.2379 (-0.34z)| lr 7.34e-05 | 4169.92 ms | 32.4% bf16 MFU | 125364 tok/s step 15927/19560 | loss 3.287028 (-0.19z)| norm 0.2426 (-0.01z)| lr 7.33e-05 | 4176.45 ms | 32.3% bf16 MFU | 125373 tok/s step 15928/19560 | loss 3.296537 (+0.03z)| norm 0.2367 (-0.41z)| lr 7.33e-05 | 4168.28 ms | 32.4% bf16 MFU | 125393 tok/s step 15929/19560 | loss 3.247815 (-1.10z)| norm 0.2429 (+0.01z)| lr 7.32e-05 | 4172.54 ms | 32.4% bf16 MFU | 125406 tok/s step 15930/19560 | loss 3.274964 (-0.48z)| norm 0.2412 (-0.10z)| lr 7.32e-05 | 4160.71 ms | 32.5% bf16 MFU | 125436 tok/s step 15931/19560 | loss 3.286960 (-0.20z)| norm 0.2338 (-0.61z)| lr 7.32e-05 | 4180.74 ms | 32.3% bf16 MFU | 125435 tok/s step 15932/19560 | loss 3.301637 (+0.15z)| norm 0.2349 (-0.53z)| lr 7.31e-05 | 4176.37 ms | 32.3% bf16 MFU | 125440 tok/s step 15933/19560 | loss 3.256753 (-0.90z)| norm 0.2259 (-1.13z)| lr 7.31e-05 | 4610.14 ms | 29.3% bf16 MFU | 124854 tok/s step 15934/19560 | loss 3.260034 (-0.81z)| norm 0.2245 (-1.22z)| lr 7.30e-05 | 4787.87 ms | 28.2% bf16 MFU | 124087 tok/s step 15935/19560 | loss 3.296872 (+0.04z)| norm 0.2314 (-0.73z)| lr 7.30e-05 | 4297.56 ms | 31.4% bf16 MFU | 123982 tok/s step 15936/19560 | loss 3.273708 (-0.51z)| norm 0.2302 (-0.81z)| lr 7.30e-05 | 4248.77 ms | 31.8% bf16 MFU | 123953 tok/s step 15937/19560 | loss 3.290705 (-0.11z)| norm 0.2240 (-1.22z)| lr 7.29e-05 | 4256.63 ms | 31.7% bf16 MFU | 123914 tok/s step 15938/19560 | loss 3.253227 (-0.98z)| norm 0.2343 (-0.51z)| lr 7.29e-05 | 4301.44 ms | 31.4% bf16 MFU | 123812 tok/s step 15939/19560 | loss 3.309604 (+0.34z)| norm 0.2281 (-0.92z)| lr 7.28e-05 | 4248.03 ms | 31.8% bf16 MFU | 123793 tok/s step 15940/19560 | loss 3.247267 (-1.11z)| norm 0.2344 (-0.48z)| lr 7.28e-05 | 4190.51 ms | 32.2% bf16 MFU | 123859 tok/s step 15941/19560 | loss 3.283607 (-0.24z)| norm 0.2421 (+0.05z)| lr 7.28e-05 | 
4261.76 ms | 31.7% bf16 MFU | 123817 tok/s step 15942/19560 | loss 3.330794 (+0.86z)| norm 0.2364 (-0.33z)| lr 7.27e-05 | 4193.78 ms | 32.2% bf16 MFU | 123877 tok/s step 15943/19560 | loss 3.274301 (-0.47z)| norm 0.2544 (+0.90z)| lr 7.27e-05 | 4198.86 ms | 32.2% bf16 MFU | 123926 tok/s step 15944/19560 | loss 3.223309 (-1.65z)| norm 0.2310 (-0.70z)| lr 7.27e-05 | 4215.00 ms | 32.0% bf16 MFU | 123949 tok/s step 15945/19560 | loss 3.337861 (+1.04z)| norm 0.2281 (-0.89z)| lr 7.26e-05 | 4169.63 ms | 32.4% bf16 MFU | 124039 tok/s step 15946/19560 | loss 3.347506 (+1.25z)| norm 0.2446 (+0.24z)| lr 7.26e-05 | 4213.84 ms | 32.0% bf16 MFU | 124058 tok/s step 15947/19560 | loss 3.274525 (-0.45z)| norm 0.2320 (-0.62z)| lr 7.25e-05 | 4164.69 ms | 32.4% bf16 MFU | 124149 tok/s step 15948/19560 | loss 3.309765 (+0.37z)| norm 0.2258 (-1.04z)| lr 7.25e-05 | 4256.72 ms | 31.7% bf16 MFU | 124100 tok/s step 15949/19560 | loss 3.332025 (+0.88z)| norm 0.2463 (+0.35z)| lr 7.25e-05 | 4164.66 ms | 32.4% bf16 MFU | 124190 tok/s step 15950/19560 | loss 3.284369 (-0.22z)| norm 0.2420 (+0.06z)| lr 7.24e-05 | 4194.14 ms | 32.2% bf16 MFU | 124230 tok/s step 15951/19560 | loss 3.261230 (-0.76z)| norm 0.2201 (-1.43z)| lr 7.24e-05 | 4159.71 ms | 32.5% bf16 MFU | 124321 tok/s step 15952/19560 | loss 3.301666 (+0.19z)| norm 0.2394 (-0.11z)| lr 7.23e-05 | 4160.02 ms | 32.5% bf16 MFU | 124406 tok/s step 15953/19560 | loss 3.219036 (-1.71z)| norm 0.2411 (+0.00z)| lr 7.23e-05 | 4164.01 ms | 32.4% bf16 MFU | 124482 tok/s step 15954/19560 | loss 3.343331 (+1.17z)| norm 0.2422 (+0.08z)| lr 7.23e-05 | 4159.10 ms | 32.5% bf16 MFU | 124560 tok/s step 15955/19560 | loss 3.316725 (+0.58z)| norm 0.2433 (+0.16z)| lr 7.22e-05 | 4164.17 ms | 32.4% bf16 MFU | 124628 tok/s step 15956/19560 | loss 3.360480 (+1.58z)| norm 0.2400 (-0.08z)| lr 7.22e-05 | 4185.52 ms | 32.3% bf16 MFU | 124659 tok/s step 15957/19560 | loss 3.285908 (-0.17z)| norm 0.2330 (-0.54z)| lr 7.21e-05 | 4256.19 ms | 31.7% bf16 MFU | 124585 tok/s step 
15958/19560 | loss 3.242049 (-1.18z)| norm 0.2543 (+0.90z)| lr 7.21e-05 | 4161.90 ms | 32.4% bf16 MFU | 124655 tok/s step 15959/19560 | loss 3.311777 (+0.44z)| norm 0.2405 (-0.04z)| lr 7.21e-05 | 4164.11 ms | 32.4% bf16 MFU | 124717 tok/s step 15960/19560 | loss 3.301539 (+0.20z)| norm 0.2288 (-0.83z)| lr 7.20e-05 | 4234.34 ms | 31.9% bf16 MFU | 124672 tok/s step 15961/19560 | loss 3.289080 (-0.09z)| norm 0.2277 (-0.89z)| lr 7.20e-05 | 4162.17 ms | 32.4% bf16 MFU | 124737 tok/s step 15962/19560 | loss 3.307687 (+0.34z)| norm 0.2461 (+0.36z)| lr 7.20e-05 | 4163.61 ms | 32.4% bf16 MFU | 124796 tok/s step 15963/19560 | loss 3.322494 (+0.68z)| norm 0.2416 (+0.06z)| lr 7.19e-05 | 4162.63 ms | 32.4% bf16 MFU | 124854 tok/s step 15964/19560 | loss 3.251463 (-0.97z)| norm 0.2305 (-0.69z)| lr 7.19e-05 | 4161.13 ms | 32.4% bf16 MFU | 124911 tok/s step 15965/19560 | loss 3.255482 (-0.87z)| norm 0.2297 (-0.73z)| lr 7.18e-05 | 4161.04 ms | 32.4% bf16 MFU | 124966 tok/s step 15966/19560 | loss 3.318045 (+0.58z)| norm 0.2237 (-1.13z)| lr 7.18e-05 | 4164.60 ms | 32.4% bf16 MFU | 125012 tok/s step 15967/19560 | loss 3.261397 (-0.72z)| norm 0.2378 (-0.18z)| lr 7.18e-05 | 4162.96 ms | 32.4% bf16 MFU | 125058 tok/s step 15968/19560 | loss 3.231050 (-1.40z)| norm 0.2277 (-0.85z)| lr 7.17e-05 | 4161.18 ms | 32.4% bf16 MFU | 125105 tok/s step 15969/19560 | loss 3.307429 (+0.36z)| norm 0.2280 (-0.81z)| lr 7.17e-05 | 4154.65 ms | 32.5% bf16 MFU | 125160 tok/s step 15970/19560 | loss 3.289441 (-0.05z)| norm 0.2390 (-0.07z)| lr 7.16e-05 | 4156.67 ms | 32.5% bf16 MFU | 125208 tok/s step 15971/19560 | loss 3.259738 (-0.73z)| norm 0.2292 (-0.72z)| lr 7.16e-05 | 4154.76 ms | 32.5% bf16 MFU | 125257 tok/s step 15972/19560 | loss 3.252899 (-0.89z)| norm 0.2349 (-0.32z)| lr 7.16e-05 | 4158.12 ms | 32.5% bf16 MFU | 125299 tok/s step 15973/19560 | loss 3.322346 (+0.74z)| norm 0.2208 (-1.27z)| lr 7.15e-05 | 4152.96 ms | 32.5% bf16 MFU | 125346 tok/s step 15974/19560 | loss 3.217150 (-1.70z)| norm 
0.2591 (+1.36z)| lr 7.15e-05 | 4159.15 ms | 32.5% bf16 MFU | 125382 tok/s step 15975/19560 | loss 3.305698 (+0.36z)| norm 0.2438 (+0.33z)| lr 7.15e-05 | 4158.82 ms | 32.5% bf16 MFU | 125416 tok/s step 15976/19560 | loss 3.304994 (+0.35z)| norm 0.2360 (-0.22z)| lr 7.14e-05 | 4156.54 ms | 32.5% bf16 MFU | 125452 tok/s step 15977/19560 | loss 3.315863 (+0.59z)| norm 0.2412 (+0.16z)| lr 7.14e-05 | 4158.22 ms | 32.5% bf16 MFU | 125483 tok/s step 15978/19560 | loss 3.314806 (+0.57z)| norm 0.2446 (+0.42z)| lr 7.13e-05 | 4161.03 ms | 32.4% bf16 MFU | 125509 tok/s step 15979/19560 | loss 3.278835 (-0.25z)| norm 0.2415 (+0.20z)| lr 7.13e-05 | 4157.85 ms | 32.5% bf16 MFU | 125539 tok/s step 15980/19560 | loss 3.327019 (+0.88z)| norm 0.2428 (+0.28z)| lr 7.13e-05 | 4157.55 ms | 32.5% bf16 MFU | 125567 tok/s step 15981/19560 | loss 3.290362 (+0.03z)| norm 0.2412 (+0.18z)| lr 7.12e-05 | 4158.25 ms | 32.5% bf16 MFU | 125593 tok/s step 15982/19560 | loss 3.285149 (-0.10z)| norm 0.2917 (+3.61z)| lr 7.12e-05 | 4158.45 ms | 32.5% bf16 MFU | 125617 tok/s step 15983/19560 | loss 3.263984 (-0.60z)| norm 0.2604 (+1.43z)| lr 7.11e-05 | 4157.42 ms | 32.5% bf16 MFU | 125642 tok/s step 15984/19560 | loss 3.322830 (+0.79z)| norm 0.2414 (+0.13z)| lr 7.11e-05 | 4157.10 ms | 32.5% bf16 MFU | 125665 tok/s step 15985/19560 | loss 3.305298 (+0.39z)| norm 0.2613 (+1.47z)| lr 7.11e-05 | 4158.57 ms | 32.5% bf16 MFU | 125686 tok/s step 15986/19560 | loss 3.280208 (-0.21z)| norm 0.2526 (+0.88z)| lr 7.10e-05 | 4164.47 ms | 32.4% bf16 MFU | 125696 tok/s step 15987/19560 | loss 3.261813 (-0.65z)| norm 0.2345 (-0.35z)| lr 7.10e-05 | 4158.88 ms | 32.5% bf16 MFU | 125715 tok/s step 15988/19560 | loss 3.281945 (-0.19z)| norm 0.2322 (-0.51z)| lr 7.10e-05 | 4160.54 ms | 32.5% bf16 MFU | 125730 tok/s step 15989/19560 | loss 3.366014 (+1.81z)| norm 0.2441 (+0.30z)| lr 7.09e-05 | 4161.17 ms | 32.4% bf16 MFU | 125743 tok/s step 15990/19560 | loss 3.267533 (-0.55z)| norm 0.2397 (+0.00z)| lr 7.09e-05 | 4159.34 ms | 
32.5% bf16 MFU | 125758 tok/s step 15991/19560 | loss 3.342254 (+1.23z)| norm 0.2432 (+0.23z)| lr 7.08e-05 | 4192.90 ms | 32.2% bf16 MFU | 125723 tok/s step 15992/19560 | loss 3.338447 (+1.12z)| norm 0.2453 (+0.36z)| lr 7.08e-05 | 4154.79 ms | 32.5% bf16 MFU | 125746 tok/s step 15993/19560 | loss 3.329533 (+0.89z)| norm 0.2467 (+0.46z)| lr 7.08e-05 | 4164.58 ms | 32.4% bf16 MFU | 125753 tok/s step 15994/19560 | loss 3.270581 (-0.49z)| norm 0.2309 (-0.61z)| lr 7.07e-05 | 4154.86 ms | 32.5% bf16 MFU | 125775 tok/s step 15995/19560 | loss 3.294553 (+0.07z)| norm 0.2563 (+1.09z)| lr 7.07e-05 | 4154.78 ms | 32.5% bf16 MFU | 125796 tok/s step 15996/19560 | loss 3.333987 (+1.00z)| norm 0.2262 (-0.93z)| lr 7.06e-05 | 4159.48 ms | 32.5% bf16 MFU | 125808 tok/s step 15997/19560 | loss 3.310534 (+0.45z)| norm 0.2320 (-0.53z)| lr 7.06e-05 | 4167.38 ms | 32.4% bf16 MFU | 125808 tok/s step 15998/19560 | loss 3.270550 (-0.52z)| norm 0.2256 (-0.96z)| lr 7.06e-05 | 4160.50 ms | 32.5% bf16 MFU | 125818 tok/s step 15999/19560 | loss 3.299109 (+0.18z)| norm 0.2243 (-1.04z)| lr 7.05e-05 | 4162.39 ms | 32.4% bf16 MFU | 125825 tok/s step 16000/19560 | loss 3.303727 (+0.28z)| norm 0.2210 (-1.24z)| lr 7.05e-05 | 4161.31 ms | 32.4% bf16 MFU | 125834 tok/s val loss 3.270629 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 
230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 
evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3032/10042 = 0.301932 step 16001/19560 | loss 3.319155 (+0.65z)| norm 0.2299 (-0.64z)| lr 7.05e-05 | 4161.01 ms | 32.4% bf16 MFU | 125842 tok/s step 16002/19560 | loss 3.312440 (+0.48z)| norm 0.2353 (-0.29z)| lr 7.04e-05 | 4160.36 ms | 32.5% bf16 MFU | 125851 tok/s step 16003/19560 | loss 3.300082 (+0.17z)| norm 0.2655 (+1.69z)| lr 7.04e-05 | 4163.73 ms | 32.4% bf16 MFU | 125854 tok/s step 16004/19560 | loss 3.327801 (+0.85z)| norm 0.2386 (-0.07z)| lr 7.03e-05 | 4160.77 ms | 32.5% bf16 MFU | 125862 tok/s step 16005/19560 | loss 3.345657 (+1.27z)| norm 0.2404 (+0.04z)| lr 7.03e-05 | 4157.79 ms | 32.5% bf16 MFU | 125874 tok/s step 16006/19560 | loss 3.285422 (-0.18z)| norm 0.2686 (+1.86z)| lr 7.03e-05 | 4161.57 ms | 32.4% bf16 MFU | 125879 tok/s step 16007/19560 | loss 
3.295767 (+0.07z)| norm 0.2264 (-0.90z)| lr 7.02e-05 | 4162.47 ms | 32.4% bf16 MFU | 125883 tok/s step 16008/19560 | loss 3.321232 (+0.68z)| norm 0.2339 (-0.41z)| lr 7.02e-05 | 4160.11 ms | 32.5% bf16 MFU | 125890 tok/s step 16009/19560 | loss 3.292975 (-0.02z)| norm 0.2490 (+0.57z)| lr 7.01e-05 | 4160.20 ms | 32.5% bf16 MFU | 125897 tok/s step 16010/19560 | loss 3.317810 (+0.60z)| norm 0.2541 (+0.90z)| lr 7.01e-05 | 4158.53 ms | 32.5% bf16 MFU | 125906 tok/s step 16011/19560 | loss 3.276445 (-0.43z)| norm 0.2414 (+0.06z)| lr 7.01e-05 | 4158.35 ms | 32.5% bf16 MFU | 125915 tok/s step 16012/19560 | loss 3.263882 (-0.74z)| norm 0.2317 (-0.58z)| lr 7.00e-05 | 4157.90 ms | 32.5% bf16 MFU | 125924 tok/s step 16013/19560 | loss 3.274724 (-0.46z)| norm 0.2473 (+0.44z)| lr 7.00e-05 | 4163.62 ms | 32.4% bf16 MFU | 125924 tok/s step 16014/19560 | loss 3.225024 (-1.66z)| norm 0.2205 (-1.30z)| lr 7.00e-05 | 4160.04 ms | 32.5% bf16 MFU | 125929 tok/s step 16015/19560 | loss 3.276402 (-0.39z)| norm 0.2421 (+0.10z)| lr 6.99e-05 | 4158.79 ms | 32.5% bf16 MFU | 125936 tok/s step 16016/19560 | loss 3.253428 (-0.95z)| norm 0.2210 (-1.27z)| lr 6.99e-05 | 4158.91 ms | 32.5% bf16 MFU | 125942 tok/s step 16017/19560 | loss 3.250296 (-1.01z)| norm 0.2306 (-0.64z)| lr 6.98e-05 | 4158.25 ms | 32.5% bf16 MFU | 125949 tok/s step 16018/19560 | loss 3.272297 (-0.47z)| norm 0.2425 (+0.13z)| lr 6.98e-05 | 4161.11 ms | 32.4% bf16 MFU | 125952 tok/s step 16019/19560 | loss 3.282735 (-0.21z)| norm 0.2277 (-0.82z)| lr 6.98e-05 | 4162.69 ms | 32.4% bf16 MFU | 125952 tok/s step 16020/19560 | loss 3.239466 (-1.25z)| norm 0.2325 (-0.52z)| lr 6.97e-05 | 4163.48 ms | 32.4% bf16 MFU | 125950 tok/s step 16021/19560 | loss 3.260840 (-0.73z)| norm 0.2277 (-0.82z)| lr 6.97e-05 | 4158.60 ms | 32.5% bf16 MFU | 125956 tok/s step 16022/19560 | loss 3.252423 (-0.94z)| norm 0.2310 (-0.61z)| lr 6.96e-05 | 4157.75 ms | 32.5% bf16 MFU | 125964 tok/s step 16023/19560 | loss 3.313026 (+0.56z)| norm 0.2276 (-0.82z)| lr 
6.96e-05 | 4159.12 ms | 32.5% bf16 MFU | 125968 tok/s step 16024/19560 | loss 3.296895 (+0.15z)| norm 0.2262 (-0.90z)| lr 6.96e-05 | 4161.36 ms | 32.4% bf16 MFU | 125969 tok/s step 16025/19560 | loss 3.300821 (+0.25z)| norm 0.2299 (-0.65z)| lr 6.95e-05 | 4165.66 ms | 32.4% bf16 MFU | 125964 tok/s step 16026/19560 | loss 3.265256 (-0.62z)| norm 0.2358 (-0.28z)| lr 6.95e-05 | 4158.88 ms | 32.5% bf16 MFU | 125969 tok/s step 16027/19560 | loss 3.286082 (-0.11z)| norm 0.2235 (-1.06z)| lr 6.95e-05 | 4161.27 ms | 32.4% bf16 MFU | 125970 tok/s step 16028/19560 | loss 3.304362 (+0.34z)| norm 0.2456 (+0.38z)| lr 6.94e-05 | 4160.59 ms | 32.5% bf16 MFU | 125972 tok/s step 16029/19560 | loss 3.309318 (+0.47z)| norm 0.2379 (-0.11z)| lr 6.94e-05 | 4163.02 ms | 32.4% bf16 MFU | 125970 tok/s step 16030/19560 | loss 3.341062 (+1.24z)| norm 0.2485 (+0.58z)| lr 6.93e-05 | 4155.80 ms | 32.5% bf16 MFU | 125980 tok/s step 16031/19560 | loss 3.318642 (+0.67z)| norm 0.2381 (-0.09z)| lr 6.93e-05 | 4156.43 ms | 32.5% bf16 MFU | 125988 tok/s step 16032/19560 | loss 3.220319 (-1.75z)| norm 0.2454 (+0.39z)| lr 6.93e-05 | 4159.80 ms | 32.5% bf16 MFU | 125990 tok/s step 16033/19560 | loss 3.278821 (-0.30z)| norm 0.2349 (-0.29z)| lr 6.92e-05 | 4160.88 ms | 32.4% bf16 MFU | 125991 tok/s step 16034/19560 | loss 3.247202 (-1.07z)| norm 0.2481 (+0.58z)| lr 6.92e-05 | 4162.36 ms | 32.4% bf16 MFU | 125989 tok/s step 16035/19560 | loss 3.303804 (+0.33z)| norm 0.2503 (+0.72z)| lr 6.92e-05 | 4160.55 ms | 32.5% bf16 MFU | 125991 tok/s step 16036/19560 | loss 3.316758 (+0.64z)| norm 0.2396 (+0.02z)| lr 6.91e-05 | 4157.27 ms | 32.5% bf16 MFU | 125997 tok/s step 16037/19560 | loss 3.336218 (+1.11z)| norm 0.2502 (+0.70z)| lr 6.91e-05 | 4159.77 ms | 32.5% bf16 MFU | 125999 tok/s step 16038/19560 | loss 3.187118 (-2.51z)| norm 0.2271 (-0.81z)| lr 6.90e-05 | 4157.83 ms | 32.5% bf16 MFU | 126004 tok/s step 16039/19560 | loss 3.312128 (+0.51z)| norm 0.2275 (-0.78z)| lr 6.90e-05 | 4170.84 ms | 32.4% bf16 MFU | 125989 
tok/s step 16040/19560 | loss 3.242734 (-1.16z)| norm 0.2309 (-0.55z)| lr 6.90e-05 | 4158.85 ms | 32.5% bf16 MFU | 125992 tok/s step 16041/19560 | loss 3.288374 (-0.07z)| norm 0.2256 (-0.89z)| lr 6.89e-05 | 4159.57 ms | 32.5% bf16 MFU | 125995 tok/s step 16042/19560 | loss 3.247056 (-1.09z)| norm 0.2230 (-1.05z)| lr 6.89e-05 | 4158.55 ms | 32.5% bf16 MFU | 125999 tok/s step 16043/19560 | loss 3.283561 (-0.16z)| norm 0.2441 (+0.50z)| lr 6.88e-05 | 4161.16 ms | 32.4% bf16 MFU | 125999 tok/s step 16044/19560 | loss 3.306827 (+0.42z)| norm 0.2353 (-0.26z)| lr 6.88e-05 | 4158.09 ms | 32.5% bf16 MFU | 126003 tok/s step 16045/19560 | loss 3.278634 (-0.30z)| norm 0.2404 (+0.21z)| lr 6.88e-05 | 4158.07 ms | 32.5% bf16 MFU | 126008 tok/s step 16046/19560 | loss 3.285030 (-0.14z)| norm 0.2433 (+0.46z)| lr 6.87e-05 | 4159.96 ms | 32.5% bf16 MFU | 126009 tok/s step 16047/19560 | loss 3.291788 (+0.02z)| norm 0.2451 (+0.61z)| lr 6.87e-05 | 4155.06 ms | 32.5% bf16 MFU | 126017 tok/s step 16048/19560 | loss 3.274554 (-0.44z)| norm 0.2318 (-0.56z)| lr 6.87e-05 | 4155.85 ms | 32.5% bf16 MFU | 126024 tok/s step 16049/19560 | loss 3.277216 (-0.37z)| norm 0.2380 (+0.00z)| lr 6.86e-05 | 4158.30 ms | 32.5% bf16 MFU | 126027 tok/s step 16050/19560 | loss 3.252736 (-1.10z)| norm 0.2343 (-0.31z)| lr 6.86e-05 | 4158.92 ms | 32.5% bf16 MFU | 126029 tok/s step 16051/19560 | loss 3.292734 (+0.10z)| norm 0.2396 (+0.18z)| lr 6.85e-05 | 4157.16 ms | 32.5% bf16 MFU | 126034 tok/s step 16052/19560 | loss 3.295877 (+0.19z)| norm 0.2306 (-0.65z)| lr 6.85e-05 | 4160.07 ms | 32.5% bf16 MFU | 126033 tok/s step 16053/19560 | loss 3.282235 (-0.20z)| norm 0.2401 (+0.23z)| lr 6.85e-05 | 4160.18 ms | 32.5% bf16 MFU | 126033 tok/s step 16054/19560 | loss 3.303239 (+0.45z)| norm 0.2472 (+0.87z)| lr 6.84e-05 | 4161.01 ms | 32.4% bf16 MFU | 126031 tok/s step 16055/19560 | loss 3.388187 (+2.94z)| norm 0.2382 (+0.05z)| lr 6.84e-05 | 4160.21 ms | 32.5% bf16 MFU | 126031 tok/s step 16056/19560 | loss 3.255449 
(-1.00z)| norm 0.2506 (+1.17z)| lr 6.84e-05 | 4160.18 ms | 32.5% bf16 MFU | 126031 tok/s step 16057/19560 | loss 3.262471 (-0.80z)| norm 0.2457 (+0.72z)| lr 6.83e-05 | 4160.35 ms | 32.5% bf16 MFU | 126030 tok/s step 16058/19560 | loss 3.217037 (-2.10z)| norm 0.2464 (+0.78z)| lr 6.83e-05 | 4159.72 ms | 32.5% bf16 MFU | 126031 tok/s step 16059/19560 | loss 3.271057 (-0.52z)| norm 0.2597 (+1.94z)| lr 6.82e-05 | 4162.05 ms | 32.4% bf16 MFU | 126027 tok/s step 16060/19560 | loss 3.268987 (-0.57z)| norm 0.2408 (+0.24z)| lr 6.82e-05 | 4159.85 ms | 32.5% bf16 MFU | 126028 tok/s step 16061/19560 | loss 3.293399 (+0.13z)| norm 0.2325 (-0.51z)| lr 6.82e-05 | 4160.90 ms | 32.4% bf16 MFU | 126027 tok/s step 16062/19560 | loss 3.207651 (-2.32z)| norm 0.2607 (+1.98z)| lr 6.81e-05 | 4157.35 ms | 32.5% bf16 MFU | 126031 tok/s step 16063/19560 | loss 3.256402 (-0.91z)| norm 0.2368 (-0.15z)| lr 6.81e-05 | 4155.08 ms | 32.5% bf16 MFU | 126038 tok/s step 16064/19560 | loss 3.321910 (+0.96z)| norm 0.2368 (-0.16z)| lr 6.81e-05 | 4156.04 ms | 32.5% bf16 MFU | 126044 tok/s step 16065/19560 | loss 3.289264 (+0.02z)| norm 0.2354 (-0.29z)| lr 6.80e-05 | 4156.06 ms | 32.5% bf16 MFU | 126049 tok/s step 16066/19560 | loss 3.290481 (+0.05z)| norm 0.2399 (+0.11z)| lr 6.80e-05 | 4158.01 ms | 32.5% bf16 MFU | 126051 tok/s step 16067/19560 | loss 3.311647 (+0.66z)| norm 0.2383 (-0.04z)| lr 6.79e-05 | 4155.14 ms | 32.5% bf16 MFU | 126058 tok/s step 16068/19560 | loss 3.290710 (+0.05z)| norm 0.2411 (+0.20z)| lr 6.79e-05 | 4159.88 ms | 32.5% bf16 MFU | 126057 tok/s step 16069/19560 | loss 3.346787 (+1.63z)| norm 0.2310 (-0.70z)| lr 6.79e-05 | 4160.38 ms | 32.5% bf16 MFU | 126055 tok/s step 16070/19560 | loss 3.326906 (+1.07z)| norm 0.2354 (-0.30z)| lr 6.78e-05 | 4158.53 ms | 32.5% bf16 MFU | 126056 tok/s step 16071/19560 | loss 3.290188 (+0.01z)| norm 0.2373 (-0.11z)| lr 6.78e-05 | 4159.56 ms | 32.5% bf16 MFU | 126055 tok/s step 16072/19560 | loss 3.262838 (-0.79z)| norm 0.2422 (+0.32z)| lr 6.78e-05 | 
4156.63 ms | 32.5% bf16 MFU | 126059 tok/s step 16073/19560 | loss 3.304965 (+0.44z)| norm 0.2571 (+1.65z)| lr 6.77e-05 | 4159.17 ms | 32.5% bf16 MFU | 126059 tok/s step 16074/19560 | loss 3.304560 (+0.45z)| norm 0.2527 (+1.24z)| lr 6.77e-05 | 4155.55 ms | 32.5% bf16 MFU | 126064 tok/s step 16075/19560 | loss 3.258187 (-0.91z)| norm 0.2373 (-0.15z)| lr 6.76e-05 | 4157.34 ms | 32.5% bf16 MFU | 126067 tok/s step 16076/19560 | loss 3.295389 (+0.18z)| norm 0.2476 (+0.77z)| lr 6.76e-05 | 4156.08 ms | 32.5% bf16 MFU | 126071 tok/s step 16077/19560 | loss 3.294181 (+0.16z)| norm 0.2480 (+0.80z)| lr 6.76e-05 | 4158.58 ms | 32.5% bf16 MFU | 126071 tok/s step 16078/19560 | loss 3.387887 (+2.81z)| norm 0.2481 (+0.80z)| lr 6.75e-05 | 4156.31 ms | 32.5% bf16 MFU | 126074 tok/s step 16079/19560 | loss 3.295981 (+0.17z)| norm 0.2376 (-0.16z)| lr 6.75e-05 | 4158.29 ms | 32.5% bf16 MFU | 126075 tok/s step 16080/19560 | loss 3.315980 (+0.74z)| norm 0.2456 (+0.56z)| lr 6.74e-05 | 4156.59 ms | 32.5% bf16 MFU | 126078 tok/s step 16081/19560 | loss 3.289827 (-0.02z)| norm 0.2506 (+1.01z)| lr 6.74e-05 | 4156.84 ms | 32.5% bf16 MFU | 126080 tok/s step 16082/19560 | loss 3.317820 (+0.80z)| norm 0.2362 (-0.29z)| lr 6.74e-05 | 4158.15 ms | 32.5% bf16 MFU | 126081 tok/s step 16083/19560 | loss 3.380640 (+2.57z)| norm 0.2395 (+0.01z)| lr 6.73e-05 | 4160.31 ms | 32.5% bf16 MFU | 126078 tok/s step 16084/19560 | loss 3.277734 (-0.36z)| norm 0.2492 (+0.88z)| lr 6.73e-05 | 4156.37 ms | 32.5% bf16 MFU | 126081 tok/s step 16085/19560 | loss 3.308865 (+0.54z)| norm 0.2345 (-0.45z)| lr 6.73e-05 | 4159.37 ms | 32.5% bf16 MFU | 126079 tok/s step 16086/19560 | loss 3.259212 (-0.92z)| norm 0.2400 (+0.06z)| lr 6.72e-05 | 4162.68 ms | 32.4% bf16 MFU | 126073 tok/s step 16087/19560 | loss 3.283820 (-0.19z)| norm 0.2299 (-0.86z)| lr 6.72e-05 | 4156.46 ms | 32.5% bf16 MFU | 126076 tok/s step 16088/19560 | loss 3.292818 (+0.07z)| norm 0.2440 (+0.42z)| lr 6.71e-05 | 4159.11 ms | 32.5% bf16 MFU | 126075 tok/s step 
16089/19560 | loss 3.285587 (-0.14z)| norm 0.2396 (+0.01z)| lr 6.71e-05 | 4156.83 ms | 32.5% bf16 MFU | 126078 tok/s step 16090/19560 | loss 3.340430 (+1.45z)| norm 0.2343 (-0.47z)| lr 6.71e-05 | 4159.43 ms | 32.5% bf16 MFU | 126076 tok/s step 16091/19560 | loss 3.299438 (+0.26z)| norm 0.2695 (+2.67z)| lr 6.70e-05 | 4156.99 ms | 32.5% bf16 MFU | 126078 tok/s step 16092/19560 | loss 3.287357 (-0.10z)| norm 0.2479 (+0.73z)| lr 6.70e-05 | 4156.68 ms | 32.5% bf16 MFU | 126081 tok/s step 16093/19560 | loss 3.334767 (+1.27z)| norm 0.2326 (-0.64z)| lr 6.70e-05 | 4159.89 ms | 32.5% bf16 MFU | 126079 tok/s step 16094/19560 | loss 3.280201 (-0.31z)| norm 0.2515 (+1.03z)| lr 6.69e-05 | 4158.95 ms | 32.5% bf16 MFU | 126078 tok/s step 16095/19560 | loss 3.280872 (-0.30z)| norm 0.2486 (+0.77z)| lr 6.69e-05 | 4160.18 ms | 32.5% bf16 MFU | 126075 tok/s step 16096/19560 | loss 3.296272 (+0.14z)| norm 0.2282 (-1.07z)| lr 6.68e-05 | 4157.82 ms | 32.5% bf16 MFU | 126076 tok/s step 16097/19560 | loss 3.282449 (-0.27z)| norm 0.2405 (+0.02z)| lr 6.68e-05 | 4159.84 ms | 32.5% bf16 MFU | 126074 tok/s step 16098/19560 | loss 3.282965 (-0.25z)| norm 0.2511 (+0.97z)| lr 6.68e-05 | 4155.58 ms | 32.5% bf16 MFU | 126079 tok/s step 16099/19560 | loss 3.314470 (+0.68z)| norm 0.2292 (-0.99z)| lr 6.67e-05 | 4158.73 ms | 32.5% bf16 MFU | 126078 tok/s step 16100/19560 | loss 3.283194 (-0.26z)| norm 0.2527 (+1.10z)| lr 6.67e-05 | 4159.04 ms | 32.5% bf16 MFU | 126077 tok/s step 16101/19560 | loss 3.346690 (+1.62z)| norm 0.2406 (+0.00z)| lr 6.67e-05 | 4157.01 ms | 32.5% bf16 MFU | 126080 tok/s step 16102/19560 | loss 3.305864 (+0.39z)| norm 0.2265 (-1.25z)| lr 6.66e-05 | 4157.81 ms | 32.5% bf16 MFU | 126081 tok/s step 16103/19560 | loss 3.297293 (+0.13z)| norm 0.2382 (-0.19z)| lr 6.66e-05 | 4155.15 ms | 32.5% bf16 MFU | 126085 tok/s step 16104/19560 | loss 3.304381 (+0.35z)| norm 0.2495 (+0.83z)| lr 6.65e-05 | 4158.39 ms | 32.5% bf16 MFU | 126085 tok/s step 16105/19560 | loss 3.260526 (-0.96z)| norm 
0.2319 (-0.76z)| lr 6.65e-05 | 4158.96 ms | 32.5% bf16 MFU | 126084 tok/s step 16106/19560 | loss 3.253292 (-1.16z)| norm 0.2460 (+0.51z)| lr 6.65e-05 | 4158.00 ms | 32.5% bf16 MFU | 126084 tok/s step 16107/19560 | loss 3.284583 (-0.22z)| norm 0.2224 (-1.59z)| lr 6.64e-05 | 4157.03 ms | 32.5% bf16 MFU | 126086 tok/s step 16108/19560 | loss 3.275983 (-0.47z)| norm 0.2370 (-0.28z)| lr 6.64e-05 | 4154.36 ms | 32.5% bf16 MFU | 126092 tok/s step 16109/19560 | loss 3.263128 (-0.85z)| norm 0.2439 (+0.33z)| lr 6.64e-05 | 4157.75 ms | 32.5% bf16 MFU | 126092 tok/s step 16110/19560 | loss 3.256377 (-1.04z)| norm 0.2383 (-0.15z)| lr 6.63e-05 | 4158.12 ms | 32.5% bf16 MFU | 126092 tok/s step 16111/19560 | loss 3.286406 (-0.15z)| norm 0.2339 (-0.56z)| lr 6.63e-05 | 4157.43 ms | 32.5% bf16 MFU | 126093 tok/s step 16112/19560 | loss 3.352840 (+1.83z)| norm 0.2502 (+1.05z)| lr 6.63e-05 | 4158.70 ms | 32.5% bf16 MFU | 126092 tok/s step 16113/19560 | loss 3.266228 (-0.74z)| norm 0.2391 (-0.03z)| lr 6.62e-05 | 4154.60 ms | 32.5% bf16 MFU | 126097 tok/s step 16114/19560 | loss 3.289735 (-0.05z)| norm 0.2455 (+0.62z)| lr 6.62e-05 | 4157.71 ms | 32.5% bf16 MFU | 126097 tok/s step 16115/19560 | loss 3.257684 (-1.00z)| norm 0.2399 (+0.05z)| lr 6.61e-05 | 4157.05 ms | 32.5% bf16 MFU | 126098 tok/s step 16116/19560 | loss 3.272642 (-0.55z)| norm 0.2460 (+0.66z)| lr 6.61e-05 | 4157.70 ms | 32.5% bf16 MFU | 126098 tok/s step 16117/19560 | loss 3.279080 (-0.35z)| norm 0.2340 (-0.55z)| lr 6.61e-05 | 4154.62 ms | 32.5% bf16 MFU | 126103 tok/s step 16118/19560 | loss 3.277657 (-0.39z)| norm 0.2347 (-0.48z)| lr 6.60e-05 | 4157.23 ms | 32.5% bf16 MFU | 126104 tok/s step 16119/19560 | loss 3.244925 (-1.37z)| norm 0.2337 (-0.57z)| lr 6.60e-05 | 4160.54 ms | 32.5% bf16 MFU | 126099 tok/s step 16120/19560 | loss 3.318514 (+0.88z)| norm 0.2318 (-0.76z)| lr 6.60e-05 | 4158.59 ms | 32.5% bf16 MFU | 126098 tok/s step 16121/19560 | loss 3.277844 (-0.35z)| norm 0.2395 (+0.03z)| lr 6.59e-05 | 4156.76 ms | 
32.5% bf16 MFU | 126099 tok/s step 16122/19560 | loss 3.276719 (-0.39z)| norm 0.2285 (-1.08z)| lr 6.59e-05 | 4158.33 ms | 32.5% bf16 MFU | 126099 tok/s step 16123/19560 | loss 3.312180 (+0.70z)| norm 0.2288 (-1.04z)| lr 6.58e-05 | 4156.91 ms | 32.5% bf16 MFU | 126100 tok/s step 16124/19560 | loss 3.268047 (-0.65z)| norm 0.2423 (+0.33z)| lr 6.58e-05 | 4685.66 ms | 28.8% bf16 MFU | 125389 tok/s step 16125/19560 | loss 3.241426 (-1.44z)| norm 0.2383 (-0.09z)| lr 6.58e-05 | 4740.24 ms | 28.5% bf16 MFU | 124650 tok/s step 16126/19560 | loss 3.325874 (+1.13z)| norm 0.2628 (+2.38z)| lr 6.57e-05 | 4472.61 ms | 30.2% bf16 MFU | 124279 tok/s step 16127/19560 | loss 3.310430 (+0.66z)| norm 0.2368 (-0.28z)| lr 6.57e-05 | 4237.68 ms | 31.9% bf16 MFU | 124251 tok/s step 16128/19560 | loss 3.337549 (+1.47z)| norm 0.2378 (-0.20z)| lr 6.57e-05 | 4213.86 ms | 32.0% bf16 MFU | 124259 tok/s step 16129/19560 | loss 3.286714 (-0.07z)| norm 0.2360 (-0.39z)| lr 6.56e-05 | 4204.81 ms | 32.1% bf16 MFU | 124281 tok/s step 16130/19560 | loss 3.408341 (+3.44z)| norm 0.2562 (+1.69z)| lr 6.56e-05 | 4186.12 ms | 32.3% bf16 MFU | 124329 tok/s step 16131/19560 | loss 3.335916 (+1.32z)| norm 0.2288 (-1.14z)| lr 6.55e-05 | 4200.48 ms | 32.1% bf16 MFU | 124353 tok/s step 16132/19560 | loss 3.237960 (-1.47z)| norm 0.2473 (+0.81z)| lr 6.55e-05 | 4220.83 ms | 32.0% bf16 MFU | 124346 tok/s step 16133/19560 | loss 3.292919 (+0.12z)| norm 0.2358 (-0.40z)| lr 6.55e-05 | 4161.44 ms | 32.4% bf16 MFU | 124428 tok/s step 16134/19560 | loss 3.304039 (+0.44z)| norm 0.2341 (-0.57z)| lr 6.54e-05 | 4177.12 ms | 32.3% bf16 MFU | 124483 tok/s step 16135/19560 | loss 3.262687 (-0.75z)| norm 0.2350 (-0.48z)| lr 6.54e-05 | 4199.76 ms | 32.1% bf16 MFU | 124500 tok/s step 16136/19560 | loss 3.313867 (+0.73z)| norm 0.2400 (+0.06z)| lr 6.54e-05 | 4162.26 ms | 32.4% bf16 MFU | 124573 tok/s step 16137/19560 | loss 3.282443 (-0.18z)| norm 0.2329 (-0.71z)| lr 6.53e-05 | 4193.01 ms | 32.2% bf16 MFU | 124597 tok/s step 16138/19560 
| loss 3.259381 (-0.83z)| norm 0.2152 (-2.59z)| lr 6.53e-05 | 4160.13 ms | 32.5% bf16 MFU | 124668 tok/s step 16139/19560 | loss 3.301405 (+0.38z)| norm 0.2327 (-0.67z)| lr 6.52e-05 | 4206.10 ms | 32.1% bf16 MFU | 124667 tok/s step 16140/19560 | loss 3.314616 (+0.75z)| norm 0.2489 (+1.06z)| lr 6.52e-05 | 4393.60 ms | 30.7% bf16 MFU | 124400 tok/s step 16141/19560 | loss 3.270458 (-0.53z)| norm 0.2276 (-1.22z)| lr 6.52e-05 | 4159.02 ms | 32.5% bf16 MFU | 124483 tok/s step 16142/19560 | loss 3.295675 (+0.19z)| norm 0.2266 (-1.35z)| lr 6.51e-05 | 4159.60 ms | 32.5% bf16 MFU | 124561 tok/s step 16143/19560 | loss 3.337718 (+1.40z)| norm 0.2471 (+0.88z)| lr 6.51e-05 | 4192.82 ms | 32.2% bf16 MFU | 124586 tok/s step 16144/19560 | loss 3.330659 (+1.17z)| norm 0.2299 (-1.01z)| lr 6.51e-05 | 4183.76 ms | 32.3% bf16 MFU | 124622 tok/s step 16145/19560 | loss 3.282821 (-0.23z)| norm 0.2336 (-0.60z)| lr 6.50e-05 | 4162.22 ms | 32.4% bf16 MFU | 124689 tok/s step 16146/19560 | loss 3.417027 (+3.49z)| norm 0.2522 (+1.42z)| lr 6.50e-05 | 4160.55 ms | 32.5% bf16 MFU | 124755 tok/s step 16147/19560 | loss 3.322040 (+0.83z)| norm 0.2427 (+0.37z)| lr 6.50e-05 | 4210.44 ms | 32.1% bf16 MFU | 124744 tok/s step 16148/19560 | loss 3.359757 (+1.85z)| norm 0.2399 (+0.06z)| lr 6.49e-05 | 4178.35 ms | 32.3% bf16 MFU | 124780 tok/s step 16149/19560 | loss 3.331505 (+1.05z)| norm 0.2375 (-0.21z)| lr 6.49e-05 | 4200.12 ms | 32.1% bf16 MFU | 124783 tok/s step 16150/19560 | loss 3.347180 (+1.46z)| norm 0.2256 (-1.52z)| lr 6.48e-05 | 4206.31 ms | 32.1% bf16 MFU | 124776 tok/s step 16151/19560 | loss 3.251574 (-1.16z)| norm 0.2440 (+0.50z)| lr 6.48e-05 | 4161.55 ms | 32.4% bf16 MFU | 124836 tok/s step 16152/19560 | loss 3.303610 (+0.27z)| norm 0.2556 (+1.75z)| lr 6.48e-05 | 4162.24 ms | 32.4% bf16 MFU | 124892 tok/s step 16153/19560 | loss 3.244682 (-1.33z)| norm 0.2376 (-0.24z)| lr 6.47e-05 | 4160.95 ms | 32.4% bf16 MFU | 124948 tok/s step 16154/19560 | loss 3.290512 (-0.08z)| norm 0.2228 (-1.85z)| 
lr 6.47e-05 | 4162.07 ms | 32.4% bf16 MFU | 124999 tok/s step 16155/19560 | loss 3.332082 (+1.04z)| norm 0.2497 (+1.09z)| lr 6.47e-05 | 4162.80 ms | 32.4% bf16 MFU | 125046 tok/s step 16156/19560 | loss 3.240556 (-1.43z)| norm 0.2303 (-1.05z)| lr 6.46e-05 | 4162.97 ms | 32.4% bf16 MFU | 125091 tok/s step 16157/19560 | loss 3.282716 (-0.28z)| norm 0.2382 (-0.17z)| lr 6.46e-05 | 4162.41 ms | 32.4% bf16 MFU | 125134 tok/s step 16158/19560 | loss 3.319769 (+0.72z)| norm 0.2401 (+0.04z)| lr 6.45e-05 | 4161.92 ms | 32.4% bf16 MFU | 125176 tok/s step 16159/19560 | loss 3.349523 (+1.51z)| norm 0.2352 (-0.49z)| lr 6.45e-05 | 4213.79 ms | 32.0% bf16 MFU | 125139 tok/s step 16160/19560 | loss 3.337821 (+1.18z)| norm 0.2357 (-0.43z)| lr 6.45e-05 | 4161.42 ms | 32.4% bf16 MFU | 125181 tok/s step 16161/19560 | loss 3.256273 (-1.02z)| norm 0.2337 (-0.66z)| lr 6.44e-05 | 4309.15 ms | 31.3% bf16 MFU | 125005 tok/s step 16162/19560 | loss 3.285265 (-0.25z)| norm 0.2293 (-1.12z)| lr 6.44e-05 | 4162.29 ms | 32.4% bf16 MFU | 125053 tok/s step 16163/19560 | loss 3.276519 (-0.48z)| norm 0.2290 (-1.14z)| lr 6.44e-05 | 4160.33 ms | 32.5% bf16 MFU | 125102 tok/s step 16164/19560 | loss 3.298329 (+0.12z)| norm 0.2408 (+0.16z)| lr 6.43e-05 | 4160.79 ms | 32.4% bf16 MFU | 125147 tok/s step 16165/19560 | loss 3.358961 (+1.75z)| norm 0.2285 (-1.17z)| lr 6.43e-05 | 4158.04 ms | 32.5% bf16 MFU | 125194 tok/s step 16166/19560 | loss 3.404181 (+2.93z)| norm 0.2375 (-0.20z)| lr 6.42e-05 | 4158.91 ms | 32.5% bf16 MFU | 125237 tok/s step 16167/19560 | loss 3.307175 (+0.31z)| norm 0.2308 (-0.94z)| lr 6.42e-05 | 4164.48 ms | 32.4% bf16 MFU | 125270 tok/s step 16168/19560 | loss 3.280246 (-0.43z)| norm 0.2374 (-0.21z)| lr 6.42e-05 | 4161.46 ms | 32.4% bf16 MFU | 125306 tok/s step 16169/19560 | loss 3.257600 (-1.04z)| norm 0.2332 (-0.69z)| lr 6.41e-05 | 4158.98 ms | 32.5% bf16 MFU | 125344 tok/s step 16170/19560 | loss 3.355511 (+1.59z)| norm 0.2441 (+0.52z)| lr 6.41e-05 | 4163.91 ms | 32.4% bf16 MFU | 
125372 tok/s step 16171/19560 | loss 3.296833 (-0.00z)| norm 0.2453 (+0.66z)| lr 6.41e-05 | 4163.34 ms | 32.4% bf16 MFU | 125400 tok/s step 16172/19560 | loss 3.328397 (+0.85z)| norm 0.2417 (+0.24z)| lr 6.40e-05 | 4163.94 ms | 32.4% bf16 MFU | 125426 tok/s step 16173/19560 | loss 3.223720 (-1.94z)| norm 0.2485 (+1.00z)| lr 6.40e-05 | 4159.21 ms | 32.5% bf16 MFU | 125457 tok/s step 16174/19560 | loss 3.303796 (+0.19z)| norm 0.2325 (-0.80z)| lr 6.40e-05 | 4163.17 ms | 32.4% bf16 MFU | 125481 tok/s step 16175/19560 | loss 3.331691 (+0.92z)| norm 0.2442 (+0.53z)| lr 6.39e-05 | 4163.63 ms | 32.4% bf16 MFU | 125503 tok/s step 16176/19560 | loss 3.276393 (-0.55z)| norm 0.2291 (-1.18z)| lr 6.39e-05 | 4159.72 ms | 32.5% bf16 MFU | 125530 tok/s step 16177/19560 | loss 3.274314 (-0.60z)| norm 0.2402 (+0.07z)| lr 6.38e-05 | 4158.96 ms | 32.5% bf16 MFU | 125557 tok/s step 16178/19560 | loss 3.261816 (-0.94z)| norm 0.2468 (+0.81z)| lr 6.38e-05 | 4162.98 ms | 32.4% bf16 MFU | 125576 tok/s step 16179/19560 | loss 3.290449 (-0.18z)| norm 0.2325 (-0.80z)| lr 6.38e-05 | 4163.87 ms | 32.4% bf16 MFU | 125593 tok/s step 16180/19560 | loss 3.306511 (+0.25z)| norm 0.2452 (+0.62z)| lr 6.37e-05 | 4162.26 ms | 32.4% bf16 MFU | 125611 tok/s step 16181/19560 | loss 3.280689 (-0.44z)| norm 0.2330 (-0.75z)| lr 6.37e-05 | 4150.38 ms | 32.5% bf16 MFU | 125647 tok/s step 16182/19560 | loss 3.407715 (+2.82z)| norm 0.2464 (+0.76z)| lr 6.37e-05 | 4174.87 ms | 32.3% bf16 MFU | 125643 tok/s step 16183/19560 | loss 3.284754 (-0.33z)| norm 0.2385 (-0.13z)| lr 6.36e-05 | 4163.16 ms | 32.4% bf16 MFU | 125658 tok/s step 16184/19560 | loss 3.244498 (-1.38z)| norm 0.2307 (-1.00z)| lr 6.36e-05 | 4165.99 ms | 32.4% bf16 MFU | 125668 tok/s step 16185/19560 | loss 3.286464 (-0.28z)| norm 0.2440 (+0.51z)| lr 6.36e-05 | 4161.26 ms | 32.4% bf16 MFU | 125684 tok/s step 16186/19560 | loss 3.257658 (-1.06z)| norm 0.2403 (+0.10z)| lr 6.35e-05 | 4160.81 ms | 32.4% bf16 MFU | 125700 tok/s step 16187/19560 | loss 3.279496 
(-0.48z)| norm 0.2280 (-1.29z)| lr 6.35e-05 | 4161.71 ms | 32.4% bf16 MFU | 125714 tok/s step 16188/19560 | loss 3.310570 (+0.34z)| norm 0.2269 (-1.39z)| lr 6.34e-05 | 4160.35 ms | 32.5% bf16 MFU | 125729 tok/s step 16189/19560 | loss 3.367996 (+1.83z)| norm 0.2430 (+0.44z)| lr 6.34e-05 | 4165.29 ms | 32.4% bf16 MFU | 125736 tok/s step 16190/19560 | loss 3.285904 (-0.36z)| norm 0.2358 (-0.37z)| lr 6.34e-05 | 4159.99 ms | 32.5% bf16 MFU | 125751 tok/s step 16191/19560 | loss 3.319702 (+0.54z)| norm 0.2298 (-1.07z)| lr 6.33e-05 | 4162.64 ms | 32.4% bf16 MFU | 125761 tok/s step 16192/19560 | loss 3.240712 (-1.57z)| norm 0.2315 (-0.86z)| lr 6.33e-05 | 4159.42 ms | 32.5% bf16 MFU | 125775 tok/s step 16193/19560 | loss 3.241535 (-1.52z)| norm 0.2206 (-2.08z)| lr 6.33e-05 | 4160.38 ms | 32.5% bf16 MFU | 125788 tok/s step 16194/19560 | loss 3.316164 (+0.46z)| norm 0.2193 (-2.17z)| lr 6.32e-05 | 4163.99 ms | 32.4% bf16 MFU | 125794 tok/s step 16195/19560 | loss 3.331483 (+0.86z)| norm 0.2323 (-0.70z)| lr 6.32e-05 | 4159.56 ms | 32.5% bf16 MFU | 125806 tok/s step 16196/19560 | loss 3.354637 (+1.45z)| norm 0.2339 (-0.52z)| lr 6.31e-05 | 4159.64 ms | 32.5% bf16 MFU | 125818 tok/s step 16197/19560 | loss 3.262783 (-0.95z)| norm 0.2203 (-2.00z)| lr 6.31e-05 | 4157.87 ms | 32.5% bf16 MFU | 125832 tok/s step 16198/19560 | loss 3.298559 (-0.00z)| norm 0.2376 (-0.09z)| lr 6.31e-05 | 4158.77 ms | 32.5% bf16 MFU | 125844 tok/s step 16199/19560 | loss 3.254630 (-1.15z)| norm 0.2290 (-1.03z)| lr 6.30e-05 | 4163.56 ms | 32.4% bf16 MFU | 125848 tok/s step 16200/19560 | loss 3.273349 (-0.66z)| norm 0.2385 (+0.02z)| lr 6.30e-05 | 4161.84 ms | 32.4% bf16 MFU | 125854 tok/s step 16201/19560 | loss 3.348037 (+1.29z)| norm 0.2312 (-0.78z)| lr 6.30e-05 | 4161.40 ms | 32.4% bf16 MFU | 125861 tok/s step 16202/19560 | loss 3.223540 (-1.93z)| norm 0.2261 (-1.34z)| lr 6.29e-05 | 4155.87 ms | 32.5% bf16 MFU | 125875 tok/s step 16203/19560 | loss 3.278954 (-0.50z)| norm 0.2295 (-0.94z)| lr 6.29e-05 | 
4161.02 ms | 32.4% bf16 MFU | 125882 tok/s step 16204/19560 | loss 3.261045 (-0.95z)| norm 0.2421 (+0.49z)| lr 6.29e-05 | 4185.26 ms | 32.3% bf16 MFU | 125851 tok/s step 16205/19560 | loss 3.346860 (+1.24z)| norm 0.2432 (+0.61z)| lr 6.28e-05 | 4159.31 ms | 32.5% bf16 MFU | 125861 tok/s step 16206/19560 | loss 3.261522 (-0.94z)| norm 0.2286 (-1.02z)| lr 6.28e-05 | 4161.64 ms | 32.4% bf16 MFU | 125867 tok/s step 16207/19560 | loss 3.293832 (-0.09z)| norm 0.2268 (-1.21z)| lr 6.27e-05 | 4158.12 ms | 32.5% bf16 MFU | 125878 tok/s step 16208/19560 | loss 3.249245 (-1.24z)| norm 0.2361 (-0.16z)| lr 6.27e-05 | 4159.73 ms | 32.5% bf16 MFU | 125886 tok/s step 16209/19560 | loss 3.379134 (+2.08z)| norm 0.2414 (+0.45z)| lr 6.27e-05 | 4160.89 ms | 32.4% bf16 MFU | 125892 tok/s step 16210/19560 | loss 3.291485 (-0.15z)| norm 0.2312 (-0.70z)| lr 6.26e-05 | 4160.58 ms | 32.5% bf16 MFU | 125898 tok/s step 16211/19560 | loss 3.324750 (+0.72z)| norm 0.2383 (+0.11z)| lr 6.26e-05 | 4159.98 ms | 32.5% bf16 MFU | 125905 tok/s step 16212/19560 | loss 3.262792 (-0.88z)| norm 0.2244 (-1.45z)| lr 6.26e-05 | 4159.95 ms | 32.5% bf16 MFU | 125911 tok/s step 16213/19560 | loss 3.301831 (+0.13z)| norm 0.2362 (-0.10z)| lr 6.25e-05 | 4164.30 ms | 32.4% bf16 MFU | 125911 tok/s step 16214/19560 | loss 3.226579 (-1.80z)| norm 0.2449 (+0.87z)| lr 6.25e-05 | 4156.23 ms | 32.5% bf16 MFU | 125922 tok/s step 16215/19560 | loss 3.338439 (+1.06z)| norm 0.2349 (-0.26z)| lr 6.25e-05 | 4161.90 ms | 32.4% bf16 MFU | 125925 tok/s step 16216/19560 | loss 3.322887 (+0.66z)| norm 0.2382 (+0.11z)| lr 6.24e-05 | 4160.32 ms | 32.5% bf16 MFU | 125930 tok/s step 16217/19560 | loss 3.353283 (+1.41z)| norm 0.2483 (+1.25z)| lr 6.24e-05 | 4159.12 ms | 32.5% bf16 MFU | 125936 tok/s step 16218/19560 | loss 3.243765 (-1.35z)| norm 0.2450 (+0.86z)| lr 6.23e-05 | 4162.85 ms | 32.4% bf16 MFU | 125937 tok/s step 16219/19560 | loss 3.336700 (+0.99z)| norm 0.2611 (+2.75z)| lr 6.23e-05 | 4161.59 ms | 32.4% bf16 MFU | 125939 tok/s step 
16220/19560 | loss 3.253443 (-1.09z)| norm 0.2270 (-1.17z)| lr 6.23e-05 | 4158.94 ms | 32.5% bf16 MFU | 125945 tok/s step 16221/19560 | loss 3.268030 (-0.71z)| norm 0.2331 (-0.46z)| lr 6.22e-05 | 4159.58 ms | 32.5% bf16 MFU | 125950 tok/s step 16222/19560 | loss 3.290854 (-0.14z)| norm 0.2377 (+0.08z)| lr 6.22e-05 | 4160.46 ms | 32.5% bf16 MFU | 125953 tok/s step 16223/19560 | loss 3.312061 (+0.38z)| norm 0.2478 (+1.27z)| lr 6.22e-05 | 4155.56 ms | 32.5% bf16 MFU | 125964 tok/s step 16224/19560 | loss 3.364421 (+1.67z)| norm 0.2337 (-0.40z)| lr 6.21e-05 | 4159.38 ms | 32.5% bf16 MFU | 125968 tok/s step 16225/19560 | loss 3.398915 (+2.44z)| norm 0.2351 (-0.22z)| lr 6.21e-05 | 4159.18 ms | 32.5% bf16 MFU | 125973 tok/s step 16226/19560 | loss 3.373882 (+1.80z)| norm 0.2313 (-0.66z)| lr 6.21e-05 | 4156.29 ms | 32.5% bf16 MFU | 125981 tok/s step 16227/19560 | loss 3.250168 (-1.15z)| norm 0.2464 (+1.12z)| lr 6.20e-05 | 4157.28 ms | 32.5% bf16 MFU | 125988 tok/s step 16228/19560 | loss 3.269778 (-0.68z)| norm 0.2176 (-2.24z)| lr 6.20e-05 | 4158.49 ms | 32.5% bf16 MFU | 125992 tok/s step 16229/19560 | loss 3.339818 (+0.99z)| norm 0.2273 (-1.09z)| lr 6.19e-05 | 4158.89 ms | 32.5% bf16 MFU | 125996 tok/s step 16230/19560 | loss 3.265413 (-0.78z)| norm 0.2325 (-0.49z)| lr 6.19e-05 | 4157.91 ms | 32.5% bf16 MFU | 126001 tok/s step 16231/19560 | loss 3.263291 (-0.82z)| norm 0.2338 (-0.34z)| lr 6.19e-05 | 4160.01 ms | 32.5% bf16 MFU | 126002 tok/s step 16232/19560 | loss 3.331056 (+0.78z)| norm 0.2140 (-2.58z)| lr 6.18e-05 | 4216.27 ms | 32.0% bf16 MFU | 125919 tok/s step 16233/19560 | loss 3.282671 (-0.37z)| norm 0.2315 (-0.56z)| lr 6.18e-05 | 4161.34 ms | 32.4% bf16 MFU | 125923 tok/s step 16234/19560 | loss 3.340034 (+0.98z)| norm 0.2201 (-1.84z)| lr 6.18e-05 | 4161.96 ms | 32.4% bf16 MFU | 125925 tok/s step 16235/19560 | loss 3.273736 (-0.59z)| norm 0.2308 (-0.63z)| lr 6.17e-05 | 4159.76 ms | 32.5% bf16 MFU | 125931 tok/s step 16236/19560 | loss 3.339951 (+0.96z)| norm 
0.2315 (-0.54z)| lr 6.17e-05 | 4163.09 ms | 32.4% bf16 MFU | 125931 tok/s step 16237/19560 | loss 3.269567 (-0.70z)| norm 0.2262 (-1.13z)| lr 6.17e-05 | 4162.31 ms | 32.4% bf16 MFU | 125933 tok/s step 16238/19560 | loss 3.256431 (-1.02z)| norm 0.2288 (-0.82z)| lr 6.16e-05 | 4162.34 ms | 32.4% bf16 MFU | 125934 tok/s step 16239/19560 | loss 3.372503 (+1.70z)| norm 0.2416 (+0.64z)| lr 6.16e-05 | 4158.89 ms | 32.5% bf16 MFU | 125941 tok/s step 16240/19560 | loss 3.284863 (-0.34z)| norm 0.2368 (+0.10z)| lr 6.16e-05 | 4160.04 ms | 32.5% bf16 MFU | 125945 tok/s step 16241/19560 | loss 3.327153 (+0.64z)| norm 0.2263 (-1.10z)| lr 6.15e-05 | 4160.57 ms | 32.5% bf16 MFU | 125949 tok/s step 16242/19560 | loss 3.326825 (+0.63z)| norm 0.2339 (-0.22z)| lr 6.15e-05 | 4156.25 ms | 32.5% bf16 MFU | 125958 tok/s step 16243/19560 | loss 3.324091 (+0.55z)| norm 0.2358 (+0.02z)| lr 6.14e-05 | 4162.76 ms | 32.4% bf16 MFU | 125958 tok/s step 16244/19560 | loss 3.273998 (-0.63z)| norm 0.2288 (-0.78z)| lr 6.14e-05 | 4161.27 ms | 32.4% bf16 MFU | 125960 tok/s step 16245/19560 | loss 3.372143 (+1.65z)| norm 0.2377 (+0.24z)| lr 6.14e-05 | 4157.33 ms | 32.5% bf16 MFU | 125967 tok/s step 16246/19560 | loss 3.284224 (-0.40z)| norm 0.2453 (+1.12z)| lr 6.13e-05 | 4157.44 ms | 32.5% bf16 MFU | 125974 tok/s step 16247/19560 | loss 3.225701 (-1.76z)| norm 0.2396 (+0.45z)| lr 6.13e-05 | 4159.00 ms | 32.5% bf16 MFU | 125979 tok/s step 16248/19560 | loss 3.240014 (-1.40z)| norm 0.2279 (-0.89z)| lr 6.13e-05 | 4159.99 ms | 32.5% bf16 MFU | 125981 tok/s step 16249/19560 | loss 3.342336 (+0.95z)| norm 0.2389 (+0.38z)| lr 6.12e-05 | 4156.58 ms | 32.5% bf16 MFU | 125989 tok/s step 16250/19560 | loss 3.321490 (+0.46z)| norm 0.2270 (-0.99z)| lr 6.12e-05 | 4157.45 ms | 32.5% bf16 MFU | 125995 tok/s val loss 3.267503 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating 
HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 
evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3026/10042 = 0.301334 step 16251/19560 | loss 3.298766 (-0.06z)| norm 0.2515 (+1.79z)| lr 6.12e-05 | 4162.63 ms | 32.4% bf16 MFU | 125993 tok/s step 16252/19560 | loss 3.272073 (-0.68z)| norm 0.2378 (+0.22z)| lr 6.11e-05 | 4156.56 ms | 32.5% bf16 MFU | 
126000 tok/s step 16253/19560 | loss 3.317335 (+0.35z)| norm 0.2337 (-0.24z)| lr 6.11e-05 | 4158.65 ms | 32.5% bf16 MFU | 126003 tok/s step 16254/19560 | loss 3.256118 (-1.05z)| norm 0.2282 (-0.87z)| lr 6.10e-05 | 4160.40 ms | 32.5% bf16 MFU | 126004 tok/s step 16255/19560 | loss 3.263169 (-0.88z)| norm 0.2348 (-0.08z)| lr 6.10e-05 | 4157.18 ms | 32.5% bf16 MFU | 126010 tok/s step 16256/19560 | loss 3.399022 (+2.21z)| norm 0.2314 (-0.48z)| lr 6.10e-05 | 4156.40 ms | 32.5% bf16 MFU | 126016 tok/s step 16257/19560 | loss 3.292440 (-0.21z)| norm 0.2311 (-0.51z)| lr 6.09e-05 | 4157.45 ms | 32.5% bf16 MFU | 126021 tok/s step 16258/19560 | loss 3.308631 (+0.18z)| norm 0.2324 (-0.34z)| lr 6.09e-05 | 4155.67 ms | 32.5% bf16 MFU | 126028 tok/s step 16259/19560 | loss 3.233074 (-1.55z)| norm 0.2256 (-1.16z)| lr 6.09e-05 | 4156.93 ms | 32.5% bf16 MFU | 126033 tok/s step 16260/19560 | loss 3.287081 (-0.31z)| norm 0.2437 (+1.04z)| lr 6.08e-05 | 4158.11 ms | 32.5% bf16 MFU | 126035 tok/s step 16261/19560 | loss 3.296744 (-0.09z)| norm 0.2303 (-0.58z)| lr 6.08e-05 | 4158.34 ms | 32.5% bf16 MFU | 126038 tok/s step 16262/19560 | loss 3.318494 (+0.42z)| norm 0.2306 (-0.54z)| lr 6.08e-05 | 4157.98 ms | 32.5% bf16 MFU | 126040 tok/s step 16263/19560 | loss 3.281794 (-0.44z)| norm 0.2243 (-1.29z)| lr 6.07e-05 | 4158.88 ms | 32.5% bf16 MFU | 126042 tok/s step 16264/19560 | loss 3.333842 (+0.77z)| norm 0.2646 (+3.38z)| lr 6.07e-05 | 4160.36 ms | 32.5% bf16 MFU | 126041 tok/s step 16265/19560 | loss 3.249174 (-1.19z)| norm 0.2381 (+0.33z)| lr 6.07e-05 | 4158.16 ms | 32.5% bf16 MFU | 126043 tok/s step 16266/19560 | loss 3.305828 (+0.11z)| norm 0.2475 (+1.40z)| lr 6.06e-05 | 4160.03 ms | 32.5% bf16 MFU | 126042 tok/s step 16267/19560 | loss 3.278831 (-0.51z)| norm 0.2430 (+0.87z)| lr 6.06e-05 | 4157.33 ms | 32.5% bf16 MFU | 126046 tok/s step 16268/19560 | loss 3.297417 (-0.08z)| norm 0.2559 (+2.33z)| lr 6.05e-05 | 4156.78 ms | 32.5% bf16 MFU | 126050 tok/s step 16269/19560 | loss 3.261879 
(-0.90z)| norm 0.2286 (-0.80z)| lr 6.05e-05 | 4159.16 ms | 32.5% bf16 MFU | 126050 tok/s step 16270/19560 | loss 3.319954 (+0.44z)| norm 0.2450 (+1.07z)| lr 6.05e-05 | 4159.66 ms | 32.5% bf16 MFU | 126050 tok/s step 16271/19560 | loss 3.342308 (+0.96z)| norm 0.2411 (+0.63z)| lr 6.04e-05 | 4155.28 ms | 32.5% bf16 MFU | 126056 tok/s step 16272/19560 | loss 3.317930 (+0.40z)| norm 0.2500 (+1.61z)| lr 6.04e-05 | 4156.89 ms | 32.5% bf16 MFU | 126059 tok/s step 16273/19560 | loss 3.311158 (+0.24z)| norm 0.2308 (-0.57z)| lr 6.04e-05 | 4156.27 ms | 32.5% bf16 MFU | 126064 tok/s step 16274/19560 | loss 3.355504 (+1.31z)| norm 0.2365 (+0.09z)| lr 6.03e-05 | 4156.75 ms | 32.5% bf16 MFU | 126067 tok/s step 16275/19560 | loss 3.326623 (+0.62z)| norm 0.2463 (+1.22z)| lr 6.03e-05 | 4154.48 ms | 32.5% bf16 MFU | 126073 tok/s step 16276/19560 | loss 3.325098 (+0.59z)| norm 0.2309 (-0.55z)| lr 6.03e-05 | 4157.95 ms | 32.5% bf16 MFU | 126074 tok/s step 16277/19560 | loss 3.285403 (-0.35z)| norm 0.2268 (-1.02z)| lr 6.02e-05 | 4156.38 ms | 32.5% bf16 MFU | 126078 tok/s step 16278/19560 | loss 3.234735 (-1.53z)| norm 0.2360 (+0.04z)| lr 6.02e-05 | 4158.86 ms | 32.5% bf16 MFU | 126077 tok/s step 16279/19560 | loss 3.274686 (-0.59z)| norm 0.2314 (-0.48z)| lr 6.02e-05 | 4157.42 ms | 32.5% bf16 MFU | 126079 tok/s step 16280/19560 | loss 3.275003 (-0.57z)| norm 0.2194 (-1.86z)| lr 6.01e-05 | 4158.93 ms | 32.5% bf16 MFU | 126078 tok/s step 16281/19560 | loss 3.268732 (-0.73z)| norm 0.2248 (-1.21z)| lr 6.01e-05 | 4162.34 ms | 32.4% bf16 MFU | 126072 tok/s step 16282/19560 | loss 3.331545 (+0.77z)| norm 0.2264 (-1.03z)| lr 6.00e-05 | 4158.26 ms | 32.5% bf16 MFU | 126073 tok/s step 16283/19560 | loss 3.273498 (-0.61z)| norm 0.2346 (-0.06z)| lr 6.00e-05 | 4158.67 ms | 32.5% bf16 MFU | 126072 tok/s step 16284/19560 | loss 3.292975 (-0.16z)| norm 0.2312 (-0.46z)| lr 6.00e-05 | 4158.66 ms | 32.5% bf16 MFU | 126072 tok/s step 16285/19560 | loss 3.255663 (-1.05z)| norm 0.2257 (-1.10z)| lr 5.99e-05 | 
4159.05 ms | 32.5% bf16 MFU | 126072 tok/s step 16286/19560 | loss 3.335453 (+0.87z)| norm 0.2317 (-0.39z)| lr 5.99e-05 | 4153.85 ms | 32.5% bf16 MFU | 126079 tok/s step 16287/19560 | loss 3.318393 (+0.46z)| norm 0.2307 (-0.49z)| lr 5.99e-05 | 4157.08 ms | 32.5% bf16 MFU | 126081 tok/s step 16288/19560 | loss 3.314176 (+0.37z)| norm 0.2153 (-2.25z)| lr 5.98e-05 | 4158.68 ms | 32.5% bf16 MFU | 126081 tok/s step 16289/19560 | loss 3.267860 (-0.76z)| norm 0.2268 (-0.90z)| lr 5.98e-05 | 4154.50 ms | 32.5% bf16 MFU | 126086 tok/s step 16290/19560 | loss 3.297372 (-0.04z)| norm 0.2174 (-1.95z)| lr 5.98e-05 | 4153.91 ms | 32.5% bf16 MFU | 126093 tok/s step 16291/19560 | loss 3.251911 (-1.14z)| norm 0.2274 (-0.82z)| lr 5.97e-05 | 4154.83 ms | 32.5% bf16 MFU | 126098 tok/s step 16292/19560 | loss 3.320698 (+0.52z)| norm 0.2274 (-0.80z)| lr 5.97e-05 | 4158.63 ms | 32.5% bf16 MFU | 126096 tok/s step 16293/19560 | loss 3.286033 (-0.30z)| norm 0.2266 (-0.89z)| lr 5.97e-05 | 4161.50 ms | 32.4% bf16 MFU | 126091 tok/s step 16294/19560 | loss 3.291228 (-0.16z)| norm 0.2299 (-0.51z)| lr 5.96e-05 | 4156.82 ms | 32.5% bf16 MFU | 126093 tok/s step 16295/19560 | loss 3.301615 (+0.10z)| norm 0.2488 (+1.59z)| lr 5.96e-05 | 4156.90 ms | 32.5% bf16 MFU | 126094 tok/s step 16296/19560 | loss 3.322050 (+0.60z)| norm 0.2373 (+0.31z)| lr 5.95e-05 | 4155.34 ms | 32.5% bf16 MFU | 126098 tok/s step 16297/19560 | loss 3.298119 (-0.00z)| norm 0.2317 (-0.32z)| lr 5.95e-05 | 4155.96 ms | 32.5% bf16 MFU | 126101 tok/s step 16298/19560 | loss 3.263685 (-0.85z)| norm 0.2339 (-0.07z)| lr 5.95e-05 | 4154.98 ms | 32.5% bf16 MFU | 126105 tok/s step 16299/19560 | loss 3.332709 (+0.88z)| norm 0.2441 (+1.08z)| lr 5.94e-05 | 4153.72 ms | 32.5% bf16 MFU | 126111 tok/s step 16300/19560 | loss 3.265544 (-0.80z)| norm 0.2266 (-0.87z)| lr 5.94e-05 | 4156.51 ms | 32.5% bf16 MFU | 126112 tok/s step 16301/19560 | loss 3.226439 (-1.78z)| norm 0.2331 (-0.13z)| lr 5.94e-05 | 4152.00 ms | 32.5% bf16 MFU | 126120 tok/s step 
16302/19560 | loss 3.271824 (-0.64z)| norm 0.2333 (-0.11z)| lr 5.93e-05 | 4152.64 ms | 32.5% bf16 MFU | 126127 tok/s step 16303/19560 | loss 3.244468 (-1.30z)| norm 0.2476 (+1.51z)| lr 5.93e-05 | 4156.28 ms | 32.5% bf16 MFU | 126128 tok/s step 16304/19560 | loss 3.249916 (-1.15z)| norm 0.2426 (+0.93z)| lr 5.93e-05 | 4155.00 ms | 32.5% bf16 MFU | 126130 tok/s step 16305/19560 | loss 3.339974 (+1.07z)| norm 0.2301 (-0.47z)| lr 5.92e-05 | 4154.89 ms | 32.5% bf16 MFU | 126133 tok/s step 16306/19560 | loss 3.275763 (-0.52z)| norm 0.2228 (-1.27z)| lr 5.92e-05 | 4153.59 ms | 32.5% bf16 MFU | 126138 tok/s step 16307/19560 | loss 3.299427 (+0.06z)| norm 0.2300 (-0.46z)| lr 5.92e-05 | 4154.89 ms | 32.5% bf16 MFU | 126140 tok/s step 16308/19560 | loss 3.448205 (+3.55z)| norm 0.2426 (+0.96z)| lr 5.91e-05 | 4299.55 ms | 31.4% bf16 MFU | 125930 tok/s step 16309/19560 | loss 3.391289 (+2.14z)| norm 0.2534 (+2.14z)| lr 5.91e-05 | 4304.38 ms | 31.4% bf16 MFU | 125724 tok/s step 16310/19560 | loss 3.237235 (-1.43z)| norm 0.2443 (+1.13z)| lr 5.91e-05 | 4330.77 ms | 31.2% bf16 MFU | 125491 tok/s step 16311/19560 | loss 3.253876 (-1.02z)| norm 0.2287 (-0.60z)| lr 5.90e-05 | 4373.39 ms | 30.9% bf16 MFU | 125210 tok/s step 16312/19560 | loss 3.298699 (+0.02z)| norm 0.2182 (-1.74z)| lr 5.90e-05 | 4274.42 ms | 31.6% bf16 MFU | 125083 tok/s step 16313/19560 | loss 3.292220 (-0.13z)| norm 0.2393 (+0.59z)| lr 5.89e-05 | 4149.92 ms | 32.5% bf16 MFU | 125145 tok/s step 16314/19560 | loss 3.191103 (-2.46z)| norm 0.2368 (+0.32z)| lr 5.89e-05 | 4297.60 ms | 31.4% bf16 MFU | 124988 tok/s step 16315/19560 | loss 3.242935 (-1.25z)| norm 0.2419 (+0.86z)| lr 5.89e-05 | 4615.79 ms | 29.3% bf16 MFU | 124418 tok/s step 16316/19560 | loss 3.267452 (-0.67z)| norm 0.2287 (-0.60z)| lr 5.88e-05 | 4653.07 ms | 29.0% bf16 MFU | 123831 tok/s step 16317/19560 | loss 3.295985 (-0.00z)| norm 0.2303 (-0.41z)| lr 5.88e-05 | 4690.88 ms | 28.8% bf16 MFU | 123227 tok/s step 16318/19560 | loss 3.273252 (-0.53z)| norm 
0.2366 (+0.29z)| lr 5.88e-05 | 4304.81 ms | 31.4% bf16 MFU | 123156 tok/s step 16319/19560 | loss 3.356107 (+1.38z)| norm 0.2379 (+0.43z)| lr 5.87e-05 | 4647.82 ms | 29.0% bf16 MFU | 122638 tok/s step 16320/19560 | loss 3.377206 (+1.83z)| norm 0.2334 (-0.07z)| lr 5.87e-05 | 4224.71 ms | 32.0% bf16 MFU | 122711 tok/s step 16321/19560 | loss 3.286943 (-0.25z)| norm 0.2265 (-0.86z)| lr 5.87e-05 | 4235.76 ms | 31.9% bf16 MFU | 122764 tok/s step 16322/19560 | loss 3.246616 (-1.16z)| norm 0.2310 (-0.36z)| lr 5.86e-05 | 4165.54 ms | 32.4% bf16 MFU | 122919 tok/s step 16323/19560 | loss 3.423046 (+2.80z)| norm 0.2298 (-0.49z)| lr 5.86e-05 | 4163.69 ms | 32.4% bf16 MFU | 123069 tok/s step 16324/19560 | loss 3.381923 (+1.86z)| norm 0.2304 (-0.42z)| lr 5.86e-05 | 4283.88 ms | 31.5% bf16 MFU | 123035 tok/s step 16325/19560 | loss 3.399806 (+2.19z)| norm 0.2565 (+2.45z)| lr 5.85e-05 | 4256.11 ms | 31.7% bf16 MFU | 123043 tok/s step 16326/19560 | loss 3.349812 (+1.09z)| norm 0.2245 (-1.09z)| lr 5.85e-05 | 4210.48 ms | 32.1% bf16 MFU | 123116 tok/s step 16327/19560 | loss 3.282079 (-0.38z)| norm 0.2446 (+1.12z)| lr 5.84e-05 | 4186.06 ms | 32.3% bf16 MFU | 123223 tok/s step 16328/19560 | loss 3.319221 (+0.42z)| norm 0.2351 (+0.07z)| lr 5.84e-05 | 4170.16 ms | 32.4% bf16 MFU | 123348 tok/s step 16329/19560 | loss 3.343415 (+0.94z)| norm 0.2195 (-1.62z)| lr 5.84e-05 | 4158.64 ms | 32.5% bf16 MFU | 123484 tok/s step 16330/19560 | loss 3.279096 (-0.47z)| norm 0.2344 (-0.01z)| lr 5.83e-05 | 4255.71 ms | 31.7% bf16 MFU | 123470 tok/s step 16331/19560 | loss 3.262899 (-0.82z)| norm 0.2317 (-0.30z)| lr 5.83e-05 | 4162.67 ms | 32.4% bf16 MFU | 123594 tok/s step 16332/19560 | loss 3.351519 (+1.11z)| norm 0.2376 (+0.35z)| lr 5.83e-05 | 4162.32 ms | 32.4% bf16 MFU | 123712 tok/s step 16333/19560 | loss 3.304677 (+0.09z)| norm 0.2305 (-0.42z)| lr 5.82e-05 | 4164.69 ms | 32.4% bf16 MFU | 123821 tok/s step 16334/19560 | loss 3.330652 (+0.65z)| norm 0.2433 (+0.98z)| lr 5.82e-05 | 4168.39 ms | 
32.4% bf16 MFU | 123919 tok/s step 16335/19560 | loss 3.288862 (-0.27z)| norm 0.2441 (+1.05z)| lr 5.82e-05 | 4165.53 ms | 32.4% bf16 MFU | 124016 tok/s step 16336/19560 | loss 3.319695 (+0.40z)| norm 0.2370 (+0.27z)| lr 5.81e-05 | 4787.91 ms | 28.2% bf16 MFU | 123290 tok/s step 16337/19560 | loss 3.272329 (-0.64z)| norm 0.2448 (+1.12z)| lr 5.81e-05 | 4227.25 ms | 31.9% bf16 MFU | 123327 tok/s step 16338/19560 | loss 3.312341 (+0.25z)| norm 0.2309 (-0.40z)| lr 5.81e-05 | 4162.26 ms | 32.4% bf16 MFU | 123459 tok/s step 16339/19560 | loss 3.332104 (+0.69z)| norm 0.2315 (-0.33z)| lr 5.80e-05 | 4267.33 ms | 31.6% bf16 MFU | 123429 tok/s step 16340/19560 | loss 3.312073 (+0.23z)| norm 0.2234 (-1.22z)| lr 5.80e-05 | 4198.11 ms | 32.2% bf16 MFU | 123502 tok/s step 16341/19560 | loss 3.297992 (-0.08z)| norm 0.2179 (-1.78z)| lr 5.80e-05 | 4260.72 ms | 31.7% bf16 MFU | 123479 tok/s step 16342/19560 | loss 3.288677 (-0.30z)| norm 0.2170 (-1.84z)| lr 5.79e-05 | 4154.47 ms | 32.5% bf16 MFU | 123615 tok/s step 16343/19560 | loss 3.387923 (+1.91z)| norm 0.2438 (+1.02z)| lr 5.79e-05 | 4183.25 ms | 32.3% bf16 MFU | 123701 tok/s step 16344/19560 | loss 3.327669 (+0.56z)| norm 0.2292 (-0.53z)| lr 5.79e-05 | 4168.06 ms | 32.4% bf16 MFU | 123805 tok/s step 16345/19560 | loss 3.329934 (+0.62z)| norm 0.2236 (-1.11z)| lr 5.78e-05 | 4173.12 ms | 32.4% bf16 MFU | 123897 tok/s step 16346/19560 | loss 3.300228 (-0.06z)| norm 0.2475 (+1.44z)| lr 5.78e-05 | 4165.45 ms | 32.4% bf16 MFU | 123995 tok/s step 16347/19560 | loss 3.280177 (-0.50z)| norm 0.2335 (-0.03z)| lr 5.77e-05 | 4169.20 ms | 32.4% bf16 MFU | 124083 tok/s step 16348/19560 | loss 3.382398 (+1.78z)| norm 0.2369 (+0.34z)| lr 5.77e-05 | 4173.48 ms | 32.4% bf16 MFU | 124160 tok/s step 16349/19560 | loss 3.327340 (+0.53z)| norm 0.2558 (+2.36z)| lr 5.77e-05 | 4173.14 ms | 32.4% bf16 MFU | 124234 tok/s step 16350/19560 | loss 3.244204 (-1.32z)| norm 0.2288 (-0.56z)| lr 5.76e-05 | 4164.73 ms | 32.4% bf16 MFU | 124317 tok/s step 16351/19560 
| loss 3.369939 (+1.46z)| norm 0.2472 (+1.43z)| lr 5.76e-05 | 4164.59 ms | 32.4% bf16 MFU | 124395 tok/s step 16352/19560 | loss 3.226042 (-1.69z)| norm 0.2546 (+2.18z)| lr 5.76e-05 | 4166.93 ms | 32.4% bf16 MFU | 124467 tok/s step 16353/19560 | loss 3.347147 (+1.00z)| norm 0.2364 (+0.24z)| lr 5.75e-05 | 4217.40 ms | 32.0% bf16 MFU | 124459 tok/s step 16354/19560 | loss 3.300160 (-0.04z)| norm 0.2403 (+0.65z)| lr 5.75e-05 | 4168.61 ms | 32.4% bf16 MFU | 124525 tok/s step 16355/19560 | loss 3.288743 (-0.30z)| norm 0.2357 (+0.17z)| lr 5.75e-05 | 4160.10 ms | 32.5% bf16 MFU | 124600 tok/s step 16356/19560 | loss 3.339062 (+0.83z)| norm 0.2284 (-0.63z)| lr 5.74e-05 | 4163.81 ms | 32.4% bf16 MFU | 124666 tok/s step 16357/19560 | loss 3.305584 (+0.07z)| norm 0.2398 (+0.60z)| lr 5.74e-05 | 4167.22 ms | 32.4% bf16 MFU | 124723 tok/s step 16358/19560 | loss 3.292783 (-0.22z)| norm 0.2325 (-0.20z)| lr 5.74e-05 | 4169.82 ms | 32.4% bf16 MFU | 124773 tok/s step 16359/19560 | loss 3.353567 (+1.15z)| norm 0.2461 (+1.26z)| lr 5.73e-05 | 4163.08 ms | 32.4% bf16 MFU | 124832 tok/s step 16360/19560 | loss 3.311600 (+0.19z)| norm 0.2364 (+0.20z)| lr 5.73e-05 | 4159.93 ms | 32.5% bf16 MFU | 124892 tok/s step 16361/19560 | loss 3.307988 (+0.11z)| norm 0.2264 (-0.89z)| lr 5.73e-05 | 4164.94 ms | 32.4% bf16 MFU | 124941 tok/s step 16362/19560 | loss 3.278076 (-0.57z)| norm 0.2262 (-0.92z)| lr 5.72e-05 | 4217.52 ms | 32.0% bf16 MFU | 124910 tok/s step 16363/19560 | loss 3.319215 (+0.37z)| norm 0.2400 (+0.59z)| lr 5.72e-05 | 4165.90 ms | 32.4% bf16 MFU | 124957 tok/s step 16364/19560 | loss 3.324746 (+0.50z)| norm 0.2244 (-1.12z)| lr 5.72e-05 | 4164.50 ms | 32.4% bf16 MFU | 125004 tok/s step 16365/19560 | loss 3.283786 (-0.45z)| norm 0.2299 (-0.52z)| lr 5.71e-05 | 4166.46 ms | 32.4% bf16 MFU | 125045 tok/s step 16366/19560 | loss 3.272089 (-0.72z)| norm 0.2374 (+0.30z)| lr 5.71e-05 | 4165.67 ms | 32.4% bf16 MFU | 125086 tok/s step 16367/19560 | loss 3.326632 (+0.55z)| norm 0.2307 (-0.42z)| 
lr 5.70e-05 | 4164.21 ms | 32.4% bf16 MFU | 125127 tok/s step 16368/19560 | loss 3.330085 (+0.62z)| norm 0.2227 (-1.29z)| lr 5.70e-05 | 4164.98 ms | 32.4% bf16 MFU | 125165 tok/s step 16369/19560 | loss 3.371606 (+1.56z)| norm 0.2514 (+1.81z)| lr 5.70e-05 | 4162.18 ms | 32.4% bf16 MFU | 125205 tok/s step 16370/19560 | loss 3.338690 (+0.80z)| norm 0.2205 (-1.51z)| lr 5.69e-05 | 4193.03 ms | 32.2% bf16 MFU | 125196 tok/s step 16371/19560 | loss 3.340068 (+0.83z)| norm 0.2264 (-0.87z)| lr 5.69e-05 | 4178.26 ms | 32.3% bf16 MFU | 125210 tok/s step 16372/19560 | loss 3.338666 (+0.79z)| norm 0.2222 (-1.31z)| lr 5.69e-05 | 4197.58 ms | 32.2% bf16 MFU | 125195 tok/s step 16373/19560 | loss 3.371288 (+1.54z)| norm 0.2438 (+0.98z)| lr 5.68e-05 | 4166.13 ms | 32.4% bf16 MFU | 125228 tok/s step 16374/19560 | loss 3.404647 (+2.24z)| norm 0.3371 (+7.82z)| lr 5.68e-05 | 4207.49 ms | 32.1% bf16 MFU | 125197 tok/s step 16375/19560 | loss 3.304820 (-0.03z)| norm 0.2298 (-0.41z)| lr 5.68e-05 | 4172.40 ms | 32.4% bf16 MFU | 125220 tok/s step 16376/19560 | loss 3.303747 (-0.06z)| norm 0.2403 (+0.39z)| lr 5.67e-05 | 4165.52 ms | 32.4% bf16 MFU | 125252 tok/s step 16377/19560 | loss 3.322995 (+0.39z)| norm 0.2391 (+0.30z)| lr 5.67e-05 | 4171.35 ms | 32.4% bf16 MFU | 125274 tok/s step 16378/19560 | loss 3.439674 (+2.95z)| norm 0.2543 (+1.44z)| lr 5.67e-05 | 4171.65 ms | 32.4% bf16 MFU | 125294 tok/s step 16379/19560 | loss 3.266677 (-0.89z)| norm 0.2721 (+2.72z)| lr 5.66e-05 | 4166.57 ms | 32.4% bf16 MFU | 125321 tok/s step 16380/19560 | loss 3.309068 (+0.04z)| norm 0.2395 (+0.29z)| lr 5.66e-05 | 4165.93 ms | 32.4% bf16 MFU | 125347 tok/s step 16381/19560 | loss 3.404228 (+2.11z)| norm 0.2441 (+0.63z)| lr 5.66e-05 | 4208.53 ms | 32.1% bf16 MFU | 125309 tok/s step 16382/19560 | loss 3.299126 (-0.20z)| norm 0.2492 (+0.99z)| lr 5.65e-05 | 4182.63 ms | 32.3% bf16 MFU | 125311 tok/s step 16383/19560 | loss 3.291869 (-0.37z)| norm 0.2343 (-0.11z)| lr 5.65e-05 | 4181.31 ms | 32.3% bf16 MFU | 
125315 tok/s step 16384/19560 | loss 3.278701 (-0.64z)| norm 0.2189 (-1.25z)| lr 5.65e-05 | 4187.19 ms | 32.2% bf16 MFU | 125310 tok/s step 16385/19560 | loss 3.289511 (-0.40z)| norm 0.2376 (+0.13z)| lr 5.64e-05 | 4163.13 ms | 32.4% bf16 MFU | 125341 tok/s step 16386/19560 | loss 3.268061 (-0.87z)| norm 0.2187 (-1.25z)| lr 5.64e-05 | 4173.34 ms | 32.4% bf16 MFU | 125355 tok/s step 16387/19560 | loss 3.282491 (-0.56z)| norm 0.2242 (-0.84z)| lr 5.64e-05 | 4165.97 ms | 32.4% bf16 MFU | 125380 tok/s step 16388/19560 | loss 3.293152 (-0.33z)| norm 0.2308 (-0.35z)| lr 5.63e-05 | 4171.47 ms | 32.4% bf16 MFU | 125395 tok/s step 16389/19560 | loss 3.281693 (-0.58z)| norm 0.2223 (-0.97z)| lr 5.63e-05 | 4169.14 ms | 32.4% bf16 MFU | 125413 tok/s step 16390/19560 | loss 3.555984 (+4.98z)| norm 0.2765 (+2.87z)| lr 5.62e-05 | 4160.32 ms | 32.5% bf16 MFU | 125443 tok/s step 16391/19560 | loss 3.366972 (+1.14z)| norm 0.2311 (-0.35z)| lr 5.62e-05 | 4165.91 ms | 32.4% bf16 MFU | 125464 tok/s step 16392/19560 | loss 3.318449 (+0.17z)| norm 0.2368 (+0.08z)| lr 5.62e-05 | 4174.59 ms | 32.3% bf16 MFU | 125470 tok/s step 16393/19560 | loss 3.387417 (+1.53z)| norm 0.2489 (+0.94z)| lr 5.61e-05 | 4166.03 ms | 32.4% bf16 MFU | 125489 tok/s step 16394/19560 | loss 3.266666 (-0.88z)| norm 0.2323 (-0.25z)| lr 5.61e-05 | 4161.33 ms | 32.4% bf16 MFU | 125514 tok/s step 16395/19560 | loss 3.346903 (+0.71z)| norm 0.2579 (+1.58z)| lr 5.61e-05 | 4160.55 ms | 32.5% bf16 MFU | 125539 tok/s step 16396/19560 | loss 3.314904 (+0.07z)| norm 0.2348 (-0.06z)| lr 5.60e-05 | 4166.87 ms | 32.4% bf16 MFU | 125553 tok/s step 16397/19560 | loss 3.302056 (-0.19z)| norm 0.2263 (-0.67z)| lr 5.60e-05 | 4218.59 ms | 32.0% bf16 MFU | 125490 tok/s step 16398/19560 | loss 3.296357 (-0.30z)| norm 0.2463 (+0.77z)| lr 5.60e-05 | 4161.77 ms | 32.4% bf16 MFU | 125514 tok/s step 16399/19560 | loss 3.251678 (-1.18z)| norm 0.2235 (-0.86z)| lr 5.59e-05 | 4166.46 ms | 32.4% bf16 MFU | 125530 tok/s step 16400/19560 | loss 3.267645 
(-0.85z)| norm 0.2422 (+0.49z)| lr 5.59e-05 | 4162.04 ms | 32.4% bf16 MFU | 125552 tok/s step 16401/19560 | loss 3.303000 (-0.15z)| norm 0.2379 (+0.17z)| lr 5.59e-05 | 4170.87 ms | 32.4% bf16 MFU | 125560 tok/s step 16402/19560 | loss 3.380948 (+1.40z)| norm 0.2488 (+0.95z)| lr 5.58e-05 | 4162.40 ms | 32.4% bf16 MFU | 125580 tok/s step 16403/19560 | loss 3.256215 (-1.06z)| norm 0.2340 (-0.11z)| lr 5.58e-05 | 4159.51 ms | 32.5% bf16 MFU | 125603 tok/s step 16404/19560 | loss 3.284878 (-0.49z)| norm 0.2577 (+1.56z)| lr 5.58e-05 | 4159.68 ms | 32.5% bf16 MFU | 125625 tok/s step 16405/19560 | loss 3.336136 (+0.51z)| norm 0.2278 (-0.56z)| lr 5.57e-05 | 4162.19 ms | 32.4% bf16 MFU | 125642 tok/s step 16406/19560 | loss 3.251851 (-1.16z)| norm 0.2326 (-0.22z)| lr 5.57e-05 | 4162.36 ms | 32.4% bf16 MFU | 125658 tok/s step 16407/19560 | loss 3.332527 (+0.43z)| norm 0.2364 (+0.05z)| lr 5.57e-05 | 4158.97 ms | 32.5% bf16 MFU | 125678 tok/s step 16408/19560 | loss 3.364239 (+1.05z)| norm 0.2650 (+2.03z)| lr 5.56e-05 | 4169.58 ms | 32.4% bf16 MFU | 125681 tok/s step 16409/19560 | loss 3.326072 (+0.28z)| norm 0.2232 (-0.91z)| lr 5.56e-05 | 4171.08 ms | 32.4% bf16 MFU | 125682 tok/s step 16410/19560 | loss 3.396886 (+1.66z)| norm 0.2361 (-0.01z)| lr 5.56e-05 | 4172.92 ms | 32.4% bf16 MFU | 125680 tok/s step 16411/19560 | loss 3.292368 (-0.39z)| norm 0.2337 (-0.17z)| lr 5.55e-05 | 4183.52 ms | 32.3% bf16 MFU | 125662 tok/s step 16412/19560 | loss 3.386823 (+1.44z)| norm 0.2415 (+0.37z)| lr 5.55e-05 | 4168.85 ms | 32.4% bf16 MFU | 125667 tok/s step 16413/19560 | loss 3.305246 (-0.16z)| norm 0.2217 (-1.02z)| lr 5.55e-05 | 4169.71 ms | 32.4% bf16 MFU | 125670 tok/s step 16414/19560 | loss 3.306964 (-0.13z)| norm 0.2261 (-0.71z)| lr 5.54e-05 | 4159.51 ms | 32.5% bf16 MFU | 125689 tok/s step 16415/19560 | loss 3.281832 (-0.61z)| norm 0.2370 (+0.05z)| lr 5.54e-05 | 4167.24 ms | 32.4% bf16 MFU | 125695 tok/s step 16416/19560 | loss 3.374918 (+1.20z)| norm 0.2285 (-0.55z)| lr 5.54e-05 | 
4161.01 ms | 32.4% bf16 MFU | 125711 tok/s step 16417/19560 | loss 3.273653 (-0.78z)| norm 0.2231 (-0.93z)| lr 5.53e-05 | 4167.17 ms | 32.4% bf16 MFU | 125716 tok/s step 16418/19560 | loss 3.302529 (-0.22z)| norm 0.2266 (-0.69z)| lr 5.53e-05 | 4172.05 ms | 32.4% bf16 MFU | 125713 tok/s step 16419/19560 | loss 3.339661 (+0.50z)| norm 0.2226 (-0.97z)| lr 5.52e-05 | 4169.60 ms | 32.4% bf16 MFU | 125715 tok/s step 16420/19560 | loss 3.250123 (-1.24z)| norm 0.2214 (-1.05z)| lr 5.52e-05 | 4165.00 ms | 32.4% bf16 MFU | 125723 tok/s step 16421/19560 | loss 3.241124 (-1.40z)| norm 0.2338 (-0.18z)| lr 5.52e-05 | 4171.07 ms | 32.4% bf16 MFU | 125722 tok/s step 16422/19560 | loss 3.383559 (+1.33z)| norm 0.2326 (-0.27z)| lr 5.51e-05 | 4162.96 ms | 32.4% bf16 MFU | 125733 tok/s step 16423/19560 | loss 3.298070 (-0.31z)| norm 0.2285 (-0.54z)| lr 5.51e-05 | 4164.01 ms | 32.4% bf16 MFU | 125741 tok/s step 16424/19560 | loss 3.271210 (-0.81z)| norm 0.2262 (-0.70z)| lr 5.51e-05 | 4157.23 ms | 32.5% bf16 MFU | 125760 tok/s step 16425/19560 | loss 3.299024 (-0.28z)| norm 0.2209 (-1.06z)| lr 5.50e-05 | 4165.02 ms | 32.4% bf16 MFU | 125766 tok/s step 16426/19560 | loss 3.348554 (+0.66z)| norm 0.2240 (-0.84z)| lr 5.50e-05 | 4166.68 ms | 32.4% bf16 MFU | 125769 tok/s step 16427/19560 | loss 3.368182 (+1.02z)| norm 0.2321 (-0.27z)| lr 5.50e-05 | 4163.63 ms | 32.4% bf16 MFU | 125777 tok/s step 16428/19560 | loss 3.274212 (-0.78z)| norm 0.2234 (-0.88z)| lr 5.49e-05 | 4220.19 ms | 32.0% bf16 MFU | 125700 tok/s step 16429/19560 | loss 3.407718 (+1.75z)| norm 0.2477 (+0.82z)| lr 5.49e-05 | 4167.12 ms | 32.4% bf16 MFU | 125705 tok/s step 16430/19560 | loss 3.329360 (+0.25z)| norm 0.2382 (+0.16z)| lr 5.49e-05 | 4165.59 ms | 32.4% bf16 MFU | 125713 tok/s step 16431/19560 | loss 3.311836 (-0.10z)| norm 0.2379 (+0.14z)| lr 5.48e-05 | 4205.56 ms | 32.1% bf16 MFU | 125661 tok/s step 16432/19560 | loss 3.354990 (+0.72z)| norm 0.2375 (+0.12z)| lr 5.48e-05 | 4172.95 ms | 32.4% bf16 MFU | 125660 tok/s step 
16433/19560 | loss 3.255929 (-1.19z)| norm 0.2256 (-0.72z)| lr 5.48e-05 | 4164.73 ms | 32.4% bf16 MFU | 125671 tok/s step 16434/19560 | loss 3.349785 (+0.62z)| norm 0.2408 (+0.34z)| lr 5.47e-05 | 4164.16 ms | 32.4% bf16 MFU | 125683 tok/s step 16435/19560 | loss 3.335881 (+0.34z)| norm 0.2413 (+0.37z)| lr 5.47e-05 | 4161.28 ms | 32.4% bf16 MFU | 125698 tok/s step 16436/19560 | loss 3.304203 (-0.25z)| norm 0.2187 (-1.20z)| lr 5.47e-05 | 4170.71 ms | 32.4% bf16 MFU | 125699 tok/s step 16437/19560 | loss 3.374936 (+1.16z)| norm 0.2416 (+0.41z)| lr 5.46e-05 | 4164.25 ms | 32.4% bf16 MFU | 125709 tok/s step 16438/19560 | loss 3.322909 (+0.11z)| norm 0.2313 (-0.31z)| lr 5.46e-05 | 4167.26 ms | 32.4% bf16 MFU | 125714 tok/s step 16439/19560 | loss 3.265921 (-1.04z)| norm 0.2521 (+1.14z)| lr 5.46e-05 | 4164.27 ms | 32.4% bf16 MFU | 125723 tok/s step 16440/19560 | loss 3.346400 (+0.57z)| norm 0.2371 (+0.08z)| lr 5.45e-05 | 4167.53 ms | 32.4% bf16 MFU | 125727 tok/s step 16441/19560 | loss 3.320055 (+0.04z)| norm 0.2356 (-0.02z)| lr 5.45e-05 | 4170.58 ms | 32.4% bf16 MFU | 125726 tok/s step 16442/19560 | loss 3.319545 (+0.01z)| norm 0.2459 (+0.70z)| lr 5.45e-05 | 4157.06 ms | 32.5% bf16 MFU | 125746 tok/s step 16443/19560 | loss 3.270179 (-1.03z)| norm 0.2412 (+0.37z)| lr 5.44e-05 | 4163.61 ms | 32.4% bf16 MFU | 125755 tok/s step 16444/19560 | loss 3.294800 (-0.52z)| norm 0.2309 (-0.37z)| lr 5.44e-05 | 4170.86 ms | 32.4% bf16 MFU | 125752 tok/s step 16445/19560 | loss 3.292612 (-0.56z)| norm 0.2371 (+0.07z)| lr 5.44e-05 | 4165.71 ms | 32.4% bf16 MFU | 125758 tok/s step 16446/19560 | loss 3.371714 (+1.07z)| norm 0.2797 (+2.95z)| lr 5.43e-05 | 4172.00 ms | 32.4% bf16 MFU | 125753 tok/s step 16447/19560 | loss 3.269071 (-1.06z)| norm 0.2274 (-0.61z)| lr 5.43e-05 | 4165.99 ms | 32.4% bf16 MFU | 125758 tok/s step 16448/19560 | loss 3.309581 (-0.20z)| norm 0.2243 (-0.81z)| lr 5.43e-05 | 4166.88 ms | 32.4% bf16 MFU | 125761 tok/s step 16449/19560 | loss 3.407166 (+1.80z)| norm 
0.2645 (+1.88z)| lr 5.42e-05 | 4164.80 ms | 32.4% bf16 MFU | 125767 tok/s step 16450/19560 | loss 3.252060 (-1.41z)| norm 0.2272 (-0.63z)| lr 5.42e-05 | 4168.55 ms | 32.4% bf16 MFU | 125768 tok/s step 16451/19560 | loss 3.299152 (-0.42z)| norm 0.2313 (-0.35z)| lr 5.42e-05 | 4196.76 ms | 32.2% bf16 MFU | 125726 tok/s step 16452/19560 | loss 3.338245 (+0.41z)| norm 0.2278 (-0.59z)| lr 5.41e-05 | 4198.14 ms | 32.2% bf16 MFU | 125684 tok/s step 16453/19560 | loss 3.251223 (-1.42z)| norm 0.2223 (-0.94z)| lr 5.41e-05 | 4326.17 ms | 31.2% bf16 MFU | 125459 tok/s step 16454/19560 | loss 3.298828 (-0.39z)| norm 0.2233 (-0.88z)| lr 5.41e-05 | 4362.86 ms | 30.9% bf16 MFU | 125195 tok/s step 16455/19560 | loss 3.332634 (+0.32z)| norm 0.2393 (+0.21z)| lr 5.40e-05 | 4311.32 ms | 31.3% bf16 MFU | 125015 tok/s step 16456/19560 | loss 3.307817 (-0.21z)| norm 0.2421 (+0.39z)| lr 5.40e-05 | 4301.40 ms | 31.4% bf16 MFU | 124859 tok/s step 16457/19560 | loss 3.351403 (+0.72z)| norm 0.2297 (-0.45z)| lr 5.40e-05 | 4195.04 ms | 32.2% bf16 MFU | 124865 tok/s step 16458/19560 | loss 3.388134 (+1.48z)| norm 0.2418 (+0.36z)| lr 5.39e-05 | 4207.71 ms | 32.1% bf16 MFU | 124852 tok/s step 16459/19560 | loss 3.284548 (-0.73z)| norm 0.2316 (-0.33z)| lr 5.39e-05 | 4172.65 ms | 32.4% bf16 MFU | 124891 tok/s step 16460/19560 | loss 3.260595 (-1.22z)| norm 0.2355 (-0.06z)| lr 5.38e-05 | 4176.32 ms | 32.3% bf16 MFU | 124924 tok/s step 16461/19560 | loss 3.308086 (-0.21z)| norm 0.2332 (-0.22z)| lr 5.38e-05 | 4164.71 ms | 32.4% bf16 MFU | 124972 tok/s step 16462/19560 | loss 3.339511 (+0.45z)| norm 0.2330 (-0.22z)| lr 5.38e-05 | 4167.57 ms | 32.4% bf16 MFU | 125014 tok/s step 16463/19560 | loss 3.376034 (+1.21z)| norm 0.2248 (-0.77z)| lr 5.37e-05 | 4167.39 ms | 32.4% bf16 MFU | 125053 tok/s step 16464/19560 | loss 3.293726 (-0.52z)| norm 0.2184 (-1.19z)| lr 5.37e-05 | 4164.10 ms | 32.4% bf16 MFU | 125096 tok/s step 16465/19560 | loss 3.435618 (+2.40z)| norm 0.2746 (+2.51z)| lr 5.37e-05 | 4224.66 ms | 
32.0% bf16 MFU | 125046 tok/s step 16466/19560 | loss 3.287961 (-0.65z)| norm 0.2290 (-0.47z)| lr 5.36e-05 | 4175.34 ms | 32.3% bf16 MFU | 125072 tok/s step 16467/19560 | loss 3.264647 (-1.12z)| norm 0.2371 (+0.05z)| lr 5.36e-05 | 4167.50 ms | 32.4% bf16 MFU | 125109 tok/s step 16468/19560 | loss 3.300690 (-0.38z)| norm 0.2238 (-0.82z)| lr 5.36e-05 | 4168.66 ms | 32.4% bf16 MFU | 125142 tok/s step 16469/19560 | loss 3.303332 (-0.32z)| norm 0.2484 (+0.78z)| lr 5.35e-05 | 4171.98 ms | 32.4% bf16 MFU | 125168 tok/s step 16470/19560 | loss 3.324658 (+0.11z)| norm 0.2382 (+0.10z)| lr 5.35e-05 | 4176.34 ms | 32.3% bf16 MFU | 125187 tok/s step 16471/19560 | loss 3.316073 (-0.06z)| norm 0.2261 (-0.70z)| lr 5.35e-05 | 4161.83 ms | 32.4% bf16 MFU | 125226 tok/s step 16472/19560 | loss 3.277991 (-0.84z)| norm 0.2205 (-1.05z)| lr 5.34e-05 | 4166.43 ms | 32.4% bf16 MFU | 125257 tok/s step 16473/19560 | loss 3.325611 (+0.15z)| norm 0.2262 (-0.68z)| lr 5.34e-05 | 4177.94 ms | 32.3% bf16 MFU | 125268 tok/s step 16474/19560 | loss 3.269644 (-1.00z)| norm 0.2486 (+0.79z)| lr 5.34e-05 | 4170.27 ms | 32.4% bf16 MFU | 125291 tok/s step 16475/19560 | loss 3.351065 (+0.67z)| norm 0.2244 (-0.79z)| lr 5.33e-05 | 4170.50 ms | 32.4% bf16 MFU | 125312 tok/s step 16476/19560 | loss 3.273695 (-0.91z)| norm 0.2210 (-1.01z)| lr 5.33e-05 | 4181.36 ms | 32.3% bf16 MFU | 125316 tok/s step 16477/19560 | loss 3.296130 (-0.44z)| norm 0.2370 (+0.05z)| lr 5.33e-05 | 4168.33 ms | 32.4% bf16 MFU | 125339 tok/s step 16478/19560 | loss 3.378086 (+1.24z)| norm 0.2287 (-0.49z)| lr 5.32e-05 | 4179.24 ms | 32.3% bf16 MFU | 125344 tok/s step 16479/19560 | loss 3.243049 (-1.54z)| norm 0.2212 (-0.97z)| lr 5.32e-05 | 4162.21 ms | 32.4% bf16 MFU | 125375 tok/s step 16480/19560 | loss 3.277069 (-0.86z)| norm 0.2284 (-0.49z)| lr 5.32e-05 | 4165.04 ms | 32.4% bf16 MFU | 125401 tok/s step 16481/19560 | loss 3.349692 (+0.66z)| norm 0.2266 (-0.60z)| lr 5.31e-05 | 4174.32 ms | 32.3% bf16 MFU | 125410 tok/s step 16482/19560 
| loss 3.348191 (+0.62z)| norm 0.2314 (-0.28z)| lr 5.31e-05 | 4163.83 ms | 32.4% bf16 MFU | 125436 tok/s step 16483/19560 | loss 3.397093 (+1.61z)| norm 0.2500 (+0.94z)| lr 5.31e-05 | 4172.24 ms | 32.4% bf16 MFU | 125447 tok/s step 16484/19560 | loss 3.344717 (+0.53z)| norm 0.2380 (+0.15z)| lr 5.30e-05 | 4166.80 ms | 32.4% bf16 MFU | 125466 tok/s step 16485/19560 | loss 3.306503 (-0.27z)| norm 0.2616 (+1.67z)| lr 5.30e-05 | 4160.60 ms | 32.5% bf16 MFU | 125493 tok/s step 16486/19560 | loss 3.309924 (-0.20z)| norm 0.2223 (-0.88z)| lr 5.30e-05 | 4182.71 ms | 32.3% bf16 MFU | 125486 tok/s step 16487/19560 | loss 3.320774 (+0.03z)| norm 0.2441 (+0.54z)| lr 5.29e-05 | 4167.23 ms | 32.4% bf16 MFU | 125502 tok/s step 16488/19560 | loss 3.291780 (-0.57z)| norm 0.2356 (-0.02z)| lr 5.29e-05 | 4167.09 ms | 32.4% bf16 MFU | 125518 tok/s step 16489/19560 | loss 3.262153 (-1.17z)| norm 0.2312 (-0.31z)| lr 5.29e-05 | 4161.76 ms | 32.4% bf16 MFU | 125541 tok/s step 16490/19560 | loss 3.257857 (-1.25z)| norm 0.2389 (+0.19z)| lr 5.28e-05 | 4167.18 ms | 32.4% bf16 MFU | 125554 tok/s step 16491/19560 | loss 3.297314 (-0.43z)| norm 0.2323 (-0.24z)| lr 5.28e-05 | 4171.41 ms | 32.4% bf16 MFU | 125561 tok/s step 16492/19560 | loss 3.255131 (-1.28z)| norm 0.2327 (-0.22z)| lr 5.28e-05 | 4222.79 ms | 32.0% bf16 MFU | 125491 tok/s step 16493/19560 | loss 3.311996 (-0.12z)| norm 0.2379 (+0.12z)| lr 5.27e-05 | 4167.19 ms | 32.4% bf16 MFU | 125507 tok/s step 16494/19560 | loss 3.341405 (+0.47z)| norm 0.2532 (+1.11z)| lr 5.27e-05 | 4169.69 ms | 32.4% bf16 MFU | 125519 tok/s step 16495/19560 | loss 3.263524 (-1.11z)| norm 0.2348 (-0.09z)| lr 5.27e-05 | 4162.63 ms | 32.4% bf16 MFU | 125540 tok/s step 16496/19560 | loss 3.302205 (-0.32z)| norm 0.2325 (-0.25z)| lr 5.26e-05 | 4161.81 ms | 32.4% bf16 MFU | 125562 tok/s step 16497/19560 | loss 3.326973 (+0.19z)| norm 0.2269 (-0.61z)| lr 5.26e-05 | 4163.63 ms | 32.4% bf16 MFU | 125580 tok/s step 16498/19560 | loss 3.351351 (+0.69z)| norm 0.2397 (+0.23z)| 
lr 5.26e-05 | 4164.13 ms | 32.4% bf16 MFU | 125596 tok/s step 16499/19560 | loss 3.298172 (-0.39z)| norm 0.2580 (+1.40z)| lr 5.25e-05 | 4164.00 ms | 32.4% bf16 MFU | 125612 tok/s step 16500/19560 | loss 3.265442 (-1.05z)| norm 0.2456 (+0.58z)| lr 5.25e-05 | 4182.82 ms | 32.3% bf16 MFU | 125598 tok/s val loss 3.265605 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating 
HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 
evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3025/10042 = 0.301235 step 16501/19560 | loss 3.316282 (-0.00z)| norm 0.2372 (+0.04z)| lr 5.25e-05 | 4165.70 ms | 32.4% bf16 MFU | 125611 tok/s step 16502/19560 | loss 3.360467 (+0.92z)| norm 0.2250 (-0.87z)| lr 5.24e-05 | 4170.28 ms | 32.4% bf16 MFU | 125617 tok/s step 16503/19560 | loss 3.303445 (-0.26z)| norm 0.2592 (+1.84z)| lr 5.24e-05 | 4167.42 ms | 32.4% bf16 MFU | 125626 tok/s step 16504/19560 | loss 3.298783 (-0.35z)| norm 0.2430 (+0.55z)| lr 5.24e-05 | 4182.24 ms | 32.3% bf16 MFU | 125613 tok/s step 16505/19560 | loss 3.281763 (-0.70z)| norm 0.2187 (-1.35z)| lr 5.23e-05 | 4525.07 ms | 29.8% bf16 MFU | 125126 tok/s step 16506/19560 | loss 3.283392 (-0.66z)| norm 0.2267 (-0.71z)| lr 5.23e-05 | 4613.69 ms | 29.3% bf16 MFU | 124551 tok/s step 16507/19560 | loss 3.324680 (+0.21z)| norm 0.2442 (+0.71z)| lr 5.23e-05 | 4440.32 ms | 30.4% bf16 MFU | 124227 tok/s step 16508/19560 | loss 3.343193 (+0.60z)| norm 0.2236 (-0.96z)| lr 5.22e-05 | 4396.02 ms | 30.7% bf16 MFU | 123979 tok/s step 16509/19560 | loss 3.291298 (-0.49z)| norm 0.2371 (+0.15z)| lr 5.22e-05 | 4318.95 ms | 31.3% bf16 MFU | 123850 tok/s step 16510/19560 | loss 3.294371 (-0.43z)| norm 0.2250 (-0.83z)| lr 5.22e-05 | 4252.68 ms | 31.7% bf16 MFU | 123822 tok/s step 16511/19560 | loss 3.314390 (+0.00z)| norm 0.2297 (-0.44z)| lr 5.21e-05 | 4314.36 ms | 31.3% bf16 MFU | 123707 tok/s step 16512/19560 | loss 3.335608 (+0.45z)| norm 0.2512 (+1.30z)| lr 5.21e-05 | 4308.47 ms | 31.3% bf16 MFU | 123606 tok/s step 16513/19560 | loss 3.279934 (-0.75z)| norm 0.2209 (-1.16z)| lr 5.21e-05 | 4174.71 ms | 32.3% bf16 MFU | 123705 tok/s step 16514/19560 | loss 3.298326 (-0.36z)| norm 0.2311 (-0.34z)| lr 5.20e-05 | 4163.21 ms | 32.4% bf16 MFU | 123816 tok/s 
step 16515/19560 | loss 3.260038 (-1.18z)| norm 0.2330 (-0.20z)| lr 5.20e-05 | 4230.62 ms | 31.9% bf16 MFU | 123822 tok/s step 16516/19560 | loss 3.286622 (-0.60z)| norm 0.2310 (-0.36z)| lr 5.20e-05 | 4219.48 ms | 32.0% bf16 MFU | 123843 tok/s step 16517/19560 | loss 3.268631 (-0.99z)| norm 0.2323 (-0.26z)| lr 5.19e-05 | 4173.62 ms | 32.4% bf16 MFU | 123932 tok/s step 16518/19560 | loss 3.291651 (-0.51z)| norm 0.2284 (-0.58z)| lr 5.19e-05 | 4200.81 ms | 32.1% bf16 MFU | 123976 tok/s step 16519/19560 | loss 3.295274 (-0.41z)| norm 0.2337 (-0.12z)| lr 5.19e-05 | 4293.31 ms | 31.4% bf16 MFU | 123883 tok/s step 16520/19560 | loss 3.286884 (-0.60z)| norm 0.2237 (-0.97z)| lr 5.18e-05 | 4170.20 ms | 32.4% bf16 MFU | 123975 tok/s step 16521/19560 | loss 3.247829 (-1.53z)| norm 0.2254 (-0.81z)| lr 5.18e-05 | 4226.07 ms | 31.9% bf16 MFU | 123979 tok/s step 16522/19560 | loss 3.371823 (+1.47z)| norm 0.2260 (-0.75z)| lr 5.18e-05 | 4160.53 ms | 32.5% bf16 MFU | 124081 tok/s step 16523/19560 | loss 3.334867 (+0.57z)| norm 0.2238 (-0.93z)| lr 5.17e-05 | 4190.32 ms | 32.2% bf16 MFU | 124133 tok/s step 16524/19560 | loss 3.302848 (-0.21z)| norm 0.2182 (-1.41z)| lr 5.17e-05 | 4163.93 ms | 32.4% bf16 MFU | 124222 tok/s step 16525/19560 | loss 3.265237 (-1.11z)| norm 0.2246 (-0.84z)| lr 5.17e-05 | 4172.63 ms | 32.4% bf16 MFU | 124293 tok/s step 16526/19560 | loss 3.282188 (-0.69z)| norm 0.2189 (-1.32z)| lr 5.16e-05 | 4175.01 ms | 32.3% bf16 MFU | 124357 tok/s step 16527/19560 | loss 3.326560 (+0.37z)| norm 0.2304 (-0.33z)| lr 5.16e-05 | 4169.01 ms | 32.4% bf16 MFU | 124427 tok/s step 16528/19560 | loss 3.288202 (-0.57z)| norm 0.2265 (-0.66z)| lr 5.16e-05 | 4195.09 ms | 32.2% bf16 MFU | 124455 tok/s step 16529/19560 | loss 3.400855 (+2.12z)| norm 0.2354 (+0.12z)| lr 5.15e-05 | 4171.23 ms | 32.4% bf16 MFU | 124517 tok/s step 16530/19560 | loss 3.235273 (-1.82z)| norm 0.2334 (-0.04z)| lr 5.15e-05 | 4178.49 ms | 32.3% bf16 MFU | 124564 tok/s step 16531/19560 | loss 3.312051 (+0.01z)| norm 
0.2311 (-0.24z)| lr 5.15e-05 | 4165.94 ms | 32.4% bf16 MFU | 124629 tok/s step 16532/19560 | loss 3.354221 (+1.01z)| norm 0.2362 (+0.22z)| lr 5.14e-05 | 4159.84 ms | 32.5% bf16 MFU | 124699 tok/s step 16533/19560 | loss 3.273824 (-0.91z)| norm 0.2290 (-0.42z)| lr 5.14e-05 | 4173.08 ms | 32.4% bf16 MFU | 124746 tok/s step 16534/19560 | loss 3.269731 (-1.02z)| norm 0.2465 (+1.13z)| lr 5.14e-05 | 4169.38 ms | 32.4% bf16 MFU | 124796 tok/s step 16535/19560 | loss 3.244377 (-1.60z)| norm 0.2269 (-0.60z)| lr 5.13e-05 | 4168.23 ms | 32.4% bf16 MFU | 124845 tok/s step 16536/19560 | loss 3.257156 (-1.27z)| norm 0.2296 (-0.36z)| lr 5.13e-05 | 4168.83 ms | 32.4% bf16 MFU | 124891 tok/s step 16537/19560 | loss 3.311605 (+0.03z)| norm 0.2323 (-0.11z)| lr 5.13e-05 | 4215.20 ms | 32.0% bf16 MFU | 124866 tok/s step 16538/19560 | loss 3.269658 (-0.96z)| norm 0.2324 (-0.10z)| lr 5.12e-05 | 4170.23 ms | 32.4% bf16 MFU | 124908 tok/s step 16539/19560 | loss 3.275812 (-0.81z)| norm 0.2341 (+0.06z)| lr 5.12e-05 | 4175.02 ms | 32.3% bf16 MFU | 124942 tok/s step 16540/19560 | loss 3.306729 (-0.04z)| norm 0.2334 (-0.00z)| lr 5.12e-05 | 4167.02 ms | 32.4% bf16 MFU | 124986 tok/s step 16541/19560 | loss 3.291871 (-0.41z)| norm 0.2325 (-0.09z)| lr 5.11e-05 | 4173.06 ms | 32.4% bf16 MFU | 125018 tok/s step 16542/19560 | loss 3.222685 (-2.05z)| norm 0.2215 (-1.11z)| lr 5.11e-05 | 4169.97 ms | 32.4% bf16 MFU | 125054 tok/s step 16543/19560 | loss 3.329924 (+0.53z)| norm 0.2335 (-0.00z)| lr 5.11e-05 | 4165.26 ms | 32.4% bf16 MFU | 125095 tok/s step 16544/19560 | loss 3.296408 (-0.27z)| norm 0.2231 (-0.94z)| lr 5.10e-05 | 4172.64 ms | 32.4% bf16 MFU | 125122 tok/s step 16545/19560 | loss 3.289831 (-0.43z)| norm 0.2418 (+0.75z)| lr 5.10e-05 | 4172.85 ms | 32.4% bf16 MFU | 125148 tok/s step 16546/19560 | loss 3.283723 (-0.58z)| norm 0.2230 (-0.97z)| lr 5.10e-05 | 4167.34 ms | 32.4% bf16 MFU | 125181 tok/s step 16547/19560 | loss 3.316353 (+0.22z)| norm 0.2274 (-0.57z)| lr 5.09e-05 | 4222.20 ms | 
32.0% bf16 MFU | 125131 tok/s step 16548/19560 | loss 3.248757 (-1.43z)| norm 0.2202 (-1.23z)| lr 5.09e-05 | 4203.34 ms | 32.1% bf16 MFU | 125111 tok/s step 16549/19560 | loss 3.260361 (-1.16z)| norm 0.2334 (-0.02z)| lr 5.09e-05 | 4165.95 ms | 32.4% bf16 MFU | 125148 tok/s step 16550/19560 | loss 3.245254 (-1.51z)| norm 0.2169 (-1.50z)| lr 5.08e-05 | 4204.28 ms | 32.1% bf16 MFU | 125126 tok/s step 16551/19560 | loss 3.253442 (-1.29z)| norm 0.2290 (-0.41z)| lr 5.08e-05 | 4184.73 ms | 32.3% bf16 MFU | 125134 tok/s step 16552/19560 | loss 3.289992 (-0.40z)| norm 0.2239 (-0.87z)| lr 5.08e-05 | 4175.85 ms | 32.3% bf16 MFU | 125155 tok/s step 16553/19560 | loss 3.238586 (-1.63z)| norm 0.2163 (-1.55z)| lr 5.07e-05 | 4170.07 ms | 32.4% bf16 MFU | 125183 tok/s step 16554/19560 | loss 3.273148 (-0.78z)| norm 0.2264 (-0.63z)| lr 5.07e-05 | 4165.89 ms | 32.4% bf16 MFU | 125217 tok/s step 16555/19560 | loss 3.243265 (-1.48z)| norm 0.2276 (-0.52z)| lr 5.07e-05 | 4180.91 ms | 32.3% bf16 MFU | 125226 tok/s step 16556/19560 | loss 3.299766 (-0.11z)| norm 0.2237 (-0.88z)| lr 5.06e-05 | 4163.36 ms | 32.4% bf16 MFU | 125261 tok/s step 16557/19560 | loss 3.238509 (-1.60z)| norm 0.2220 (-1.01z)| lr 5.06e-05 | 4170.37 ms | 32.4% bf16 MFU | 125284 tok/s step 16558/19560 | loss 3.329669 (+0.66z)| norm 0.2488 (+1.40z)| lr 5.06e-05 | 4180.20 ms | 32.3% bf16 MFU | 125291 tok/s step 16559/19560 | loss 3.285881 (-0.42z)| norm 0.2461 (+1.14z)| lr 5.05e-05 | 4165.94 ms | 32.4% bf16 MFU | 125319 tok/s step 16560/19560 | loss 3.316180 (+0.34z)| norm 0.2665 (+2.86z)| lr 5.05e-05 | 4172.93 ms | 32.4% bf16 MFU | 125335 tok/s step 16561/19560 | loss 3.277778 (-0.62z)| norm 0.2215 (-1.04z)| lr 5.05e-05 | 4184.18 ms | 32.3% bf16 MFU | 125333 tok/s step 16562/19560 | loss 3.333826 (+0.79z)| norm 0.2321 (-0.11z)| lr 5.04e-05 | 4175.50 ms | 32.3% bf16 MFU | 125345 tok/s step 16563/19560 | loss 3.273549 (-0.72z)| norm 0.2320 (-0.12z)| lr 5.04e-05 | 4170.93 ms | 32.4% bf16 MFU | 125363 tok/s step 16564/19560 
| loss 3.224825 (-1.90z)| norm 0.2225 (-0.95z)| lr 5.04e-05 | 4167.93 ms | 32.4% bf16 MFU | 125384 tok/s step 16565/19560 | loss 3.278476 (-0.56z)| norm 0.2356 (+0.20z)| lr 5.03e-05 | 4164.18 ms | 32.4% bf16 MFU | 125410 tok/s step 16566/19560 | loss 3.325752 (+0.63z)| norm 0.2202 (-1.14z)| lr 5.03e-05 | 4161.92 ms | 32.4% bf16 MFU | 125438 tok/s step 16567/19560 | loss 3.285744 (-0.38z)| norm 0.2196 (-1.18z)| lr 5.03e-05 | 4167.95 ms | 32.4% bf16 MFU | 125456 tok/s step 16568/19560 | loss 3.325164 (+0.62z)| norm 0.2246 (-0.73z)| lr 5.02e-05 | 4164.83 ms | 32.4% bf16 MFU | 125477 tok/s step 16569/19560 | loss 3.300215 (-0.01z)| norm 0.2270 (-0.51z)| lr 5.02e-05 | 4155.78 ms | 32.5% bf16 MFU | 125511 tok/s step 16570/19560 | loss 3.307955 (+0.19z)| norm 0.2273 (-0.48z)| lr 5.02e-05 | 4202.17 ms | 32.1% bf16 MFU | 125474 tok/s step 16571/19560 | loss 3.293790 (-0.17z)| norm 0.2255 (-0.62z)| lr 5.01e-05 | 4171.89 ms | 32.4% bf16 MFU | 125484 tok/s step 16572/19560 | loss 3.322362 (+0.54z)| norm 0.2541 (+1.84z)| lr 5.01e-05 | 4165.30 ms | 32.4% bf16 MFU | 125503 tok/s step 16573/19560 | loss 3.238590 (-1.55z)| norm 0.2319 (-0.08z)| lr 5.01e-05 | 4170.45 ms | 32.4% bf16 MFU | 125514 tok/s step 16574/19560 | loss 3.330722 (+0.78z)| norm 0.2296 (-0.26z)| lr 5.00e-05 | 4162.13 ms | 32.4% bf16 MFU | 125536 tok/s step 16575/19560 | loss 3.338884 (+0.97z)| norm 0.2339 (+0.14z)| lr 5.00e-05 | 4169.39 ms | 32.4% bf16 MFU | 125547 tok/s step 16576/19560 | loss 3.305179 (+0.12z)| norm 0.2266 (-0.54z)| lr 5.00e-05 | 4161.13 ms | 32.4% bf16 MFU | 125569 tok/s step 16577/19560 | loss 3.277608 (-0.57z)| norm 0.2192 (-1.23z)| lr 4.99e-05 | 4182.25 ms | 32.3% bf16 MFU | 125559 tok/s step 16578/19560 | loss 3.355754 (+1.44z)| norm 0.2420 (+0.93z)| lr 4.99e-05 | 4170.78 ms | 32.4% bf16 MFU | 125566 tok/s step 16579/19560 | loss 3.336284 (+0.92z)| norm 0.2375 (+0.50z)| lr 4.99e-05 | 4175.00 ms | 32.3% bf16 MFU | 125567 tok/s step 16580/19560 | loss 3.322109 (+0.56z)| norm 0.2273 (-0.47z)| 
lr 4.98e-05 | 4164.29 ms | 32.4% bf16 MFU | 125584 tok/s step 16581/19560 | loss 3.308097 (+0.18z)| norm 0.2291 (-0.30z)| lr 4.98e-05 | 4187.41 ms | 32.2% bf16 MFU | 125565 tok/s step 16582/19560 | loss 3.242491 (-1.50z)| norm 0.2258 (-0.62z)| lr 4.98e-05 | 4168.04 ms | 32.4% bf16 MFU | 125576 tok/s step 16583/19560 | loss 3.335533 (+0.90z)| norm 0.2415 (+0.88z)| lr 4.98e-05 | 4171.73 ms | 32.4% bf16 MFU | 125581 tok/s step 16584/19560 | loss 3.300943 (+0.01z)| norm 0.2222 (-0.95z)| lr 4.97e-05 | 4165.96 ms | 32.4% bf16 MFU | 125594 tok/s step 16585/19560 | loss 3.330269 (+0.78z)| norm 0.2222 (-0.94z)| lr 4.97e-05 | 4168.88 ms | 32.4% bf16 MFU | 125603 tok/s step 16586/19560 | loss 3.390262 (+2.33z)| norm 0.2339 (+0.18z)| lr 4.97e-05 | 4168.51 ms | 32.4% bf16 MFU | 125611 tok/s step 16587/19560 | loss 3.351083 (+1.29z)| norm 0.2301 (-0.19z)| lr 4.96e-05 | 4166.61 ms | 32.4% bf16 MFU | 125622 tok/s step 16588/19560 | loss 3.310701 (+0.24z)| norm 0.2395 (+0.71z)| lr 4.96e-05 | 4174.79 ms | 32.3% bf16 MFU | 125620 tok/s step 16589/19560 | loss 3.269684 (-0.81z)| norm 0.2286 (-0.33z)| lr 4.96e-05 | 4199.73 ms | 32.1% bf16 MFU | 125581 tok/s step 16590/19560 | loss 3.321335 (+0.53z)| norm 0.2376 (+0.53z)| lr 4.95e-05 | 4177.98 ms | 32.3% bf16 MFU | 125577 tok/s step 16591/19560 | loss 3.305230 (+0.13z)| norm 0.2331 (+0.09z)| lr 4.95e-05 | 4164.88 ms | 32.4% bf16 MFU | 125592 tok/s step 16592/19560 | loss 3.255783 (-1.16z)| norm 0.2289 (-0.31z)| lr 4.95e-05 | 4172.55 ms | 32.4% bf16 MFU | 125595 tok/s step 16593/19560 | loss 3.291698 (-0.20z)| norm 0.2363 (+0.45z)| lr 4.94e-05 | 4166.31 ms | 32.4% bf16 MFU | 125607 tok/s step 16594/19560 | loss 3.406938 (+2.86z)| norm 0.2224 (-0.97z)| lr 4.94e-05 | 4170.00 ms | 32.4% bf16 MFU | 125613 tok/s step 16595/19560 | loss 3.281829 (-0.49z)| norm 0.2425 (+1.09z)| lr 4.94e-05 | 4170.65 ms | 32.4% bf16 MFU | 125618 tok/s step 16596/19560 | loss 3.295794 (-0.11z)| norm 0.2337 (+0.17z)| lr 4.93e-05 | 4168.84 ms | 32.4% bf16 MFU | 
125625 tok/s step 16597/19560 | loss 3.276445 (-0.62z)| norm 0.2368 (+0.51z)| lr 4.93e-05 | 4165.53 ms | 32.4% bf16 MFU | 125637 tok/s step 16598/19560 | loss 3.331464 (+0.85z)| norm 0.2270 (-0.50z)| lr 4.93e-05 | 4181.73 ms | 32.3% bf16 MFU | 125624 tok/s step 16599/19560 | loss 3.335281 (+0.94z)| norm 0.2365 (+0.48z)| lr 4.92e-05 | 4163.26 ms | 32.4% bf16 MFU | 125640 tok/s step 16600/19560 | loss 3.300283 (+0.01z)| norm 0.2384 (+0.66z)| lr 4.92e-05 | 4165.11 ms | 32.4% bf16 MFU | 125651 tok/s step 16601/19560 | loss 3.290173 (-0.26z)| norm 0.2267 (-0.55z)| lr 4.92e-05 | 4172.25 ms | 32.4% bf16 MFU | 125652 tok/s step 16602/19560 | loss 3.284658 (-0.41z)| norm 0.2309 (-0.11z)| lr 4.91e-05 | 4179.70 ms | 32.3% bf16 MFU | 125641 tok/s step 16603/19560 | loss 3.264919 (-0.92z)| norm 0.2156 (-1.69z)| lr 4.91e-05 | 4163.43 ms | 32.4% bf16 MFU | 125655 tok/s step 16604/19560 | loss 3.307029 (+0.20z)| norm 0.2372 (+0.55z)| lr 4.91e-05 | 4164.65 ms | 32.4% bf16 MFU | 125667 tok/s step 16605/19560 | loss 3.291635 (-0.21z)| norm 0.2254 (-0.68z)| lr 4.90e-05 | 4157.42 ms | 32.5% bf16 MFU | 125689 tok/s step 16606/19560 | loss 3.254253 (-1.20z)| norm 0.2376 (+0.60z)| lr 4.90e-05 | 4174.89 ms | 32.3% bf16 MFU | 125684 tok/s step 16607/19560 | loss 3.261837 (-1.01z)| norm 0.2199 (-1.26z)| lr 4.90e-05 | 4166.22 ms | 32.4% bf16 MFU | 125692 tok/s step 16608/19560 | loss 3.301749 (+0.08z)| norm 0.2577 (+2.60z)| lr 4.89e-05 | 4157.73 ms | 32.5% bf16 MFU | 125712 tok/s step 16609/19560 | loss 3.305288 (+0.19z)| norm 0.2447 (+1.26z)| lr 4.89e-05 | 4163.43 ms | 32.4% bf16 MFU | 125723 tok/s step 16610/19560 | loss 3.340082 (+1.15z)| norm 0.2463 (+1.40z)| lr 4.89e-05 | 4174.60 ms | 32.3% bf16 MFU | 125716 tok/s step 16611/19560 | loss 3.282355 (-0.43z)| norm 0.2286 (-0.37z)| lr 4.88e-05 | 4168.64 ms | 32.4% bf16 MFU | 125719 tok/s step 16612/19560 | loss 3.244879 (-1.48z)| norm 0.2715 (+3.75z)| lr 4.88e-05 | 4170.36 ms | 32.4% bf16 MFU | 125719 tok/s step 16613/19560 | loss 3.306586 
(+0.28z)| norm 0.2320 (-0.03z)| lr 4.88e-05 | 4175.96 ms | 32.3% bf16 MFU | 125710 tok/s step 16614/19560 | loss 3.294413 (-0.06z)| norm 0.2302 (-0.21z)| lr 4.87e-05 | 4164.45 ms | 32.4% bf16 MFU | 125720 tok/s step 16615/19560 | loss 3.280579 (-0.45z)| norm 0.2339 (+0.16z)| lr 4.87e-05 | 4174.20 ms | 32.3% bf16 MFU | 125714 tok/s step 16616/19560 | loss 3.288148 (-0.23z)| norm 0.2340 (+0.18z)| lr 4.87e-05 | 4169.14 ms | 32.4% bf16 MFU | 125716 tok/s step 16617/19560 | loss 3.297292 (+0.02z)| norm 0.2306 (-0.16z)| lr 4.86e-05 | 4157.45 ms | 32.5% bf16 MFU | 125735 tok/s step 16618/19560 | loss 3.337247 (+1.14z)| norm 0.2471 (+1.47z)| lr 4.86e-05 | 4166.93 ms | 32.4% bf16 MFU | 125740 tok/s step 16619/19560 | loss 3.254596 (-1.20z)| norm 0.2308 (-0.14z)| lr 4.86e-05 | 4169.92 ms | 32.4% bf16 MFU | 125739 tok/s step 16620/19560 | loss 3.243021 (-1.52z)| norm 0.2293 (-0.29z)| lr 4.85e-05 | 4172.30 ms | 32.4% bf16 MFU | 125735 tok/s step 16621/19560 | loss 3.334568 (+1.06z)| norm 0.2354 (+0.32z)| lr 4.85e-05 | 4175.73 ms | 32.3% bf16 MFU | 125726 tok/s step 16622/19560 | loss 3.259455 (-1.04z)| norm 0.2427 (+1.07z)| lr 4.85e-05 | 4173.23 ms | 32.4% bf16 MFU | 125722 tok/s step 16623/19560 | loss 3.331352 (+0.98z)| norm 0.2580 (+2.52z)| lr 4.85e-05 | 4153.62 ms | 32.5% bf16 MFU | 125747 tok/s step 16624/19560 | loss 3.288938 (-0.22z)| norm 0.2357 (+0.33z)| lr 4.84e-05 | 4176.49 ms | 32.3% bf16 MFU | 125736 tok/s step 16625/19560 | loss 3.346182 (+1.39z)| norm 0.2446 (+1.18z)| lr 4.84e-05 | 4173.00 ms | 32.4% bf16 MFU | 125731 tok/s step 16626/19560 | loss 3.263014 (-0.94z)| norm 0.2478 (+1.48z)| lr 4.84e-05 | 4171.37 ms | 32.4% bf16 MFU | 125729 tok/s step 16627/19560 | loss 3.349749 (+1.49z)| norm 0.2392 (+0.68z)| lr 4.83e-05 | 4170.78 ms | 32.4% bf16 MFU | 125728 tok/s step 16628/19560 | loss 3.355861 (+1.63z)| norm 0.2351 (+0.28z)| lr 4.83e-05 | 4167.32 ms | 32.4% bf16 MFU | 125732 tok/s step 16629/19560 | loss 3.269208 (-0.77z)| norm 0.2424 (+1.00z)| lr 4.83e-05 | 
4159.48 ms | 32.5% bf16 MFU | 125748 tok/s step 16630/19560 | loss 3.284109 (-0.34z)| norm 0.2380 (+0.56z)| lr 4.82e-05 | 4171.71 ms | 32.4% bf16 MFU | 125744 tok/s step 16631/19560 | loss 3.301669 (+0.15z)| norm 0.2337 (+0.15z)| lr 4.82e-05 | 4171.73 ms | 32.4% bf16 MFU | 125741 tok/s step 16632/19560 | loss 3.260180 (-1.00z)| norm 0.2198 (-1.26z)| lr 4.82e-05 | 4159.73 ms | 32.5% bf16 MFU | 125756 tok/s step 16633/19560 | loss 3.249671 (-1.28z)| norm 0.2481 (+1.61z)| lr 4.81e-05 | 4171.29 ms | 32.4% bf16 MFU | 125752 tok/s step 16634/19560 | loss 3.303387 (+0.21z)| norm 0.2483 (+1.60z)| lr 4.81e-05 | 4165.68 ms | 32.4% bf16 MFU | 125758 tok/s step 16635/19560 | loss 3.235499 (-1.65z)| norm 0.2426 (+1.02z)| lr 4.81e-05 | 4171.42 ms | 32.4% bf16 MFU | 125754 tok/s step 16636/19560 | loss 3.299829 (+0.14z)| norm 0.2336 (+0.11z)| lr 4.80e-05 | 4169.26 ms | 32.4% bf16 MFU | 125754 tok/s step 16637/19560 | loss 3.287466 (-0.21z)| norm 0.2318 (-0.07z)| lr 4.80e-05 | 4173.23 ms | 32.4% bf16 MFU | 125748 tok/s step 16638/19560 | loss 3.248408 (-1.27z)| norm 0.2359 (+0.34z)| lr 4.80e-05 | 4169.49 ms | 32.4% bf16 MFU | 125747 tok/s step 16639/19560 | loss 3.269173 (-0.69z)| norm 0.2348 (+0.22z)| lr 4.79e-05 | 4167.50 ms | 32.4% bf16 MFU | 125750 tok/s step 16640/19560 | loss 3.302319 (+0.23z)| norm 0.2370 (+0.46z)| lr 4.79e-05 | 4166.31 ms | 32.4% bf16 MFU | 125755 tok/s step 16641/19560 | loss 3.277150 (-0.46z)| norm 0.2181 (-1.49z)| lr 4.79e-05 | 4175.19 ms | 32.3% bf16 MFU | 125746 tok/s step 16642/19560 | loss 3.314754 (+0.58z)| norm 0.2302 (-0.23z)| lr 4.78e-05 | 4175.69 ms | 32.3% bf16 MFU | 125736 tok/s step 16643/19560 | loss 3.290079 (-0.12z)| norm 0.2328 (+0.03z)| lr 4.78e-05 | 4165.07 ms | 32.4% bf16 MFU | 125743 tok/s step 16644/19560 | loss 3.287381 (-0.19z)| norm 0.2510 (+1.87z)| lr 4.78e-05 | 4170.42 ms | 32.4% bf16 MFU | 125742 tok/s step 16645/19560 | loss 3.258503 (-0.99z)| norm 0.2356 (+0.30z)| lr 4.77e-05 | 4175.04 ms | 32.3% bf16 MFU | 125734 tok/s step 
16646/19560 | loss 3.281892 (-0.34z)| norm 0.2241 (-0.87z)| lr 4.77e-05 | 4169.95 ms | 32.4% bf16 MFU | 125733 tok/s step 16647/19560 | loss 3.292497 (-0.04z)| norm 0.2239 (-0.87z)| lr 4.77e-05 | 4173.21 ms | 32.4% bf16 MFU | 125728 tok/s step 16648/19560 | loss 3.286116 (-0.22z)| norm 0.2453 (+1.27z)| lr 4.76e-05 | 4166.88 ms | 32.4% bf16 MFU | 125733 tok/s step 16649/19560 | loss 3.271508 (-0.63z)| norm 0.2385 (+0.57z)| lr 4.76e-05 | 4189.15 ms | 32.2% bf16 MFU | 125704 tok/s step 16650/19560 | loss 3.291412 (-0.06z)| norm 0.2191 (-1.37z)| lr 4.76e-05 | 4163.83 ms | 32.4% bf16 MFU | 125715 tok/s step 16651/19560 | loss 3.277882 (-0.44z)| norm 0.2208 (-1.20z)| lr 4.76e-05 | 4165.25 ms | 32.4% bf16 MFU | 125723 tok/s step 16652/19560 | loss 3.264124 (-0.82z)| norm 0.2243 (-0.86z)| lr 4.75e-05 | 4171.05 ms | 32.4% bf16 MFU | 125721 tok/s step 16653/19560 | loss 3.309745 (+0.47z)| norm 0.2397 (+0.68z)| lr 4.75e-05 | 4163.09 ms | 32.4% bf16 MFU | 125732 tok/s step 16654/19560 | loss 3.277781 (-0.44z)| norm 0.2351 (+0.20z)| lr 4.75e-05 | 4173.63 ms | 32.4% bf16 MFU | 125726 tok/s step 16655/19560 | loss 3.256090 (-1.04z)| norm 0.2144 (-1.85z)| lr 4.74e-05 | 4172.53 ms | 32.4% bf16 MFU | 125723 tok/s step 16656/19560 | loss 3.302768 (+0.29z)| norm 0.2834 (+4.59z)| lr 4.74e-05 | 4159.13 ms | 32.5% bf16 MFU | 125739 tok/s step 16657/19560 | loss 3.264889 (-0.79z)| norm 0.2289 (-0.40z)| lr 4.74e-05 | 4171.70 ms | 32.4% bf16 MFU | 125736 tok/s step 16658/19560 | loss 3.322793 (+0.91z)| norm 0.2370 (+0.34z)| lr 4.73e-05 | 4172.96 ms | 32.4% bf16 MFU | 125731 tok/s step 16659/19560 | loss 3.289804 (-0.07z)| norm 0.2228 (-0.95z)| lr 4.73e-05 | 4158.35 ms | 32.5% bf16 MFU | 125749 tok/s step 16660/19560 | loss 3.203268 (-2.58z)| norm 0.2364 (+0.28z)| lr 4.73e-05 | 4174.38 ms | 32.3% bf16 MFU | 125741 tok/s step 16661/19560 | loss 3.341918 (+1.47z)| norm 0.2330 (-0.03z)| lr 4.72e-05 | 4170.16 ms | 32.4% bf16 MFU | 125740 tok/s step 16662/19560 | loss 3.279553 (-0.35z)| norm 
0.2182 (-1.36z)| lr 4.72e-05 | 4164.66 ms | 32.4% bf16 MFU | 125748 tok/s step 16663/19560 | loss 3.278567 (-0.39z)| norm 0.2228 (-0.93z)| lr 4.72e-05 | 4169.91 ms | 32.4% bf16 MFU | 125747 tok/s step 16664/19560 | loss 3.287232 (-0.14z)| norm 0.2267 (-0.57z)| lr 4.71e-05 | 4177.40 ms | 32.3% bf16 MFU | 125735 tok/s step 16665/19560 | loss 3.313598 (+0.63z)| norm 0.2190 (-1.26z)| lr 4.71e-05 | 4168.73 ms | 32.4% bf16 MFU | 125737 tok/s step 16666/19560 | loss 3.359804 (+1.95z)| norm 0.2317 (-0.11z)| lr 4.71e-05 | 4167.99 ms | 32.4% bf16 MFU | 125739 tok/s step 16667/19560 | loss 3.266101 (-0.78z)| norm 0.2201 (-1.14z)| lr 4.70e-05 | 4165.27 ms | 32.4% bf16 MFU | 125746 tok/s step 16668/19560 | loss 3.309044 (+0.47z)| norm 0.2204 (-1.10z)| lr 4.70e-05 | 4165.15 ms | 32.4% bf16 MFU | 125752 tok/s step 16669/19560 | loss 3.305027 (+0.35z)| norm 0.2239 (-0.78z)| lr 4.70e-05 | 4170.71 ms | 32.4% bf16 MFU | 125750 tok/s step 16670/19560 | loss 3.246498 (-1.37z)| norm 0.2319 (-0.08z)| lr 4.69e-05 | 4175.66 ms | 32.3% bf16 MFU | 125740 tok/s step 16671/19560 | loss 3.321137 (+0.83z)| norm 0.2420 (+0.83z)| lr 4.69e-05 | 4170.57 ms | 32.4% bf16 MFU | 125739 tok/s step 16672/19560 | loss 3.270305 (-0.66z)| norm 0.2295 (-0.30z)| lr 4.69e-05 | 4169.06 ms | 32.4% bf16 MFU | 125740 tok/s step 16673/19560 | loss 3.334162 (+1.20z)| norm 0.2368 (+0.36z)| lr 4.68e-05 | 4174.58 ms | 32.3% bf16 MFU | 125732 tok/s step 16674/19560 | loss 3.293137 (-0.00z)| norm 0.2292 (-0.33z)| lr 4.68e-05 | 4174.62 ms | 32.3% bf16 MFU | 125725 tok/s step 16675/19560 | loss 3.312420 (+0.56z)| norm 0.2237 (-0.82z)| lr 4.68e-05 | 4162.54 ms | 32.4% bf16 MFU | 125737 tok/s step 16676/19560 | loss 3.342067 (+1.41z)| norm 0.2266 (-0.57z)| lr 4.68e-05 | 4162.01 ms | 32.4% bf16 MFU | 125748 tok/s step 16677/19560 | loss 3.344742 (+1.46z)| norm 0.2404 (+0.67z)| lr 4.67e-05 | 4161.60 ms | 32.4% bf16 MFU | 125760 tok/s step 16678/19560 | loss 3.305398 (+0.30z)| norm 0.2316 (-0.13z)| lr 4.67e-05 | 4162.51 ms | 
32.4% bf16 MFU | 125770 tok/s step 16679/19560 | loss 3.299860 (+0.13z)| norm 0.2333 (+0.02z)| lr 4.67e-05 | 4161.39 ms | 32.4% bf16 MFU | 125781 tok/s step 16680/19560 | loss 3.316648 (+0.62z)| norm 0.2273 (-0.53z)| lr 4.66e-05 | 4215.32 ms | 32.0% bf16 MFU | 125711 tok/s step 16681/19560 | loss 3.329633 (+0.99z)| norm 0.2221 (-1.01z)| lr 4.66e-05 | 4239.37 ms | 31.8% bf16 MFU | 125609 tok/s step 16682/19560 | loss 3.290531 (-0.18z)| norm 0.2251 (-0.74z)| lr 4.66e-05 | 4165.42 ms | 32.4% bf16 MFU | 125621 tok/s step 16683/19560 | loss 3.314000 (+0.51z)| norm 0.2242 (-0.81z)| lr 4.65e-05 | 4172.15 ms | 32.4% bf16 MFU | 125624 tok/s step 16684/19560 | loss 3.292142 (-0.14z)| norm 0.2253 (-0.72z)| lr 4.65e-05 | 4166.90 ms | 32.4% bf16 MFU | 125634 tok/s step 16685/19560 | loss 3.261725 (-1.07z)| norm 0.2180 (-1.37z)| lr 4.65e-05 | 4162.81 ms | 32.4% bf16 MFU | 125649 tok/s step 16686/19560 | loss 3.278780 (-0.54z)| norm 0.2243 (-0.79z)| lr 4.64e-05 | 4346.36 ms | 31.1% bf16 MFU | 125398 tok/s step 16687/19560 | loss 3.324859 (+0.84z)| norm 0.2261 (-0.61z)| lr 4.64e-05 | 4157.76 ms | 32.5% bf16 MFU | 125433 tok/s step 16688/19560 | loss 3.327976 (+0.93z)| norm 0.2262 (-0.59z)| lr 4.64e-05 | 4153.74 ms | 32.5% bf16 MFU | 125472 tok/s step 16689/19560 | loss 3.280885 (-0.49z)| norm 0.2244 (-0.77z)| lr 4.63e-05 | 4645.49 ms | 29.1% bf16 MFU | 124842 tok/s step 16690/19560 | loss 3.325539 (+0.86z)| norm 0.2188 (-1.29z)| lr 4.63e-05 | 4293.51 ms | 31.4% bf16 MFU | 124705 tok/s step 16691/19560 | loss 3.319476 (+0.66z)| norm 0.2233 (-0.85z)| lr 4.63e-05 | 4292.08 ms | 31.5% bf16 MFU | 124578 tok/s step 16692/19560 | loss 3.266499 (-0.96z)| norm 0.2287 (-0.34z)| lr 4.62e-05 | 4643.21 ms | 29.1% bf16 MFU | 123994 tok/s step 16693/19560 | loss 3.249240 (-1.47z)| norm 0.2299 (-0.23z)| lr 4.62e-05 | 4242.33 ms | 31.8% bf16 MFU | 123974 tok/s step 16694/19560 | loss 3.295860 (-0.04z)| norm 0.2265 (-0.56z)| lr 4.62e-05 | 4195.68 ms | 32.2% bf16 MFU | 124023 tok/s step 16695/19560 
| loss 3.330779 (+1.01z)| norm 0.2237 (-0.83z)| lr 4.62e-05 | 4289.24 ms | 31.5% bf16 MFU | 123934 tok/s step 16696/19560 | loss 3.243952 (-1.60z)| norm 0.2240 (-0.80z)| lr 4.61e-05 | 4526.40 ms | 29.8% bf16 MFU | 123529 tok/s step 16697/19560 | loss 3.328400 (+0.94z)| norm 0.2379 (+0.52z)| lr 4.61e-05 | 4608.57 ms | 29.3% bf16 MFU | 123040 tok/s step 16698/19560 | loss 3.291273 (-0.18z)| norm 0.2214 (-1.05z)| lr 4.61e-05 | 4413.67 ms | 30.6% bf16 MFU | 122828 tok/s step 16699/19560 | loss 3.306632 (+0.28z)| norm 0.2294 (-0.29z)| lr 4.60e-05 | 4351.69 ms | 31.0% bf16 MFU | 122710 tok/s step 16700/19560 | loss 3.242329 (-1.62z)| norm 0.2285 (-0.36z)| lr 4.60e-05 | 4209.70 ms | 32.1% bf16 MFU | 122802 tok/s step 16701/19560 | loss 3.270540 (-0.79z)| norm 0.2294 (-0.28z)| lr 4.60e-05 | 4256.66 ms | 31.7% bf16 MFU | 122820 tok/s step 16702/19560 | loss 3.326959 (+0.91z)| norm 0.2290 (-0.31z)| lr 4.59e-05 | 4384.01 ms | 30.8% bf16 MFU | 122659 tok/s step 16703/19560 | loss 3.307475 (+0.33z)| norm 0.2260 (-0.59z)| lr 4.59e-05 | 4289.88 ms | 31.5% bf16 MFU | 122637 tok/s step 16704/19560 | loss 3.215874 (-2.37z)| norm 0.2228 (-0.90z)| lr 4.59e-05 | 4240.68 ms | 31.8% bf16 MFU | 122686 tok/s step 16705/19560 | loss 3.222740 (-2.12z)| norm 0.2268 (-0.53z)| lr 4.58e-05 | 4169.43 ms | 32.4% bf16 MFU | 122839 tok/s step 16706/19560 | loss 3.277483 (-0.51z)| norm 0.2301 (-0.20z)| lr 4.58e-05 | 4390.35 ms | 30.8% bf16 MFU | 122668 tok/s step 16707/19560 | loss 3.315759 (+0.62z)| norm 0.2363 (+0.42z)| lr 4.58e-05 | 4262.75 ms | 31.7% bf16 MFU | 122684 tok/s step 16708/19560 | loss 3.300720 (+0.18z)| norm 0.2287 (-0.33z)| lr 4.57e-05 | 4172.51 ms | 32.4% bf16 MFU | 122833 tok/s step 16709/19560 | loss 3.296610 (+0.07z)| norm 0.2371 (+0.48z)| lr 4.57e-05 | 4211.29 ms | 32.1% bf16 MFU | 122916 tok/s step 16710/19560 | loss 3.336453 (+1.23z)| norm 0.2396 (+0.72z)| lr 4.57e-05 | 4250.57 ms | 31.8% bf16 MFU | 122938 tok/s step 16711/19560 | loss 3.265518 (-0.87z)| norm 0.2287 (-0.33z)| 
lr 4.56e-05 | 4170.42 ms | 32.4% bf16 MFU | 123076 tok/s step 16712/19560 | loss 3.260346 (-1.01z)| norm 0.2275 (-0.46z)| lr 4.56e-05 | 4215.53 ms | 32.0% bf16 MFU | 123141 tok/s step 16713/19560 | loss 3.271433 (-0.67z)| norm 0.2241 (-0.79z)| lr 4.56e-05 | 4167.14 ms | 32.4% bf16 MFU | 123275 tok/s step 16714/19560 | loss 3.280320 (-0.39z)| norm 0.2276 (-0.45z)| lr 4.56e-05 | 4175.48 ms | 32.3% bf16 MFU | 123389 tok/s step 16715/19560 | loss 3.312147 (+0.61z)| norm 0.2389 (+0.65z)| lr 4.55e-05 | 4174.63 ms | 32.3% bf16 MFU | 123499 tok/s step 16716/19560 | loss 3.266094 (-0.82z)| norm 0.2272 (-0.48z)| lr 4.55e-05 | 4247.38 ms | 31.8% bf16 MFU | 123496 tok/s step 16717/19560 | loss 3.230226 (-1.90z)| norm 0.2269 (-0.51z)| lr 4.55e-05 | 4202.69 ms | 32.1% bf16 MFU | 123559 tok/s step 16718/19560 | loss 3.330568 (+1.18z)| norm 0.2268 (-0.51z)| lr 4.54e-05 | 4165.52 ms | 32.4% bf16 MFU | 123674 tok/s step 16719/19560 | loss 3.367656 (+2.26z)| norm 0.2336 (+0.15z)| lr 4.54e-05 | 4169.35 ms | 32.4% bf16 MFU | 123778 tok/s step 16720/19560 | loss 3.293453 (+0.02z)| norm 0.2344 (+0.23z)| lr 4.54e-05 | 4241.95 ms | 31.8% bf16 MFU | 123769 tok/s step 16721/19560 | loss 3.245948 (-1.40z)| norm 0.2352 (+0.31z)| lr 4.53e-05 | 4177.04 ms | 32.3% bf16 MFU | 123856 tok/s step 16722/19560 | loss 3.300923 (+0.29z)| norm 0.2237 (-0.83z)| lr 4.53e-05 | 4168.33 ms | 32.4% bf16 MFU | 123952 tok/s step 16723/19560 | loss 3.301496 (+0.30z)| norm 0.2360 (+0.39z)| lr 4.53e-05 | 4175.28 ms | 32.3% bf16 MFU | 124033 tok/s step 16724/19560 | loss 3.307723 (+0.50z)| norm 0.2299 (-0.20z)| lr 4.52e-05 | 4216.45 ms | 32.0% bf16 MFU | 124049 tok/s step 16725/19560 | loss 3.322634 (+0.95z)| norm 0.2340 (+0.20z)| lr 4.52e-05 | 4171.55 ms | 32.4% bf16 MFU | 124130 tok/s step 16726/19560 | loss 3.344243 (+1.62z)| norm 0.2550 (+2.20z)| lr 4.52e-05 | 4169.15 ms | 32.4% bf16 MFU | 124212 tok/s step 16727/19560 | loss 3.244056 (-1.49z)| norm 0.2344 (+0.22z)| lr 4.51e-05 | 4171.60 ms | 32.4% bf16 MFU | 
124285 tok/s step 16728/19560 | loss 3.199973 (-2.75z)| norm 0.2245 (-0.73z)| lr 4.51e-05 | 4182.45 ms | 32.3% bf16 MFU | 124338 tok/s step 16729/19560 | loss 3.342436 (+1.53z)| norm 0.2331 (+0.09z)| lr 4.51e-05 | 4167.56 ms | 32.4% bf16 MFU | 124412 tok/s step 16730/19560 | loss 3.321429 (+0.89z)| norm 0.2353 (+0.30z)| lr 4.51e-05 | 4176.30 ms | 32.3% bf16 MFU | 124468 tok/s step 16731/19560 | loss 3.307448 (+0.47z)| norm 0.2480 (+1.51z)| lr 4.50e-05 | 4163.58 ms | 32.4% bf16 MFU | 124541 tok/s step 16732/19560 | loss 3.302657 (+0.32z)| norm 0.2308 (-0.15z)| lr 4.50e-05 | 4243.22 ms | 31.8% bf16 MFU | 124492 tok/s step 16733/19560 | loss 3.275467 (-0.49z)| norm 0.2161 (-1.56z)| lr 4.50e-05 | 4275.55 ms | 31.6% bf16 MFU | 124398 tok/s step 16734/19560 | loss 3.305936 (+0.41z)| norm 0.2271 (-0.49z)| lr 4.49e-05 | 4252.47 ms | 31.8% bf16 MFU | 124343 tok/s step 16735/19560 | loss 3.255642 (-1.10z)| norm 0.2229 (-0.90z)| lr 4.49e-05 | 4187.70 ms | 32.2% bf16 MFU | 124386 tok/s step 16736/19560 | loss 3.294981 (+0.09z)| norm 0.2460 (+1.36z)| lr 4.49e-05 | 4171.36 ms | 32.4% bf16 MFU | 124451 tok/s step 16737/19560 | loss 3.288438 (-0.11z)| norm 0.2199 (-1.18z)| lr 4.48e-05 | 4175.17 ms | 32.3% bf16 MFU | 124507 tok/s step 16738/19560 | loss 3.268396 (-0.70z)| norm 0.2222 (-0.94z)| lr 4.48e-05 | 4175.50 ms | 32.3% bf16 MFU | 124560 tok/s step 16739/19560 | loss 3.258585 (-0.98z)| norm 0.2237 (-0.79z)| lr 4.48e-05 | 4174.35 ms | 32.3% bf16 MFU | 124611 tok/s step 16740/19560 | loss 3.337427 (+1.37z)| norm 0.2229 (-0.88z)| lr 4.47e-05 | 4178.65 ms | 32.3% bf16 MFU | 124654 tok/s step 16741/19560 | loss 3.253899 (-1.13z)| norm 0.2333 (+0.20z)| lr 4.47e-05 | 4162.24 ms | 32.4% bf16 MFU | 124720 tok/s step 16742/19560 | loss 3.249904 (-1.23z)| norm 0.2423 (+1.13z)| lr 4.47e-05 | 4169.21 ms | 32.4% bf16 MFU | 124771 tok/s step 16743/19560 | loss 3.254976 (-1.07z)| norm 0.2342 (+0.29z)| lr 4.46e-05 | 4192.56 ms | 32.2% bf16 MFU | 124785 tok/s step 16744/19560 | loss 3.330976 
(+1.17z)| norm 0.2433 (+1.22z)| lr 4.46e-05 | 4167.75 ms | 32.4% bf16 MFU | 124836 tok/s step 16745/19560 | loss 3.211862 (-2.28z)| norm 0.2404 (+0.91z)| lr 4.46e-05 | 4172.86 ms | 32.4% bf16 MFU | 124876 tok/s step 16746/19560 | loss 3.288092 (-0.06z)| norm 0.2432 (+1.20z)| lr 4.46e-05 | 4177.11 ms | 32.3% bf16 MFU | 124908 tok/s step 16747/19560 | loss 3.291679 (+0.03z)| norm 0.2304 (-0.13z)| lr 4.45e-05 | 4174.98 ms | 32.3% bf16 MFU | 124942 tok/s step 16748/19560 | loss 3.247003 (-1.28z)| norm 0.2479 (+1.66z)| lr 4.45e-05 | 4178.25 ms | 32.3% bf16 MFU | 124969 tok/s step 16749/19560 | loss 3.280715 (-0.28z)| norm 0.2308 (-0.09z)| lr 4.45e-05 | 4205.72 ms | 32.1% bf16 MFU | 124953 tok/s step 16750/19560 | loss 3.250357 (-1.17z)| norm 0.2422 (+1.08z)| lr 4.44e-05 | 4198.48 ms | 32.2% bf16 MFU | 124949 tok/s val loss 3.262355 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 
370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 
evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3057/10042 = 0.304421 step 16751/19560 | loss 3.276974 (-0.38z)| norm 0.2447 (+1.38z)| lr 4.44e-05 | 4170.37 ms | 32.4% bf16 MFU | 124988 tok/s step 16752/19560 | loss 3.241122 (-1.41z)| norm 0.2326 (+0.11z)| lr 4.44e-05 | 4170.37 ms | 32.4% bf16 MFU | 125024 tok/s step 16753/19560 | loss 3.280577 (-0.24z)| norm 0.2380 (+0.69z)| lr 4.43e-05 | 4162.39 ms | 32.4% bf16 MFU | 125071 tok/s step 16754/19560 | loss 3.317527 (+0.84z)| norm 0.2274 (-0.42z)| lr 4.43e-05 | 4172.82 ms | 32.4% bf16 MFU | 125100 tok/s step 16755/19560 | loss 3.340043 (+1.52z)| norm 0.2296 (-0.18z)| lr 4.43e-05 | 4170.09 ms | 32.4% bf16 MFU | 125131 tok/s step 16756/19560 | loss 3.256138 (-0.97z)| norm 0.2292 (-0.21z)| lr 4.42e-05 | 4175.04 ms | 32.3% bf16 MFU | 125153 tok/s step 16757/19560 | loss 3.349987 (+1.82z)| norm 0.2265 (-0.49z)| lr 4.42e-05 | 4174.06 ms | 32.3% bf16 MFU | 125176 tok/s step 16758/19560 | loss 3.258263 (-0.91z)| norm 0.2378 (+0.72z)| lr 4.42e-05 | 4175.12 ms | 32.3% bf16 MFU | 125196 tok/s step 16759/19560 | loss 3.276842 (-0.35z)| norm 0.2298 (-0.14z)| lr 4.42e-05 | 4179.43 ms | 32.3% bf16 MFU | 125208 tok/s step 16760/19560 | loss 3.279596 (-0.27z)| norm 0.2199 (-1.20z)| lr 4.41e-05 | 4173.74 ms | 32.3% 
bf16 MFU | 125229 tok/s step 16761/19560 | loss 3.290769 (+0.05z)| norm 0.2284 (-0.27z)| lr 4.41e-05 | 4193.91 ms | 32.2% bf16 MFU | 125218 tok/s step 16762/19560 | loss 3.275640 (-0.39z)| norm 0.2271 (-0.40z)| lr 4.41e-05 | 4174.52 ms | 32.3% bf16 MFU | 125236 tok/s step 16763/19560 | loss 3.240308 (-1.46z)| norm 0.2298 (-0.09z)| lr 4.40e-05 | 4170.75 ms | 32.4% bf16 MFU | 125260 tok/s step 16764/19560 | loss 3.271548 (-0.51z)| norm 0.2160 (-1.60z)| lr 4.40e-05 | 4194.55 ms | 32.2% bf16 MFU | 125247 tok/s step 16765/19560 | loss 3.273841 (-0.44z)| norm 0.2400 (+1.04z)| lr 4.40e-05 | 4169.87 ms | 32.4% bf16 MFU | 125271 tok/s step 16766/19560 | loss 3.326897 (+1.13z)| norm 0.2417 (+1.21z)| lr 4.39e-05 | 4178.81 ms | 32.3% bf16 MFU | 125280 tok/s step 16767/19560 | loss 3.273324 (-0.48z)| norm 0.2247 (-0.63z)| lr 4.39e-05 | 4192.80 ms | 32.2% bf16 MFU | 125269 tok/s step 16768/19560 | loss 3.230559 (-1.72z)| norm 0.2281 (-0.26z)| lr 4.39e-05 | 4184.59 ms | 32.3% bf16 MFU | 125270 tok/s step 16769/19560 | loss 3.374614 (+2.47z)| norm 0.2402 (+1.04z)| lr 4.38e-05 | 4165.38 ms | 32.4% bf16 MFU | 125300 tok/s step 16770/19560 | loss 3.441727 (+4.09z)| norm 0.2801 (+4.86z)| lr 4.38e-05 | 4183.01 ms | 32.3% bf16 MFU | 125302 tok/s step 16771/19560 | loss 3.435773 (+3.69z)| norm 0.2518 (+2.01z)| lr 4.38e-05 | 4173.73 ms | 32.3% bf16 MFU | 125317 tok/s step 16772/19560 | loss 3.303782 (+0.31z)| norm 0.2349 (+0.38z)| lr 4.38e-05 | 4165.90 ms | 32.4% bf16 MFU | 125344 tok/s step 16773/19560 | loss 3.276340 (-0.39z)| norm 0.2374 (+0.62z)| lr 4.37e-05 | 4169.01 ms | 32.4% bf16 MFU | 125365 tok/s step 16774/19560 | loss 3.277391 (-0.37z)| norm 0.2272 (-0.38z)| lr 4.37e-05 | 4172.76 ms | 32.4% bf16 MFU | 125379 tok/s step 16775/19560 | loss 3.241773 (-1.26z)| norm 0.2289 (-0.23z)| lr 4.37e-05 | 4166.78 ms | 32.4% bf16 MFU | 125401 tok/s step 16776/19560 | loss 3.292446 (+0.03z)| norm 0.2269 (-0.41z)| lr 4.36e-05 | 4170.12 ms | 32.4% bf16 MFU | 125417 tok/s step 16777/19560 | loss 
3.248266 (-1.09z)| norm 0.2231 (-0.78z)| lr 4.36e-05 | 4167.24 ms | 32.4% bf16 MFU | 125437 tok/s step 16778/19560 | loss 3.257560 (-0.84z)| norm 0.2345 (+0.35z)| lr 4.36e-05 | 4167.13 ms | 32.4% bf16 MFU | 125456 tok/s step 16779/19560 | loss 3.310734 (+0.49z)| norm 0.2177 (-1.33z)| lr 4.35e-05 | 4169.06 ms | 32.4% bf16 MFU | 125471 tok/s step 16780/19560 | loss 3.338875 (+1.19z)| norm 0.2355 (+0.44z)| lr 4.35e-05 | 4174.19 ms | 32.3% bf16 MFU | 125478 tok/s step 16781/19560 | loss 3.226951 (-1.60z)| norm 0.2286 (-0.24z)| lr 4.35e-05 | 4164.89 ms | 32.4% bf16 MFU | 125498 tok/s step 16782/19560 | loss 3.309173 (+0.44z)| norm 0.2391 (+0.82z)| lr 4.34e-05 | 4162.65 ms | 32.4% bf16 MFU | 125520 tok/s step 16783/19560 | loss 3.281454 (-0.25z)| norm 0.2252 (-0.59z)| lr 4.34e-05 | 4176.62 ms | 32.3% bf16 MFU | 125521 tok/s step 16784/19560 | loss 3.259136 (-0.80z)| norm 0.2226 (-0.91z)| lr 4.34e-05 | 4175.90 ms | 32.3% bf16 MFU | 125522 tok/s step 16785/19560 | loss 3.280693 (-0.27z)| norm 0.2325 (+0.21z)| lr 4.34e-05 | 4168.45 ms | 32.4% bf16 MFU | 125535 tok/s step 16786/19560 | loss 3.316382 (+0.63z)| norm 0.2416 (+1.25z)| lr 4.33e-05 | 4177.06 ms | 32.3% bf16 MFU | 125534 tok/s step 16787/19560 | loss 3.300562 (+0.23z)| norm 0.2230 (-0.87z)| lr 4.33e-05 | 4160.77 ms | 32.5% bf16 MFU | 125558 tok/s step 16788/19560 | loss 3.330471 (+0.97z)| norm 0.2138 (-1.88z)| lr 4.33e-05 | 4169.18 ms | 32.4% bf16 MFU | 125568 tok/s step 16789/19560 | loss 3.202068 (-2.23z)| norm 0.2278 (-0.30z)| lr 4.32e-05 | 4175.44 ms | 32.3% bf16 MFU | 125567 tok/s step 16790/19560 | loss 3.277698 (-0.34z)| norm 0.2282 (-0.27z)| lr 4.32e-05 | 4157.16 ms | 32.5% bf16 MFU | 125595 tok/s step 16791/19560 | loss 3.336389 (+1.11z)| norm 0.2144 (-1.80z)| lr 4.32e-05 | 4173.42 ms | 32.4% bf16 MFU | 125596 tok/s step 16792/19560 | loss 3.306912 (+0.37z)| norm 0.2442 (+1.51z)| lr 4.31e-05 | 4168.78 ms | 32.4% bf16 MFU | 125605 tok/s step 16793/19560 | loss 3.385444 (+2.27z)| norm 0.2204 (-1.14z)| lr 
4.31e-05 | 4169.54 ms | 32.4% bf16 MFU | 125612 tok/s step 16794/19560 | loss 3.312239 (+0.50z)| norm 0.2259 (-0.52z)| lr 4.31e-05 | 4229.83 ms | 31.9% bf16 MFU | 125529 tok/s step 16795/19560 | loss 3.281552 (-0.26z)| norm 0.2239 (-0.75z)| lr 4.30e-05 | 4173.01 ms | 32.4% bf16 MFU | 125534 tok/s step 16796/19560 | loss 3.271198 (-0.51z)| norm 0.2398 (+1.01z)| lr 4.30e-05 | 4178.75 ms | 32.3% bf16 MFU | 125531 tok/s step 16797/19560 | loss 3.258543 (-0.81z)| norm 0.2201 (-1.18z)| lr 4.30e-05 | 4158.48 ms | 32.5% bf16 MFU | 125558 tok/s step 16798/19560 | loss 3.336723 (+1.10z)| norm 0.2178 (-1.42z)| lr 4.30e-05 | 4160.37 ms | 32.5% bf16 MFU | 125581 tok/s step 16799/19560 | loss 3.240283 (-1.26z)| norm 0.2259 (-0.50z)| lr 4.29e-05 | 4158.92 ms | 32.5% bf16 MFU | 125605 tok/s step 16800/19560 | loss 3.309701 (+0.44z)| norm 0.2276 (-0.32z)| lr 4.29e-05 | 4280.25 ms | 31.5% bf16 MFU | 125449 tok/s step 16801/19560 | loss 3.270384 (-0.52z)| norm 0.2246 (-0.63z)| lr 4.29e-05 | 4160.08 ms | 32.5% bf16 MFU | 125478 tok/s step 16802/19560 | loss 3.264226 (-0.66z)| norm 0.2226 (-0.85z)| lr 4.28e-05 | 4158.07 ms | 32.5% bf16 MFU | 125509 tok/s step 16803/19560 | loss 3.295114 (+0.10z)| norm 0.2366 (+0.69z)| lr 4.28e-05 | 4184.92 ms | 32.3% bf16 MFU | 125497 tok/s step 16804/19560 | loss 3.333235 (+1.04z)| norm 0.2370 (+0.72z)| lr 4.28e-05 | 4169.57 ms | 32.4% bf16 MFU | 125510 tok/s step 16805/19560 | loss 3.275144 (-0.38z)| norm 0.2147 (-1.71z)| lr 4.27e-05 | 4165.38 ms | 32.4% bf16 MFU | 125528 tok/s step 16806/19560 | loss 3.285837 (-0.11z)| norm 0.2146 (-1.68z)| lr 4.27e-05 | 4171.22 ms | 32.4% bf16 MFU | 125536 tok/s step 16807/19560 | loss 3.296725 (+0.16z)| norm 0.2278 (-0.25z)| lr 4.27e-05 | 4248.43 ms | 31.8% bf16 MFU | 125429 tok/s step 16808/19560 | loss 3.273786 (-0.40z)| norm 0.2205 (-1.03z)| lr 4.26e-05 | 4167.79 ms | 32.4% bf16 MFU | 125448 tok/s step 16809/19560 | loss 3.242596 (-1.16z)| norm 0.2263 (-0.41z)| lr 4.26e-05 | 4170.69 ms | 32.4% bf16 MFU | 125461 
tok/s step 16810/19560 | loss 3.259780 (-0.72z)| norm 0.2158 (-1.52z)| lr 4.26e-05 | 4164.83 ms | 32.4% bf16 MFU | 125482 tok/s step 16811/19560 | loss 3.272315 (-0.41z)| norm 0.2341 (+0.43z)| lr 4.26e-05 | 4172.77 ms | 32.4% bf16 MFU | 125490 tok/s step 16812/19560 | loss 3.274205 (-0.36z)| norm 0.2334 (+0.35z)| lr 4.25e-05 | 4161.16 ms | 32.4% bf16 MFU | 125515 tok/s step 16813/19560 | loss 3.288560 (-0.00z)| norm 0.2305 (+0.02z)| lr 4.25e-05 | 4193.18 ms | 32.2% bf16 MFU | 125491 tok/s step 16814/19560 | loss 3.314576 (+0.63z)| norm 0.2329 (+0.27z)| lr 4.25e-05 | 4163.92 ms | 32.4% bf16 MFU | 125512 tok/s step 16815/19560 | loss 3.293417 (+0.11z)| norm 0.2320 (+0.18z)| lr 4.24e-05 | 4171.56 ms | 32.4% bf16 MFU | 125521 tok/s step 16816/19560 | loss 3.288968 (+0.01z)| norm 0.2254 (-0.53z)| lr 4.24e-05 | 4173.13 ms | 32.4% bf16 MFU | 125526 tok/s step 16817/19560 | loss 3.270728 (-0.44z)| norm 0.2256 (-0.52z)| lr 4.24e-05 | 4166.60 ms | 32.4% bf16 MFU | 125542 tok/s step 16818/19560 | loss 3.305386 (+0.43z)| norm 0.2308 (+0.04z)| lr 4.23e-05 | 4158.97 ms | 32.5% bf16 MFU | 125568 tok/s step 16819/19560 | loss 3.293879 (+0.15z)| norm 0.2203 (-1.10z)| lr 4.23e-05 | 4164.28 ms | 32.4% bf16 MFU | 125584 tok/s step 16820/19560 | loss 3.292800 (+0.11z)| norm 0.2229 (-0.82z)| lr 4.23e-05 | 4166.09 ms | 32.4% bf16 MFU | 125597 tok/s step 16821/19560 | loss 3.345754 (+1.42z)| norm 0.2284 (-0.21z)| lr 4.23e-05 | 4165.39 ms | 32.4% bf16 MFU | 125611 tok/s step 16822/19560 | loss 3.243602 (-1.12z)| norm 0.2205 (-1.07z)| lr 4.22e-05 | 4262.36 ms | 31.7% bf16 MFU | 125481 tok/s step 16823/19560 | loss 3.237782 (-1.24z)| norm 0.2357 (+0.57z)| lr 4.22e-05 | 4156.08 ms | 32.5% bf16 MFU | 125514 tok/s step 16824/19560 | loss 3.289634 (+0.03z)| norm 0.2439 (+1.44z)| lr 4.22e-05 | 4169.14 ms | 32.4% bf16 MFU | 125526 tok/s step 16825/19560 | loss 3.260864 (-0.67z)| norm 0.2190 (-1.23z)| lr 4.21e-05 | 4171.75 ms | 32.4% bf16 MFU | 125534 tok/s step 16826/19560 | loss 3.245440 
(-1.04z)| norm 0.2262 (-0.46z)| lr 4.21e-05 | 4193.58 ms | 32.2% bf16 MFU | 125508 tok/s step 16827/19560 | loss 3.273048 (-0.35z)| norm 0.2373 (+0.73z)| lr 4.21e-05 | 4161.97 ms | 32.4% bf16 MFU | 125531 tok/s step 16828/19560 | loss 3.310533 (+0.57z)| norm 0.2282 (-0.26z)| lr 4.20e-05 | 4166.04 ms | 32.4% bf16 MFU | 125547 tok/s step 16829/19560 | loss 3.271842 (-0.39z)| norm 0.2315 (+0.11z)| lr 4.20e-05 | 4161.98 ms | 32.4% bf16 MFU | 125568 tok/s step 16830/19560 | loss 3.277617 (-0.24z)| norm 0.2190 (-1.22z)| lr 4.20e-05 | 4160.26 ms | 32.5% bf16 MFU | 125591 tok/s step 16831/19560 | loss 3.274161 (-0.32z)| norm 0.2218 (-0.92z)| lr 4.20e-05 | 4175.55 ms | 32.3% bf16 MFU | 125589 tok/s step 16832/19560 | loss 3.282686 (-0.12z)| norm 0.2311 (+0.06z)| lr 4.19e-05 | 4163.99 ms | 32.4% bf16 MFU | 125605 tok/s step 16833/19560 | loss 3.265611 (-0.57z)| norm 0.2207 (-1.04z)| lr 4.19e-05 | 4161.18 ms | 32.4% bf16 MFU | 125625 tok/s step 16834/19560 | loss 3.269643 (-0.46z)| norm 0.2219 (-0.90z)| lr 4.19e-05 | 4165.84 ms | 32.4% bf16 MFU | 125636 tok/s step 16835/19560 | loss 3.331348 (+1.11z)| norm 0.2279 (-0.25z)| lr 4.18e-05 | 4172.08 ms | 32.4% bf16 MFU | 125638 tok/s step 16836/19560 | loss 3.280328 (-0.19z)| norm 0.2302 (-0.01z)| lr 4.18e-05 | 4203.95 ms | 32.1% bf16 MFU | 125592 tok/s step 16837/19560 | loss 3.274983 (-0.32z)| norm 0.2415 (+1.18z)| lr 4.18e-05 | 4164.59 ms | 32.4% bf16 MFU | 125607 tok/s step 16838/19560 | loss 3.292937 (+0.15z)| norm 0.2271 (-0.34z)| lr 4.17e-05 | 4168.11 ms | 32.4% bf16 MFU | 125616 tok/s step 16839/19560 | loss 3.252788 (-0.88z)| norm 0.2184 (-1.25z)| lr 4.17e-05 | 4177.58 ms | 32.3% bf16 MFU | 125610 tok/s step 16840/19560 | loss 3.278448 (-0.23z)| norm 0.2347 (+0.47z)| lr 4.17e-05 | 4166.76 ms | 32.4% bf16 MFU | 125621 tok/s step 16841/19560 | loss 3.259915 (-0.70z)| norm 0.2450 (+1.53z)| lr 4.17e-05 | 4168.16 ms | 32.4% bf16 MFU | 125629 tok/s step 16842/19560 | loss 3.284207 (-0.08z)| norm 0.2224 (-0.84z)| lr 4.16e-05 | 
4178.73 ms | 32.3% bf16 MFU | 125621 tok/s step 16843/19560 | loss 3.312880 (+0.66z)| norm 0.2314 (+0.11z)| lr 4.16e-05 | 4163.25 ms | 32.4% bf16 MFU | 125636 tok/s step 16844/19560 | loss 3.256911 (-0.78z)| norm 0.2490 (+1.91z)| lr 4.16e-05 | 4163.47 ms | 32.4% bf16 MFU | 125651 tok/s step 16845/19560 | loss 3.316934 (+0.75z)| norm 0.2323 (+0.18z)| lr 4.15e-05 | 4159.62 ms | 32.5% bf16 MFU | 125670 tok/s step 16846/19560 | loss 3.273128 (-0.37z)| norm 0.2251 (-0.56z)| lr 4.15e-05 | 4152.65 ms | 32.5% bf16 MFU | 125699 tok/s step 16847/19560 | loss 3.281579 (-0.14z)| norm 0.2279 (-0.26z)| lr 4.15e-05 | 4173.77 ms | 32.3% bf16 MFU | 125695 tok/s step 16848/19560 | loss 3.290612 (+0.10z)| norm 0.2189 (-1.18z)| lr 4.14e-05 | 4181.58 ms | 32.3% bf16 MFU | 125679 tok/s step 16849/19560 | loss 3.319701 (+0.86z)| norm 0.2272 (-0.32z)| lr 4.14e-05 | 4164.75 ms | 32.4% bf16 MFU | 125690 tok/s step 16850/19560 | loss 3.278708 (-0.22z)| norm 0.2196 (-1.10z)| lr 4.14e-05 | 4195.22 ms | 32.2% bf16 MFU | 125654 tok/s step 16851/19560 | loss 3.265126 (-0.57z)| norm 0.2214 (-0.90z)| lr 4.14e-05 | 4170.62 ms | 32.4% bf16 MFU | 125657 tok/s step 16852/19560 | loss 3.239694 (-1.22z)| norm 0.2297 (-0.05z)| lr 4.13e-05 | 4211.93 ms | 32.1% bf16 MFU | 125598 tok/s step 16853/19560 | loss 3.279064 (-0.18z)| norm 0.2267 (-0.35z)| lr 4.13e-05 | 4164.13 ms | 32.4% bf16 MFU | 125613 tok/s step 16854/19560 | loss 3.280104 (-0.14z)| norm 0.2152 (-1.52z)| lr 4.13e-05 | 4166.99 ms | 32.4% bf16 MFU | 125623 tok/s step 16855/19560 | loss 3.223352 (-1.64z)| norm 0.2280 (-0.18z)| lr 4.12e-05 | 4162.29 ms | 32.4% bf16 MFU | 125640 tok/s step 16856/19560 | loss 3.312363 (+0.71z)| norm 0.2230 (-0.70z)| lr 4.12e-05 | 4165.37 ms | 32.4% bf16 MFU | 125652 tok/s step 16857/19560 | loss 3.258592 (-0.73z)| norm 0.2188 (-1.12z)| lr 4.12e-05 | 4171.68 ms | 32.4% bf16 MFU | 125653 tok/s step 16858/19560 | loss 3.253831 (-0.85z)| norm 0.2301 (+0.05z)| lr 4.11e-05 | 4164.69 ms | 32.4% bf16 MFU | 125665 tok/s step 
16859/19560 | loss 3.281803 (-0.08z)| norm 0.2393 (+1.03z)| lr 4.11e-05 | 4164.50 ms | 32.4% bf16 MFU | 125676 tok/s step 16860/19560 | loss 3.217792 (-1.78z)| norm 0.2164 (-1.35z)| lr 4.11e-05 | 4164.22 ms | 32.4% bf16 MFU | 125688 tok/s step 16861/19560 | loss 3.324108 (+1.06z)| norm 0.2238 (-0.60z)| lr 4.11e-05 | 4162.85 ms | 32.4% bf16 MFU | 125701 tok/s step 16862/19560 | loss 3.347984 (+1.68z)| norm 0.2410 (+1.19z)| lr 4.10e-05 | 4244.89 ms | 31.8% bf16 MFU | 125591 tok/s step 16863/19560 | loss 3.324507 (+1.04z)| norm 0.2463 (+1.71z)| lr 4.10e-05 | 4168.74 ms | 32.4% bf16 MFU | 125600 tok/s step 16864/19560 | loss 3.292112 (+0.18z)| norm 0.2309 (+0.13z)| lr 4.10e-05 | 4161.91 ms | 32.4% bf16 MFU | 125618 tok/s step 16865/19560 | loss 3.304163 (+0.50z)| norm 0.2208 (-0.92z)| lr 4.09e-05 | 4172.76 ms | 32.4% bf16 MFU | 125620 tok/s step 16866/19560 | loss 3.241464 (-1.15z)| norm 0.2359 (+0.65z)| lr 4.09e-05 | 4165.11 ms | 32.4% bf16 MFU | 125633 tok/s step 16867/19560 | loss 3.320699 (+0.92z)| norm 0.2283 (-0.16z)| lr 4.09e-05 | 4225.34 ms | 32.0% bf16 MFU | 125555 tok/s step 16868/19560 | loss 3.250471 (-0.91z)| norm 0.2195 (-1.07z)| lr 4.08e-05 | 4163.96 ms | 32.4% bf16 MFU | 125573 tok/s step 16869/19560 | loss 3.348805 (+1.65z)| norm 0.2274 (-0.24z)| lr 4.08e-05 | 4164.18 ms | 32.4% bf16 MFU | 125589 tok/s step 16870/19560 | loss 3.297427 (+0.30z)| norm 0.2243 (-0.56z)| lr 4.08e-05 | 4162.43 ms | 32.4% bf16 MFU | 125608 tok/s step 16871/19560 | loss 3.310057 (+0.62z)| norm 0.2299 (+0.04z)| lr 4.08e-05 | 4166.08 ms | 32.4% bf16 MFU | 125620 tok/s step 16872/19560 | loss 3.295220 (+0.24z)| norm 0.2334 (+0.42z)| lr 4.07e-05 | 4164.08 ms | 32.4% bf16 MFU | 125634 tok/s step 16873/19560 | loss 3.236861 (-1.32z)| norm 0.2197 (-1.01z)| lr 4.07e-05 | 4163.31 ms | 32.4% bf16 MFU | 125649 tok/s step 16874/19560 | loss 3.315036 (+0.75z)| norm 0.2230 (-0.66z)| lr 4.07e-05 | 4189.11 ms | 32.2% bf16 MFU | 125624 tok/s step 16875/19560 | loss 3.244539 (-1.11z)| norm 
0.2230 (-0.64z)| lr 4.06e-05 | 4258.32 ms | 31.7% bf16 MFU | 125499 tok/s step 16876/19560 | loss 3.291073 (+0.12z)| norm 0.2191 (-1.06z)| lr 4.06e-05 | 4158.32 ms | 32.5% bf16 MFU | 125528 tok/s step 16877/19560 | loss 3.325451 (+1.02z)| norm 0.2216 (-0.77z)| lr 4.06e-05 | 4164.63 ms | 32.4% bf16 MFU | 125546 tok/s step 16878/19560 | loss 3.298120 (+0.28z)| norm 0.2192 (-1.02z)| lr 4.05e-05 | 4169.34 ms | 32.4% bf16 MFU | 125556 tok/s step 16879/19560 | loss 3.277923 (-0.25z)| norm 0.2174 (-1.20z)| lr 4.05e-05 | 4171.85 ms | 32.4% bf16 MFU | 125562 tok/s step 16880/19560 | loss 3.282690 (-0.13z)| norm 0.2213 (-0.76z)| lr 4.05e-05 | 4171.03 ms | 32.4% bf16 MFU | 125569 tok/s step 16881/19560 | loss 3.308308 (+0.54z)| norm 0.2247 (-0.38z)| lr 4.05e-05 | 4170.66 ms | 32.4% bf16 MFU | 125576 tok/s step 16882/19560 | loss 3.310576 (+0.61z)| norm 0.2198 (-0.91z)| lr 4.04e-05 | 4159.52 ms | 32.5% bf16 MFU | 125599 tok/s step 16883/19560 | loss 3.264466 (-0.61z)| norm 0.2150 (-1.41z)| lr 4.04e-05 | 4161.09 ms | 32.4% bf16 MFU | 125619 tok/s step 16884/19560 | loss 3.202147 (-2.24z)| norm 0.2200 (-0.86z)| lr 4.04e-05 | 4160.67 ms | 32.5% bf16 MFU | 125639 tok/s step 16885/19560 | loss 3.368773 (+2.15z)| norm 0.2257 (-0.24z)| lr 4.03e-05 | 4168.94 ms | 32.4% bf16 MFU | 125645 tok/s step 16886/19560 | loss 3.290796 (+0.09z)| norm 0.2332 (+0.58z)| lr 4.03e-05 | 4156.01 ms | 32.5% bf16 MFU | 125670 tok/s step 16887/19560 | loss 3.295348 (+0.21z)| norm 0.2218 (-0.65z)| lr 4.03e-05 | 4870.41 ms | 27.7% bf16 MFU | 124769 tok/s step 16888/19560 | loss 3.294208 (+0.18z)| norm 0.2294 (+0.16z)| lr 4.02e-05 | 4451.40 ms | 30.3% bf16 MFU | 124420 tok/s step 16889/19560 | loss 3.256769 (-0.80z)| norm 0.2399 (+1.28z)| lr 4.02e-05 | 4524.40 ms | 29.8% bf16 MFU | 123993 tok/s step 16890/19560 | loss 3.278488 (-0.23z)| norm 0.2235 (-0.48z)| lr 4.02e-05 | 4645.24 ms | 29.1% bf16 MFU | 123436 tok/s step 16891/19560 | loss 3.269019 (-0.49z)| norm 0.2168 (-1.18z)| lr 4.02e-05 | 4251.78 ms | 
31.8% bf16 MFU | 123430 tok/s step 16892/19560 | loss 3.241633 (-1.20z)| norm 0.2439 (+1.68z)| lr 4.01e-05 | 4458.29 ms | 30.3% bf16 MFU | 123139 tok/s step 16893/19560 | loss 3.269931 (-0.45z)| norm 0.2328 (+0.51z)| lr 4.01e-05 | 4458.00 ms | 30.3% bf16 MFU | 122862 tok/s step 16894/19560 | loss 3.270872 (-0.42z)| norm 0.2342 (+0.67z)| lr 4.01e-05 | 4377.63 ms | 30.8% bf16 MFU | 122707 tok/s step 16895/19560 | loss 3.216467 (-1.82z)| norm 0.2312 (+0.34z)| lr 4.00e-05 | 4292.23 ms | 31.5% bf16 MFU | 122679 tok/s step 16896/19560 | loss 3.265597 (-0.55z)| norm 0.2343 (+0.67z)| lr 4.00e-05 | 4189.40 ms | 32.2% bf16 MFU | 122802 tok/s step 16897/19560 | loss 3.294137 (+0.22z)| norm 0.2203 (-0.83z)| lr 4.00e-05 | 4369.30 ms | 30.9% bf16 MFU | 122662 tok/s step 16898/19560 | loss 3.300178 (+0.44z)| norm 0.2255 (-0.25z)| lr 4.00e-05 | 4167.49 ms | 32.4% bf16 MFU | 122819 tok/s step 16899/19560 | loss 3.215880 (-2.06z)| norm 0.2406 (+1.68z)| lr 3.99e-05 | 4153.73 ms | 32.5% bf16 MFU | 122989 tok/s step 16900/19560 | loss 3.351992 (+2.06z)| norm 0.2164 (-1.38z)| lr 3.99e-05 | 4185.18 ms | 32.3% bf16 MFU | 123103 tok/s step 16901/19560 | loss 3.321360 (+1.12z)| norm 0.2278 (+0.08z)| lr 3.99e-05 | 4173.16 ms | 32.4% bf16 MFU | 123230 tok/s step 16902/19560 | loss 3.276309 (-0.23z)| norm 0.2154 (-1.49z)| lr 3.98e-05 | 4282.19 ms | 31.5% bf16 MFU | 123190 tok/s step 16903/19560 | loss 3.290962 (+0.20z)| norm 0.2225 (-0.57z)| lr 3.98e-05 | 4166.22 ms | 32.4% bf16 MFU | 123323 tok/s step 16904/19560 | loss 3.306290 (+0.66z)| norm 0.2206 (-0.81z)| lr 3.98e-05 | 4235.12 ms | 31.9% bf16 MFU | 123346 tok/s step 16905/19560 | loss 3.232450 (-1.56z)| norm 0.2244 (-0.33z)| lr 3.97e-05 | 4192.43 ms | 32.2% bf16 MFU | 123432 tok/s step 16906/19560 | loss 3.279158 (-0.16z)| norm 0.2903 (+6.51z)| lr 3.97e-05 | 4174.70 ms | 32.3% bf16 MFU | 123540 tok/s step 16907/19560 | loss 3.254291 (-0.89z)| norm 0.2207 (-0.70z)| lr 3.97e-05 | 4214.31 ms | 32.0% bf16 MFU | 123583 tok/s step 16908/19560 
| loss 3.247619 (-1.08z)| norm 0.2175 (-1.01z)| lr 3.97e-05 | 4171.34 ms | 32.4% bf16 MFU | 123688 tok/s step 16909/19560 | loss 3.209405 (-2.21z)| norm 0.2142 (-1.34z)| lr 3.96e-05 | 4188.09 ms | 32.2% bf16 MFU | 123763 tok/s step 16910/19560 | loss 3.306777 (+0.71z)| norm 0.2221 (-0.52z)| lr 3.96e-05 | 4309.80 ms | 31.3% bf16 MFU | 123657 tok/s step 16911/19560 | loss 3.266364 (-0.50z)| norm 0.2234 (-0.38z)| lr 3.96e-05 | 4218.44 ms | 32.0% bf16 MFU | 123689 tok/s step 16912/19560 | loss 3.247952 (-1.05z)| norm 0.2355 (+0.86z)| lr 3.95e-05 | 4183.86 ms | 32.3% bf16 MFU | 123770 tok/s step 16913/19560 | loss 3.274680 (-0.25z)| norm 0.2317 (+0.47z)| lr 3.95e-05 | 4196.59 ms | 32.2% bf16 MFU | 123828 tok/s step 16914/19560 | loss 3.261397 (-0.63z)| norm 0.2268 (-0.02z)| lr 3.95e-05 | 4207.13 ms | 32.1% bf16 MFU | 123868 tok/s step 16915/19560 | loss 3.217865 (-1.89z)| norm 0.2231 (-0.41z)| lr 3.95e-05 | 4160.56 ms | 32.5% bf16 MFU | 123975 tok/s step 16916/19560 | loss 3.289066 (+0.23z)| norm 0.2343 (+0.74z)| lr 3.94e-05 | 4164.34 ms | 32.4% bf16 MFU | 124071 tok/s step 16917/19560 | loss 3.255281 (-0.81z)| norm 0.2199 (-0.76z)| lr 3.94e-05 | 4175.32 ms | 32.3% bf16 MFU | 124146 tok/s step 16918/19560 | loss 3.269698 (-0.37z)| norm 0.2190 (-0.84z)| lr 3.94e-05 | 4199.50 ms | 32.2% bf16 MFU | 124181 tok/s step 16919/19560 | loss 3.290031 (+0.26z)| norm 0.2309 (+0.39z)| lr 3.93e-05 | 4182.87 ms | 32.3% bf16 MFU | 124239 tok/s step 16920/19560 | loss 3.211740 (-2.09z)| norm 0.2263 (-0.08z)| lr 3.93e-05 | 4195.35 ms | 32.2% bf16 MFU | 124275 tok/s step 16921/19560 | loss 3.269912 (-0.31z)| norm 0.2232 (-0.41z)| lr 3.93e-05 | 4180.47 ms | 32.3% bf16 MFU | 124332 tok/s step 16922/19560 | loss 3.278603 (-0.03z)| norm 0.2214 (-0.60z)| lr 3.92e-05 | 4173.66 ms | 32.3% bf16 MFU | 124397 tok/s step 16923/19560 | loss 3.278178 (-0.04z)| norm 0.2192 (-0.82z)| lr 3.92e-05 | 4168.79 ms | 32.4% bf16 MFU | 124465 tok/s step 16924/19560 | loss 3.259864 (-0.62z)| norm 0.2235 (-0.36z)| 
lr 3.92e-05 | 4177.85 ms | 32.3% bf16 MFU | 124516 tok/s step 16925/19560 | loss 3.276000 (-0.11z)| norm 0.2189 (-0.85z)| lr 3.92e-05 | 4167.74 ms | 32.4% bf16 MFU | 124580 tok/s step 16926/19560 | loss 3.272863 (-0.20z)| norm 0.2425 (+1.63z)| lr 3.91e-05 | 4190.48 ms | 32.2% bf16 MFU | 124607 tok/s step 16927/19560 | loss 3.348117 (+2.16z)| norm 0.2198 (-0.76z)| lr 3.91e-05 | 4168.82 ms | 32.4% bf16 MFU | 124665 tok/s step 16928/19560 | loss 3.327892 (+1.51z)| norm 0.2356 (+0.90z)| lr 3.91e-05 | 4173.91 ms | 32.3% bf16 MFU | 124712 tok/s step 16929/19560 | loss 3.287629 (+0.23z)| norm 0.2266 (-0.05z)| lr 3.90e-05 | 4184.45 ms | 32.3% bf16 MFU | 124741 tok/s step 16930/19560 | loss 3.298261 (+0.56z)| norm 0.2361 (+0.93z)| lr 3.90e-05 | 4179.57 ms | 32.3% bf16 MFU | 124776 tok/s step 16931/19560 | loss 3.287672 (+0.23z)| norm 0.2177 (-0.98z)| lr 3.90e-05 | 4209.67 ms | 32.1% bf16 MFU | 124765 tok/s step 16932/19560 | loss 3.309973 (+0.95z)| norm 0.2182 (-0.91z)| lr 3.90e-05 | 4180.43 ms | 32.3% bf16 MFU | 124797 tok/s step 16933/19560 | loss 3.255787 (-0.77z)| norm 0.2254 (-0.17z)| lr 3.89e-05 | 4181.25 ms | 32.3% bf16 MFU | 124827 tok/s step 16934/19560 | loss 3.250002 (-0.94z)| norm 0.2273 (+0.02z)| lr 3.89e-05 | 4237.42 ms | 31.9% bf16 MFU | 124772 tok/s step 16935/19560 | loss 3.186796 (-2.82z)| norm 0.2234 (-0.39z)| lr 3.89e-05 | 4276.69 ms | 31.6% bf16 MFU | 124663 tok/s step 16936/19560 | loss 3.256873 (-0.67z)| norm 0.2148 (-1.29z)| lr 3.88e-05 | 4169.70 ms | 32.4% bf16 MFU | 124717 tok/s step 16937/19560 | loss 3.320203 (+1.25z)| norm 0.2372 (+1.06z)| lr 3.88e-05 | 4169.04 ms | 32.4% bf16 MFU | 124769 tok/s step 16938/19560 | loss 3.389485 (+3.20z)| norm 0.2289 (+0.18z)| lr 3.88e-05 | 4165.98 ms | 32.4% bf16 MFU | 124823 tok/s step 16939/19560 | loss 3.268751 (-0.34z)| norm 0.2223 (-0.51z)| lr 3.88e-05 | 4168.87 ms | 32.4% bf16 MFU | 124870 tok/s step 16940/19560 | loss 3.231198 (-1.42z)| norm 0.2274 (+0.03z)| lr 3.87e-05 | 4165.25 ms | 32.4% bf16 MFU | 
124920 tok/s step 16941/19560 | loss 3.296329 (+0.47z)| norm 0.2231 (-0.41z)| lr 3.87e-05 | 4182.41 ms | 32.3% bf16 MFU | 124942 tok/s step 16942/19560 | loss 3.309095 (+0.84z)| norm 0.2296 (+0.28z)| lr 3.87e-05 | 4177.33 ms | 32.3% bf16 MFU | 124970 tok/s step 16943/19560 | loss 3.395950 (+3.21z)| norm 0.2478 (+2.16z)| lr 3.86e-05 | 4164.75 ms | 32.4% bf16 MFU | 125016 tok/s step 16944/19560 | loss 3.282965 (+0.06z)| norm 0.2232 (-0.41z)| lr 3.86e-05 | 4174.11 ms | 32.3% bf16 MFU | 125045 tok/s step 16945/19560 | loss 3.305414 (+0.68z)| norm 0.2339 (+0.71z)| lr 3.86e-05 | 4171.75 ms | 32.4% bf16 MFU | 125077 tok/s step 16946/19560 | loss 3.298423 (+0.48z)| norm 0.2220 (-0.53z)| lr 3.85e-05 | 4163.76 ms | 32.4% bf16 MFU | 125119 tok/s step 16947/19560 | loss 3.258353 (-0.63z)| norm 0.2200 (-0.74z)| lr 3.85e-05 | 4171.03 ms | 32.4% bf16 MFU | 125148 tok/s step 16948/19560 | loss 3.270495 (-0.28z)| norm 0.2287 (+0.17z)| lr 3.85e-05 | 4180.67 ms | 32.3% bf16 MFU | 125161 tok/s step 16949/19560 | loss 3.200839 (-2.18z)| norm 0.2560 (+2.89z)| lr 3.85e-05 | 4175.98 ms | 32.3% bf16 MFU | 125180 tok/s step 16950/19560 | loss 3.265730 (-0.39z)| norm 0.2085 (-1.87z)| lr 3.84e-05 | 4173.63 ms | 32.4% bf16 MFU | 125202 tok/s step 16951/19560 | loss 3.245180 (-0.96z)| norm 0.2255 (-0.16z)| lr 3.84e-05 | 4169.35 ms | 32.4% bf16 MFU | 125229 tok/s step 16952/19560 | loss 3.265770 (-0.38z)| norm 0.2160 (-1.10z)| lr 3.84e-05 | 4188.02 ms | 32.2% bf16 MFU | 125227 tok/s step 16953/19560 | loss 3.291446 (+0.33z)| norm 0.2197 (-0.72z)| lr 3.83e-05 | 4189.02 ms | 32.2% bf16 MFU | 125224 tok/s step 16954/19560 | loss 3.316898 (+1.02z)| norm 0.2219 (-0.50z)| lr 3.83e-05 | 4196.07 ms | 32.2% bf16 MFU | 125210 tok/s step 16955/19560 | loss 3.260455 (-0.55z)| norm 0.2159 (-1.09z)| lr 3.83e-05 | 4224.62 ms | 32.0% bf16 MFU | 125155 tok/s step 16956/19560 | loss 3.262687 (-0.48z)| norm 0.2169 (-0.98z)| lr 3.83e-05 | 4168.20 ms | 32.4% bf16 MFU | 125186 tok/s step 16957/19560 | loss 3.265081 
(-0.41z)| norm 0.2168 (-0.98z)| lr 3.82e-05 | 4186.34 ms | 32.3% bf16 MFU | 125189 tok/s step 16958/19560 | loss 3.378850 (+2.66z)| norm 0.2255 (-0.11z)| lr 3.82e-05 | 4170.39 ms | 32.4% bf16 MFU | 125215 tok/s step 16959/19560 | loss 3.411320 (+3.35z)| norm 0.2240 (-0.26z)| lr 3.82e-05 | 4168.33 ms | 32.4% bf16 MFU | 125243 tok/s step 16960/19560 | loss 3.300343 (+0.48z)| norm 0.2193 (-0.72z)| lr 3.81e-05 | 4181.65 ms | 32.3% bf16 MFU | 125250 tok/s step 16961/19560 | loss 3.249140 (-0.84z)| norm 0.2220 (-0.46z)| lr 3.81e-05 | 4182.51 ms | 32.3% bf16 MFU | 125255 tok/s step 16962/19560 | loss 3.330513 (+1.24z)| norm 0.2256 (-0.10z)| lr 3.81e-05 | 4172.98 ms | 32.4% bf16 MFU | 125274 tok/s step 16963/19560 | loss 3.335796 (+1.37z)| norm 0.2229 (-0.37z)| lr 3.81e-05 | 4173.71 ms | 32.3% bf16 MFU | 125291 tok/s step 16964/19560 | loss 3.291700 (+0.24z)| norm 0.2267 (+0.02z)| lr 3.80e-05 | 4169.02 ms | 32.4% bf16 MFU | 125315 tok/s step 16965/19560 | loss 3.291221 (+0.23z)| norm 0.2459 (+1.92z)| lr 3.80e-05 | 4177.78 ms | 32.3% bf16 MFU | 125324 tok/s step 16966/19560 | loss 3.294710 (+0.31z)| norm 0.2196 (-0.68z)| lr 3.80e-05 | 4171.38 ms | 32.4% bf16 MFU | 125342 tok/s step 16967/19560 | loss 3.289896 (+0.18z)| norm 0.2225 (-0.40z)| lr 3.79e-05 | 4175.33 ms | 32.3% bf16 MFU | 125353 tok/s step 16968/19560 | loss 3.341995 (+1.49z)| norm 0.2288 (+0.23z)| lr 3.79e-05 | 4263.98 ms | 31.7% bf16 MFU | 125233 tok/s step 16969/19560 | loss 3.309742 (+0.66z)| norm 0.2167 (-0.96z)| lr 3.79e-05 | 4168.03 ms | 32.4% bf16 MFU | 125261 tok/s step 16970/19560 | loss 3.343046 (+1.49z)| norm 0.2272 (+0.10z)| lr 3.79e-05 | 4254.88 ms | 31.7% bf16 MFU | 125159 tok/s step 16971/19560 | loss 3.280340 (-0.09z)| norm 0.2223 (-0.39z)| lr 3.78e-05 | 4249.91 ms | 31.8% bf16 MFU | 125069 tok/s step 16972/19560 | loss 3.289032 (+0.13z)| norm 0.2180 (-0.82z)| lr 3.78e-05 | 4166.08 ms | 32.4% bf16 MFU | 125108 tok/s step 16973/19560 | loss 3.383718 (+2.45z)| norm 0.2543 (+2.81z)| lr 3.78e-05 | 
4160.43 ms | 32.5% bf16 MFU | 125154 tok/s step 16974/19560 | loss 3.271066 (-0.33z)| norm 0.2196 (-0.64z)| lr 3.77e-05 | 4160.93 ms | 32.4% bf16 MFU | 125196 tok/s step 16975/19560 | loss 3.356018 (+1.73z)| norm 0.2676 (+3.85z)| lr 3.77e-05 | 4169.11 ms | 32.4% bf16 MFU | 125224 tok/s step 16976/19560 | loss 3.247335 (-0.91z)| norm 0.2228 (-0.34z)| lr 3.77e-05 | 4181.98 ms | 32.3% bf16 MFU | 125231 tok/s step 16977/19560 | loss 3.264733 (-0.48z)| norm 0.2301 (+0.34z)| lr 3.77e-05 | 4175.46 ms | 32.3% bf16 MFU | 125248 tok/s step 16978/19560 | loss 3.276088 (-0.20z)| norm 0.2232 (-0.30z)| lr 3.76e-05 | 4180.31 ms | 32.3% bf16 MFU | 125256 tok/s step 16979/19560 | loss 3.300427 (+0.39z)| norm 0.2407 (+1.31z)| lr 3.76e-05 | 4163.65 ms | 32.4% bf16 MFU | 125290 tok/s step 16980/19560 | loss 3.233966 (-1.23z)| norm 0.2292 (+0.24z)| lr 3.76e-05 | 4163.45 ms | 32.4% bf16 MFU | 125321 tok/s step 16981/19560 | loss 3.282469 (-0.05z)| norm 0.2207 (-0.54z)| lr 3.75e-05 | 4175.86 ms | 32.3% bf16 MFU | 125333 tok/s step 16982/19560 | loss 3.282070 (-0.06z)| norm 0.2264 (-0.02z)| lr 3.75e-05 | 4184.49 ms | 32.3% bf16 MFU | 125331 tok/s step 16983/19560 | loss 3.297826 (+0.31z)| norm 0.2222 (-0.41z)| lr 3.75e-05 | 4237.24 ms | 31.9% bf16 MFU | 125251 tok/s step 16984/19560 | loss 3.208212 (-1.85z)| norm 0.2165 (-0.94z)| lr 3.75e-05 | 4175.71 ms | 32.3% bf16 MFU | 125266 tok/s step 16985/19560 | loss 3.278778 (-0.14z)| norm 0.2232 (-0.32z)| lr 3.74e-05 | 4170.78 ms | 32.4% bf16 MFU | 125288 tok/s step 16986/19560 | loss 3.311115 (+0.64z)| norm 0.2170 (-0.89z)| lr 3.74e-05 | 4158.74 ms | 32.5% bf16 MFU | 125327 tok/s step 16987/19560 | loss 3.359599 (+1.78z)| norm 0.2286 (+0.21z)| lr 3.74e-05 | 4171.72 ms | 32.4% bf16 MFU | 125345 tok/s step 16988/19560 | loss 3.312575 (+0.64z)| norm 0.2302 (+0.35z)| lr 3.73e-05 | 4173.62 ms | 32.4% bf16 MFU | 125359 tok/s step 16989/19560 | loss 3.206654 (-1.89z)| norm 0.2128 (-1.27z)| lr 3.73e-05 | 4182.92 ms | 32.3% bf16 MFU | 125358 tok/s step 
16990/19560 | loss 3.292261 (+0.18z)| norm 0.2231 (-0.30z)| lr 3.73e-05 | 4169.20 ms | 32.4% bf16 MFU | 125377 tok/s step 16991/19560 | loss 3.313885 (+0.70z)| norm 0.2204 (-0.54z)| lr 3.73e-05 | 4164.89 ms | 32.4% bf16 MFU | 125403 tok/s step 16992/19560 | loss 3.278738 (-0.15z)| norm 0.2261 (+0.01z)| lr 3.72e-05 | 4172.01 ms | 32.4% bf16 MFU | 125416 tok/s step 16993/19560 | loss 3.264195 (-0.49z)| norm 0.2328 (+0.63z)| lr 3.72e-05 | 4228.68 ms | 31.9% bf16 MFU | 125344 tok/s step 16994/19560 | loss 3.277804 (-0.17z)| norm 0.2250 (-0.10z)| lr 3.72e-05 | 4177.41 ms | 32.3% bf16 MFU | 125352 tok/s step 16995/19560 | loss 3.249742 (-0.84z)| norm 0.2210 (-0.48z)| lr 3.71e-05 | 4165.36 ms | 32.4% bf16 MFU | 125378 tok/s step 16996/19560 | loss 3.253267 (-0.76z)| norm 0.2371 (+1.04z)| lr 3.71e-05 | 4170.11 ms | 32.4% bf16 MFU | 125396 tok/s step 16997/19560 | loss 3.317631 (+0.83z)| norm 0.2156 (-0.99z)| lr 3.71e-05 | 4163.38 ms | 32.4% bf16 MFU | 125422 tok/s step 16998/19560 | loss 3.267272 (-0.40z)| norm 0.2133 (-1.20z)| lr 3.71e-05 | 4173.82 ms | 32.3% bf16 MFU | 125432 tok/s step 16999/19560 | loss 3.263752 (-0.48z)| norm 0.2228 (-0.29z)| lr 3.70e-05 | 4164.28 ms | 32.4% bf16 MFU | 125455 tok/s step 17000/19560 | loss 3.222626 (-1.47z)| norm 0.2273 (+0.13z)| lr 3.70e-05 | 4173.31 ms | 32.4% bf16 MFU | 125464 tok/s val loss 3.260574 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 
evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating 
HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3041/10042 = 0.302828 step 17001/19560 | loss 3.255007 (-0.68z)| norm 0.2303 (+0.41z)| lr 3.70e-05 | 4164.89 ms | 32.4% bf16 MFU | 125485 tok/s step 17002/19560 | loss 3.270586 (-0.29z)| norm 0.2289 (+0.28z)| lr 3.69e-05 | 4172.20 ms | 32.4% bf16 MFU | 125494 tok/s step 17003/19560 | loss 3.287067 (+0.10z)| norm 0.2102 (-1.47z)| lr 3.69e-05 | 4167.08 ms | 32.4% bf16 MFU | 125510 tok/s step 17004/19560 | loss 3.250694 (-0.78z)| norm 0.2109 (-1.39z)| lr 3.69e-05 | 4176.12 ms | 32.3% bf16 MFU | 125512 tok/s step 17005/19560 | loss 3.314769 (+0.79z)| norm 0.2287 (+0.26z)| lr 3.69e-05 | 4173.55 ms | 32.4% bf16 MFU | 125517 tok/s step 17006/19560 | loss 3.247674 (-0.85z)| norm 0.2292 (+0.30z)| lr 
3.68e-05 | 4180.97 ms | 32.3% bf16 MFU | 125511 tok/s step 17007/19560 | loss 3.239301 (-1.04z)| norm 0.2242 (-0.17z)| lr 3.68e-05 | 4477.07 ms | 30.2% bf16 MFU | 125091 tok/s step 17008/19560 | loss 3.380889 (+2.35z)| norm 0.2316 (+0.52z)| lr 3.68e-05 | 4168.49 ms | 32.4% bf16 MFU | 125125 tok/s step 17009/19560 | loss 3.266907 (-0.37z)| norm 0.2368 (+0.99z)| lr 3.67e-05 | 4178.74 ms | 32.3% bf16 MFU | 125142 tok/s step 17010/19560 | loss 3.279500 (-0.06z)| norm 0.2225 (-0.35z)| lr 3.67e-05 | 4272.56 ms | 31.6% bf16 MFU | 125020 tok/s step 17011/19560 | loss 3.299106 (+0.40z)| norm 0.2379 (+1.07z)| lr 3.67e-05 | 4173.17 ms | 32.4% bf16 MFU | 125051 tok/s step 17012/19560 | loss 3.427871 (+3.34z)| norm 0.2452 (+1.72z)| lr 3.67e-05 | 4210.87 ms | 32.1% bf16 MFU | 125024 tok/s step 17013/19560 | loss 3.220352 (-1.46z)| norm 0.2412 (+1.33z)| lr 3.66e-05 | 4289.41 ms | 31.5% bf16 MFU | 124884 tok/s step 17014/19560 | loss 3.305398 (+0.52z)| norm 0.2230 (-0.34z)| lr 3.66e-05 | 4176.23 ms | 32.3% bf16 MFU | 124917 tok/s step 17015/19560 | loss 3.286091 (+0.07z)| norm 0.2200 (-0.61z)| lr 3.66e-05 | 4170.47 ms | 32.4% bf16 MFU | 124957 tok/s step 17016/19560 | loss 3.301173 (+0.42z)| norm 0.2397 (+1.19z)| lr 3.65e-05 | 4833.66 ms | 27.9% bf16 MFU | 124132 tok/s step 17017/19560 | loss 3.278270 (-0.12z)| norm 0.2263 (-0.03z)| lr 3.65e-05 | 4175.39 ms | 32.3% bf16 MFU | 124204 tok/s step 17018/19560 | loss 3.231493 (-1.19z)| norm 0.2204 (-0.57z)| lr 3.65e-05 | 4191.49 ms | 32.2% bf16 MFU | 124248 tok/s step 17019/19560 | loss 3.260259 (-0.52z)| norm 0.2269 (+0.03z)| lr 3.65e-05 | 4179.06 ms | 32.3% bf16 MFU | 124308 tok/s step 17020/19560 | loss 3.275782 (-0.17z)| norm 0.2201 (-0.59z)| lr 3.64e-05 | 4167.83 ms | 32.4% bf16 MFU | 124383 tok/s step 17021/19560 | loss 3.262886 (-0.47z)| norm 0.2223 (-0.38z)| lr 3.64e-05 | 4164.22 ms | 32.4% bf16 MFU | 124459 tok/s step 17022/19560 | loss 3.277460 (-0.13z)| norm 0.2294 (+0.28z)| lr 3.64e-05 | 4189.73 ms | 32.2% bf16 MFU | 124493 
tok/s step 17023/19560 | loss 3.306539 (+0.54z)| norm 0.2218 (-0.41z)| lr 3.63e-05 | 4182.91 ms | 32.3% bf16 MFU | 124535 tok/s step 17024/19560 | loss 3.252295 (-0.74z)| norm 0.2308 (+0.42z)| lr 3.63e-05 | 4173.90 ms | 32.3% bf16 MFU | 124589 tok/s step 17025/19560 | loss 3.265806 (-0.41z)| norm 0.2252 (-0.10z)| lr 3.63e-05 | 4178.77 ms | 32.3% bf16 MFU | 124633 tok/s step 17026/19560 | loss 3.277867 (-0.13z)| norm 0.2206 (-0.53z)| lr 3.63e-05 | 4173.05 ms | 32.4% bf16 MFU | 124683 tok/s step 17027/19560 | loss 3.295631 (+0.28z)| norm 0.2282 (+0.20z)| lr 3.62e-05 | 4172.52 ms | 32.4% bf16 MFU | 124731 tok/s step 17028/19560 | loss 3.359641 (+1.79z)| norm 0.2346 (+0.78z)| lr 3.62e-05 | 4174.42 ms | 32.3% bf16 MFU | 124774 tok/s step 17029/19560 | loss 3.251889 (-0.75z)| norm 0.2132 (-1.21z)| lr 3.62e-05 | 4171.80 ms | 32.4% bf16 MFU | 124819 tok/s step 17030/19560 | loss 3.272790 (-0.25z)| norm 0.2198 (-0.60z)| lr 3.61e-05 | 4173.18 ms | 32.4% bf16 MFU | 124860 tok/s step 17031/19560 | loss 3.272457 (-0.25z)| norm 0.2242 (-0.19z)| lr 3.61e-05 | 4327.86 ms | 31.2% bf16 MFU | 124674 tok/s step 17032/19560 | loss 3.250703 (-0.76z)| norm 0.2178 (-0.79z)| lr 3.61e-05 | 4176.77 ms | 32.3% bf16 MFU | 124717 tok/s step 17033/19560 | loss 3.286009 (+0.07z)| norm 0.2213 (-0.45z)| lr 3.61e-05 | 4169.36 ms | 32.4% bf16 MFU | 124768 tok/s step 17034/19560 | loss 3.270115 (-0.31z)| norm 0.2241 (-0.18z)| lr 3.60e-05 | 4198.89 ms | 32.2% bf16 MFU | 124773 tok/s step 17035/19560 | loss 3.291222 (+0.19z)| norm 0.2232 (-0.28z)| lr 3.60e-05 | 4179.35 ms | 32.3% bf16 MFU | 124807 tok/s step 17036/19560 | loss 3.295337 (+0.28z)| norm 0.2228 (-0.33z)| lr 3.60e-05 | 4206.90 ms | 32.1% bf16 MFU | 124798 tok/s step 17037/19560 | loss 3.253754 (-0.73z)| norm 0.2232 (-0.29z)| lr 3.60e-05 | 4173.42 ms | 32.4% bf16 MFU | 124839 tok/s step 17038/19560 | loss 3.243885 (-0.96z)| norm 0.2369 (+1.22z)| lr 3.59e-05 | 4173.24 ms | 32.4% bf16 MFU | 124879 tok/s step 17039/19560 | loss 3.307948 
(+0.58z)| norm 0.2321 (+0.68z)| lr 3.59e-05 | 4179.68 ms | 32.3% bf16 MFU | 124907 tok/s step 17040/19560 | loss 3.406459 (+2.83z)| norm 0.2348 (+0.97z)| lr 3.59e-05 | 4171.18 ms | 32.4% bf16 MFU | 124946 tok/s step 17041/19560 | loss 3.254334 (-0.72z)| norm 0.2268 (+0.09z)| lr 3.58e-05 | 4182.64 ms | 32.3% bf16 MFU | 124966 tok/s step 17042/19560 | loss 3.264010 (-0.49z)| norm 0.2322 (+0.69z)| lr 3.58e-05 | 4360.53 ms | 31.0% bf16 MFU | 124730 tok/s step 17043/19560 | loss 3.275591 (-0.23z)| norm 0.2288 (+0.30z)| lr 3.58e-05 | 4173.06 ms | 32.4% bf16 MFU | 124775 tok/s step 17044/19560 | loss 3.216582 (-1.59z)| norm 0.2266 (+0.07z)| lr 3.58e-05 | 4168.17 ms | 32.4% bf16 MFU | 124825 tok/s step 17045/19560 | loss 3.277520 (-0.18z)| norm 0.2324 (+0.71z)| lr 3.57e-05 | 4168.78 ms | 32.4% bf16 MFU | 124872 tok/s step 17046/19560 | loss 3.295971 (+0.25z)| norm 0.2344 (+0.91z)| lr 3.57e-05 | 4191.45 ms | 32.2% bf16 MFU | 124883 tok/s step 17047/19560 | loss 3.258504 (-0.62z)| norm 0.2236 (-0.28z)| lr 3.57e-05 | 4177.09 ms | 32.3% bf16 MFU | 124915 tok/s step 17048/19560 | loss 3.280295 (-0.13z)| norm 0.2321 (+0.66z)| lr 3.56e-05 | 4182.23 ms | 32.3% bf16 MFU | 124937 tok/s step 17049/19560 | loss 3.291731 (+0.14z)| norm 0.2300 (+0.42z)| lr 3.56e-05 | 4173.89 ms | 32.3% bf16 MFU | 124971 tok/s step 17050/19560 | loss 3.273302 (-0.29z)| norm 0.2351 (+0.97z)| lr 3.56e-05 | 4173.07 ms | 32.4% bf16 MFU | 125004 tok/s step 17051/19560 | loss 3.319546 (+0.79z)| norm 0.2467 (+2.19z)| lr 3.56e-05 | 4184.88 ms | 32.3% bf16 MFU | 125018 tok/s step 17052/19560 | loss 3.238073 (-1.12z)| norm 0.2247 (-0.20z)| lr 3.55e-05 | 4178.29 ms | 32.3% bf16 MFU | 125041 tok/s step 17053/19560 | loss 3.269403 (-0.39z)| norm 0.2215 (-0.55z)| lr 3.55e-05 | 4180.24 ms | 32.3% bf16 MFU | 125060 tok/s step 17054/19560 | loss 3.321906 (+0.83z)| norm 0.2213 (-0.56z)| lr 3.55e-05 | 4178.90 ms | 32.3% bf16 MFU | 125080 tok/s step 17055/19560 | loss 3.265123 (-0.48z)| norm 0.2164 (-1.09z)| lr 3.54e-05 | 
4174.91 ms | 32.3% bf16 MFU | 125105 tok/s step 17056/19560 | loss 3.282636 (-0.06z)| norm 0.2381 (+1.28z)| lr 3.54e-05 | 4174.69 ms | 32.3% bf16 MFU | 125129 tok/s step 17057/19560 | loss 3.331533 (+1.08z)| norm 0.2419 (+1.67z)| lr 3.54e-05 | 4178.46 ms | 32.3% bf16 MFU | 125146 tok/s step 17058/19560 | loss 3.247284 (-0.89z)| norm 0.2405 (+1.51z)| lr 3.54e-05 | 4185.08 ms | 32.3% bf16 MFU | 125153 tok/s step 17059/19560 | loss 3.290985 (+0.14z)| norm 0.2358 (+0.98z)| lr 3.53e-05 | 4199.81 ms | 32.1% bf16 MFU | 125137 tok/s step 17060/19560 | loss 3.279458 (-0.13z)| norm 0.2361 (+1.00z)| lr 3.53e-05 | 4182.06 ms | 32.3% bf16 MFU | 125148 tok/s step 17061/19560 | loss 3.297570 (+0.29z)| norm 0.2279 (+0.11z)| lr 3.53e-05 | 4169.13 ms | 32.4% bf16 MFU | 125179 tok/s step 17062/19560 | loss 3.249507 (-0.84z)| norm 0.2234 (-0.37z)| lr 3.53e-05 | 4180.86 ms | 32.3% bf16 MFU | 125190 tok/s step 17063/19560 | loss 3.291380 (+0.13z)| norm 0.2229 (-0.42z)| lr 3.52e-05 | 4168.95 ms | 32.4% bf16 MFU | 125218 tok/s step 17064/19560 | loss 3.294536 (+0.20z)| norm 0.2266 (-0.03z)| lr 3.52e-05 | 4175.61 ms | 32.3% bf16 MFU | 125235 tok/s step 17065/19560 | loss 3.223325 (-1.49z)| norm 0.2231 (-0.40z)| lr 3.52e-05 | 4176.31 ms | 32.3% bf16 MFU | 125251 tok/s step 17066/19560 | loss 3.289961 (+0.13z)| norm 0.2274 (+0.07z)| lr 3.51e-05 | 4181.06 ms | 32.3% bf16 MFU | 125258 tok/s step 17067/19560 | loss 3.224693 (-1.46z)| norm 0.2291 (+0.25z)| lr 3.51e-05 | 4175.89 ms | 32.3% bf16 MFU | 125272 tok/s step 17068/19560 | loss 3.323335 (+0.93z)| norm 0.2186 (-0.88z)| lr 3.51e-05 | 4175.88 ms | 32.3% bf16 MFU | 125286 tok/s step 17069/19560 | loss 3.307424 (+0.54z)| norm 0.2186 (-0.88z)| lr 3.51e-05 | 4209.30 ms | 32.1% bf16 MFU | 125250 tok/s step 17070/19560 | loss 3.306857 (+0.53z)| norm 0.2242 (-0.27z)| lr 3.50e-05 | 4192.84 ms | 32.2% bf16 MFU | 125240 tok/s step 17071/19560 | loss 3.402371 (+2.85z)| norm 0.2211 (-0.59z)| lr 3.50e-05 | 4171.54 ms | 32.4% bf16 MFU | 125262 tok/s step 
17072/19560 | loss 3.269814 (-0.38z)| norm 0.2430 (+1.78z)| lr 3.50e-05 | 4187.06 ms | 32.2% bf16 MFU | 125259 tok/s step 17073/19560 | loss 3.243956 (-0.99z)| norm 0.2225 (-0.44z)| lr 3.49e-05 | 4173.07 ms | 32.4% bf16 MFU | 125278 tok/s step 17074/19560 | loss 3.244710 (-0.96z)| norm 0.2354 (+0.96z)| lr 3.49e-05 | 4177.12 ms | 32.3% bf16 MFU | 125290 tok/s step 17075/19560 | loss 3.292412 (+0.19z)| norm 0.2219 (-0.52z)| lr 3.49e-05 | 4169.69 ms | 32.4% bf16 MFU | 125312 tok/s step 17076/19560 | loss 3.280849 (-0.09z)| norm 0.2229 (-0.41z)| lr 3.49e-05 | 4174.48 ms | 32.3% bf16 MFU | 125326 tok/s step 17077/19560 | loss 3.324231 (+0.95z)| norm 0.2270 (+0.07z)| lr 3.48e-05 | 4557.33 ms | 29.6% bf16 MFU | 124812 tok/s step 17078/19560 | loss 3.300785 (+0.36z)| norm 0.2322 (+0.65z)| lr 3.48e-05 | 5042.94 ms | 26.8% bf16 MFU | 123770 tok/s step 17079/19560 | loss 3.253573 (-0.80z)| norm 0.2423 (+1.77z)| lr 3.48e-05 | 4421.29 ms | 30.5% bf16 MFU | 123511 tok/s step 17080/19560 | loss 3.344518 (+1.41z)| norm 0.2483 (+2.39z)| lr 3.48e-05 | 4572.87 ms | 29.5% bf16 MFU | 123068 tok/s step 17081/19560 | loss 3.280214 (-0.15z)| norm 0.2225 (-0.50z)| lr 3.47e-05 | 4327.34 ms | 31.2% bf16 MFU | 122972 tok/s step 17082/19560 | loss 3.235306 (-1.23z)| norm 0.2339 (+0.76z)| lr 3.47e-05 | 4366.18 ms | 30.9% bf16 MFU | 122827 tok/s step 17083/19560 | loss 3.261399 (-0.60z)| norm 0.2232 (-0.44z)| lr 3.47e-05 | 4273.82 ms | 31.6% bf16 MFU | 122820 tok/s step 17084/19560 | loss 3.263010 (-0.56z)| norm 0.2254 (-0.21z)| lr 3.46e-05 | 4277.90 ms | 31.6% bf16 MFU | 122807 tok/s step 17085/19560 | loss 3.251582 (-0.83z)| norm 0.2268 (-0.05z)| lr 3.46e-05 | 4205.30 ms | 32.1% bf16 MFU | 122900 tok/s step 17086/19560 | loss 3.300215 (+0.37z)| norm 0.2335 (+0.70z)| lr 3.46e-05 | 4316.29 ms | 31.3% bf16 MFU | 122828 tok/s step 17087/19560 | loss 3.256735 (-0.70z)| norm 0.2243 (-0.35z)| lr 3.46e-05 | 4305.92 ms | 31.4% bf16 MFU | 122775 tok/s step 17088/19560 | loss 3.307022 (+0.59z)| norm 
0.2475 (+2.23z)| lr 3.45e-05 | 4309.02 ms | 31.3% bf16 MFU | 122720 tok/s step 17089/19560 | loss 3.285876 (+0.04z)| norm 0.2313 (+0.41z)| lr 3.45e-05 | 4216.75 ms | 32.0% bf16 MFU | 122800 tok/s step 17090/19560 | loss 3.290189 (+0.16z)| norm 0.2300 (+0.26z)| lr 3.45e-05 | 4223.66 ms | 32.0% bf16 MFU | 122867 tok/s step 17091/19560 | loss 3.270638 (-0.34z)| norm 0.2197 (-0.88z)| lr 3.45e-05 | 4195.58 ms | 32.2% bf16 MFU | 122972 tok/s step 17092/19560 | loss 3.252565 (-0.80z)| norm 0.2270 (-0.07z)| lr 3.44e-05 | 4175.74 ms | 32.3% bf16 MFU | 123101 tok/s step 17093/19560 | loss 3.300094 (+0.44z)| norm 0.2407 (+1.47z)| lr 3.44e-05 | 4167.73 ms | 32.4% bf16 MFU | 123236 tok/s step 17094/19560 | loss 3.264515 (-0.48z)| norm 0.2271 (-0.07z)| lr 3.44e-05 | 4228.29 ms | 31.9% bf16 MFU | 123274 tok/s step 17095/19560 | loss 3.207210 (-1.93z)| norm 0.2204 (-0.82z)| lr 3.43e-05 | 4259.99 ms | 31.7% bf16 MFU | 123264 tok/s step 17096/19560 | loss 3.285219 (+0.09z)| norm 0.2246 (-0.34z)| lr 3.43e-05 | 4238.17 ms | 31.9% bf16 MFU | 123286 tok/s step 17097/19560 | loss 3.229475 (-1.33z)| norm 0.2193 (-0.95z)| lr 3.43e-05 | 4329.27 ms | 31.2% bf16 MFU | 123177 tok/s step 17098/19560 | loss 3.309042 (+0.73z)| norm 0.2355 (+0.88z)| lr 3.43e-05 | 4183.22 ms | 32.3% bf16 MFU | 123284 tok/s step 17099/19560 | loss 3.272814 (-0.21z)| norm 0.2158 (-1.33z)| lr 3.42e-05 | 4170.92 ms | 32.4% bf16 MFU | 123405 tok/s step 17100/19560 | loss 3.340813 (+1.53z)| norm 0.2260 (-0.19z)| lr 3.42e-05 | 4414.65 ms | 30.6% bf16 MFU | 123173 tok/s step 17101/19560 | loss 3.218529 (-1.61z)| norm 0.2293 (+0.21z)| lr 3.42e-05 | 4357.43 ms | 31.0% bf16 MFU | 123030 tok/s step 17102/19560 | loss 3.299399 (+0.50z)| norm 0.2317 (+0.48z)| lr 3.42e-05 | 4355.18 ms | 31.0% bf16 MFU | 122898 tok/s step 17103/19560 | loss 3.270366 (-0.25z)| norm 0.2220 (-0.68z)| lr 3.41e-05 | 4228.27 ms | 31.9% bf16 MFU | 122953 tok/s step 17104/19560 | loss 3.296654 (+0.44z)| norm 0.2228 (-0.57z)| lr 3.41e-05 | 4168.32 ms | 
32.4% bf16 MFU | 123094 tok/s step 17105/19560 | loss 3.290885 (+0.28z)| norm 0.2197 (-0.95z)| lr 3.41e-05 | 4306.75 ms | 31.4% bf16 MFU | 123026 tok/s step 17106/19560 | loss 3.196779 (-2.16z)| norm 0.2240 (-0.41z)| lr 3.40e-05 | 4224.79 ms | 32.0% bf16 MFU | 123080 tok/s step 17107/19560 | loss 3.238714 (-1.05z)| norm 0.2234 (-0.48z)| lr 3.40e-05 | 4176.89 ms | 32.3% bf16 MFU | 123202 tok/s step 17108/19560 | loss 3.240428 (-1.01z)| norm 0.2252 (-0.24z)| lr 3.40e-05 | 4264.16 ms | 31.7% bf16 MFU | 123189 tok/s step 17109/19560 | loss 3.320932 (+1.08z)| norm 0.2495 (+2.79z)| lr 3.40e-05 | 4171.71 ms | 32.4% bf16 MFU | 123314 tok/s step 17110/19560 | loss 3.271146 (-0.21z)| norm 0.2222 (-0.63z)| lr 3.39e-05 | 4169.04 ms | 32.4% bf16 MFU | 123436 tok/s step 17111/19560 | loss 3.240715 (-0.99z)| norm 0.2152 (-1.49z)| lr 3.39e-05 | 4175.91 ms | 32.3% bf16 MFU | 123542 tok/s step 17112/19560 | loss 3.224381 (-1.42z)| norm 0.2265 (-0.09z)| lr 3.39e-05 | 4167.56 ms | 32.4% bf16 MFU | 123655 tok/s step 17113/19560 | loss 3.209461 (-1.78z)| norm 0.2293 (+0.25z)| lr 3.39e-05 | 4545.10 ms | 29.7% bf16 MFU | 123240 tok/s step 17114/19560 | loss 3.230014 (-1.23z)| norm 0.2632 (+4.17z)| lr 3.38e-05 | 4173.52 ms | 32.4% bf16 MFU | 123359 tok/s step 17115/19560 | loss 3.190719 (-2.20z)| norm 0.2253 (-0.27z)| lr 3.38e-05 | 4293.91 ms | 31.4% bf16 MFU | 123296 tok/s step 17116/19560 | loss 3.216493 (-1.51z)| norm 0.2278 (+0.02z)| lr 3.38e-05 | 4173.16 ms | 32.4% bf16 MFU | 123413 tok/s step 17117/19560 | loss 3.250740 (-0.65z)| norm 0.2334 (+0.67z)| lr 3.37e-05 | 4185.77 ms | 32.3% bf16 MFU | 123505 tok/s step 17118/19560 | loss 3.304041 (+0.71z)| norm 0.2176 (-1.19z)| lr 3.37e-05 | 4258.43 ms | 31.7% bf16 MFU | 123485 tok/s step 17119/19560 | loss 3.211937 (-1.62z)| norm 0.2517 (+2.73z)| lr 3.37e-05 | 4181.21 ms | 32.3% bf16 MFU | 123581 tok/s step 17120/19560 | loss 3.248407 (-0.68z)| norm 0.2279 (-0.01z)| lr 3.37e-05 | 4177.25 ms | 32.3% bf16 MFU | 123677 tok/s step 17121/19560 
| loss 3.211386 (-1.60z)| norm 0.2196 (-0.95z)| lr 3.36e-05 | 4611.89 ms | 29.3% bf16 MFU | 123177 tok/s step 17122/19560 | loss 3.266092 (-0.22z)| norm 0.2278 (-0.01z)| lr 3.36e-05 | 4181.98 ms | 32.3% bf16 MFU | 123287 tok/s step 17123/19560 | loss 3.240194 (-0.87z)| norm 0.2293 (+0.16z)| lr 3.36e-05 | 4287.23 ms | 31.5% bf16 MFU | 123237 tok/s step 17124/19560 | loss 3.281275 (+0.16z)| norm 0.2164 (-1.31z)| lr 3.36e-05 | 4177.93 ms | 32.3% bf16 MFU | 123350 tok/s step 17125/19560 | loss 3.280834 (+0.16z)| norm 0.2177 (-1.16z)| lr 3.35e-05 | 4173.06 ms | 32.4% bf16 MFU | 123464 tok/s step 17126/19560 | loss 3.276046 (+0.04z)| norm 0.2324 (+0.51z)| lr 3.35e-05 | 4185.31 ms | 32.3% bf16 MFU | 123554 tok/s step 17127/19560 | loss 3.262948 (-0.30z)| norm 0.2177 (-1.18z)| lr 3.35e-05 | 4207.04 ms | 32.1% bf16 MFU | 123608 tok/s step 17128/19560 | loss 3.286724 (+0.29z)| norm 0.2232 (-0.54z)| lr 3.34e-05 | 4182.04 ms | 32.3% bf16 MFU | 123696 tok/s step 17129/19560 | loss 3.228516 (-1.18z)| norm 0.2185 (-1.07z)| lr 3.34e-05 | 4176.19 ms | 32.3% bf16 MFU | 123788 tok/s step 17130/19560 | loss 3.240966 (-0.85z)| norm 0.2379 (+1.15z)| lr 3.34e-05 | 4177.07 ms | 32.3% bf16 MFU | 123874 tok/s step 17131/19560 | loss 3.258312 (-0.41z)| norm 0.2438 (+1.80z)| lr 3.34e-05 | 4189.22 ms | 32.2% bf16 MFU | 123938 tok/s step 17132/19560 | loss 3.289852 (+0.38z)| norm 0.2325 (+0.49z)| lr 3.33e-05 | 4190.09 ms | 32.2% bf16 MFU | 123998 tok/s step 17133/19560 | loss 3.278695 (+0.11z)| norm 0.2196 (-1.00z)| lr 3.33e-05 | 4180.53 ms | 32.3% bf16 MFU | 124068 tok/s step 17134/19560 | loss 3.280490 (+0.15z)| norm 0.2399 (+1.34z)| lr 3.33e-05 | 4185.64 ms | 32.3% bf16 MFU | 124128 tok/s step 17135/19560 | loss 3.293854 (+0.48z)| norm 0.2308 (+0.29z)| lr 3.33e-05 | 4170.98 ms | 32.4% bf16 MFU | 124206 tok/s step 17136/19560 | loss 3.253251 (-0.55z)| norm 0.2252 (-0.35z)| lr 3.32e-05 | 4179.08 ms | 32.3% bf16 MFU | 124269 tok/s step 17137/19560 | loss 3.284942 (+0.28z)| norm 0.2296 (+0.16z)| 
lr 3.32e-05 | 4182.08 ms | 32.3% bf16 MFU | 124324 tok/s step 17138/19560 | loss 3.205715 (-1.76z)| norm 0.2412 (+1.47z)| lr 3.32e-05 | 4183.87 ms | 32.3% bf16 MFU | 124373 tok/s step 17139/19560 | loss 3.251133 (-0.58z)| norm 0.2256 (-0.31z)| lr 3.31e-05 | 4173.15 ms | 32.4% bf16 MFU | 124436 tok/s step 17140/19560 | loss 3.250175 (-0.60z)| norm 0.2354 (+0.84z)| lr 3.31e-05 | 4184.97 ms | 32.3% bf16 MFU | 124478 tok/s step 17141/19560 | loss 3.271656 (-0.02z)| norm 0.2427 (+1.69z)| lr 3.31e-05 | 4188.90 ms | 32.2% bf16 MFU | 124512 tok/s step 17142/19560 | loss 3.251211 (-0.58z)| norm 0.2278 (-0.06z)| lr 3.31e-05 | 4184.56 ms | 32.3% bf16 MFU | 124551 tok/s step 17143/19560 | loss 3.257441 (-0.40z)| norm 0.2231 (-0.61z)| lr 3.30e-05 | 4196.42 ms | 32.2% bf16 MFU | 124571 tok/s step 17144/19560 | loss 3.263839 (-0.21z)| norm 0.2216 (-0.77z)| lr 3.30e-05 | 4176.05 ms | 32.3% bf16 MFU | 124619 tok/s step 17145/19560 | loss 3.305722 (+0.95z)| norm 0.2163 (-1.37z)| lr 3.30e-05 | 4195.59 ms | 32.2% bf16 MFU | 124637 tok/s step 17146/19560 | loss 3.258655 (-0.37z)| norm 0.2223 (-0.68z)| lr 3.30e-05 | 4180.29 ms | 32.3% bf16 MFU | 124676 tok/s step 17147/19560 | loss 3.297680 (+0.71z)| norm 0.2327 (+0.53z)| lr 3.29e-05 | 4191.09 ms | 32.2% bf16 MFU | 124697 tok/s step 17148/19560 | loss 3.239980 (-0.89z)| norm 0.2279 (-0.04z)| lr 3.29e-05 | 4169.44 ms | 32.4% bf16 MFU | 124749 tok/s step 17149/19560 | loss 3.286176 (+0.39z)| norm 0.2276 (-0.08z)| lr 3.29e-05 | 4184.33 ms | 32.3% bf16 MFU | 124777 tok/s step 17150/19560 | loss 3.280732 (+0.24z)| norm 0.2167 (-1.34z)| lr 3.28e-05 | 4186.83 ms | 32.2% bf16 MFU | 124799 tok/s step 17151/19560 | loss 3.240798 (-0.86z)| norm 0.2084 (-2.25z)| lr 3.28e-05 | 4186.29 ms | 32.3% bf16 MFU | 124821 tok/s step 17152/19560 | loss 3.259017 (-0.35z)| norm 0.2337 (+0.64z)| lr 3.28e-05 | 4190.10 ms | 32.2% bf16 MFU | 124836 tok/s step 17153/19560 | loss 3.229890 (-1.15z)| norm 0.2252 (-0.33z)| lr 3.28e-05 | 4182.69 ms | 32.3% bf16 MFU | 
124862 tok/s step 17154/19560 | loss 3.227351 (-1.21z)| norm 0.2172 (-1.24z)| lr 3.27e-05 | 4184.85 ms | 32.3% bf16 MFU | 124883 tok/s step 17155/19560 | loss 3.276019 (+0.14z)| norm 0.2164 (-1.31z)| lr 3.27e-05 | 4173.78 ms | 32.3% bf16 MFU | 124919 tok/s step 17156/19560 | loss 3.273946 (+0.11z)| norm 0.2302 (+0.26z)| lr 3.27e-05 | 4169.11 ms | 32.4% bf16 MFU | 124961 tok/s step 17157/19560 | loss 3.288373 (+0.51z)| norm 0.2140 (-1.58z)| lr 3.27e-05 | 4176.78 ms | 32.3% bf16 MFU | 124989 tok/s step 17158/19560 | loss 3.209876 (-1.69z)| norm 0.2157 (-1.38z)| lr 3.26e-05 | 4174.58 ms | 32.3% bf16 MFU | 125019 tok/s step 17159/19560 | loss 3.250247 (-0.55z)| norm 0.2260 (-0.22z)| lr 3.26e-05 | 4179.42 ms | 32.3% bf16 MFU | 125041 tok/s step 17160/19560 | loss 3.220509 (-1.37z)| norm 0.2167 (-1.27z)| lr 3.26e-05 | 4177.96 ms | 32.3% bf16 MFU | 125063 tok/s step 17161/19560 | loss 3.277410 (+0.22z)| norm 0.2187 (-1.04z)| lr 3.26e-05 | 4179.47 ms | 32.3% bf16 MFU | 125082 tok/s step 17162/19560 | loss 3.239409 (-0.83z)| norm 0.2095 (-2.03z)| lr 3.25e-05 | 4192.74 ms | 32.2% bf16 MFU | 125080 tok/s step 17163/19560 | loss 3.249938 (-0.53z)| norm 0.2227 (-0.56z)| lr 3.25e-05 | 4193.75 ms | 32.2% bf16 MFU | 125077 tok/s step 17164/19560 | loss 3.262552 (-0.17z)| norm 0.2239 (-0.43z)| lr 3.25e-05 | 4172.34 ms | 32.4% bf16 MFU | 125106 tok/s step 17165/19560 | loss 3.270484 (+0.05z)| norm 0.2279 (+0.01z)| lr 3.24e-05 | 4179.74 ms | 32.3% bf16 MFU | 125123 tok/s step 17166/19560 | loss 3.276365 (+0.21z)| norm 0.2201 (-0.84z)| lr 3.24e-05 | 4179.39 ms | 32.3% bf16 MFU | 125139 tok/s step 17167/19560 | loss 3.341111 (+1.98z)| norm 0.2219 (-0.63z)| lr 3.24e-05 | 4197.06 ms | 32.2% bf16 MFU | 125128 tok/s step 17168/19560 | loss 3.260026 (-0.24z)| norm 0.2316 (+0.45z)| lr 3.24e-05 | 4178.83 ms | 32.3% bf16 MFU | 125144 tok/s step 17169/19560 | loss 3.247231 (-0.61z)| norm 0.2166 (-1.21z)| lr 3.23e-05 | 4196.36 ms | 32.2% bf16 MFU | 125134 tok/s step 17170/19560 | loss 3.273215 
(+0.15z)| norm 0.2469 (+2.11z)| lr 3.23e-05 | 4178.93 ms | 32.3% bf16 MFU | 125150 tok/s step 17171/19560 | loss 3.293993 (+0.75z)| norm 0.2186 (-0.97z)| lr 3.23e-05 | 4184.11 ms | 32.3% bf16 MFU | 125158 tok/s step 17172/19560 | loss 3.274226 (+0.16z)| norm 0.2285 (+0.10z)| lr 3.23e-05 | 4163.53 ms | 32.4% bf16 MFU | 125196 tok/s step 17173/19560 | loss 3.292192 (+0.69z)| norm 0.2279 (+0.05z)| lr 3.22e-05 | 4172.54 ms | 32.4% bf16 MFU | 125219 tok/s step 17174/19560 | loss 3.280187 (+0.34z)| norm 0.2209 (-0.70z)| lr 3.22e-05 | 4178.36 ms | 32.3% bf16 MFU | 125232 tok/s step 17175/19560 | loss 3.330709 (+1.79z)| norm 0.2223 (-0.55z)| lr 3.22e-05 | 4180.81 ms | 32.3% bf16 MFU | 125241 tok/s step 17176/19560 | loss 3.297564 (+0.82z)| norm 0.2248 (-0.27z)| lr 3.22e-05 | 4180.74 ms | 32.3% bf16 MFU | 125249 tok/s step 17177/19560 | loss 3.288448 (+0.55z)| norm 0.2250 (-0.25z)| lr 3.21e-05 | 4181.07 ms | 32.3% bf16 MFU | 125256 tok/s step 17178/19560 | loss 3.375429 (+2.96z)| norm 0.2153 (-1.28z)| lr 3.21e-05 | 4190.03 ms | 32.2% bf16 MFU | 125250 tok/s step 17179/19560 | loss 3.304650 (+0.97z)| norm 0.2213 (-0.63z)| lr 3.21e-05 | 4198.77 ms | 32.2% bf16 MFU | 125231 tok/s step 17180/19560 | loss 3.257966 (-0.35z)| norm 0.2172 (-1.06z)| lr 3.20e-05 | 4185.60 ms | 32.3% bf16 MFU | 125232 tok/s step 17181/19560 | loss 3.272025 (+0.05z)| norm 0.2254 (-0.17z)| lr 3.20e-05 | 4186.03 ms | 32.3% bf16 MFU | 125233 tok/s step 17182/19560 | loss 3.255462 (-0.41z)| norm 0.2163 (-1.16z)| lr 3.20e-05 | 4166.33 ms | 32.4% bf16 MFU | 125263 tok/s step 17183/19560 | loss 3.275171 (+0.15z)| norm 0.2318 (+0.53z)| lr 3.20e-05 | 4230.01 ms | 31.9% bf16 MFU | 125197 tok/s step 17184/19560 | loss 3.277563 (+0.22z)| norm 0.2243 (-0.29z)| lr 3.19e-05 | 4177.61 ms | 32.3% bf16 MFU | 125212 tok/s step 17185/19560 | loss 3.325466 (+1.59z)| norm 0.2249 (-0.21z)| lr 3.19e-05 | 4187.83 ms | 32.2% bf16 MFU | 125211 tok/s step 17186/19560 | loss 3.234379 (-1.01z)| norm 0.2236 (-0.35z)| lr 3.19e-05 | 
4197.02 ms | 32.2% bf16 MFU | 125197 tok/s step 17187/19560 | loss 3.291589 (+0.63z)| norm 0.2144 (-1.36z)| lr 3.19e-05 | 4171.85 ms | 32.4% bf16 MFU | 125221 tok/s step 17188/19560 | loss 3.278543 (+0.25z)| norm 0.2226 (-0.42z)| lr 3.18e-05 | 4182.35 ms | 32.3% bf16 MFU | 125227 tok/s step 17189/19560 | loss 3.246821 (-0.64z)| norm 0.2183 (-0.90z)| lr 3.18e-05 | 4178.31 ms | 32.3% bf16 MFU | 125240 tok/s step 17190/19560 | loss 3.240406 (-0.82z)| norm 0.2129 (-1.48z)| lr 3.18e-05 | 4172.92 ms | 32.4% bf16 MFU | 125260 tok/s step 17191/19560 | loss 3.290155 (+0.60z)| norm 0.2198 (-0.71z)| lr 3.18e-05 | 4183.39 ms | 32.3% bf16 MFU | 125263 tok/s step 17192/19560 | loss 3.250561 (-0.52z)| norm 0.2228 (-0.37z)| lr 3.17e-05 | 4181.44 ms | 32.3% bf16 MFU | 125269 tok/s step 17193/19560 | loss 3.293189 (+0.68z)| norm 0.2210 (-0.58z)| lr 3.17e-05 | 4501.45 ms | 30.0% bf16 MFU | 124829 tok/s step 17194/19560 | loss 3.262167 (-0.20z)| norm 0.2242 (-0.21z)| lr 3.17e-05 | 4223.42 ms | 32.0% bf16 MFU | 124795 tok/s step 17195/19560 | loss 3.207506 (-1.75z)| norm 0.2235 (-0.28z)| lr 3.17e-05 | 4172.24 ms | 32.4% bf16 MFU | 124838 tok/s step 17196/19560 | loss 3.250883 (-0.51z)| norm 0.2168 (-1.02z)| lr 3.16e-05 | 4178.91 ms | 32.3% bf16 MFU | 124869 tok/s step 17197/19560 | loss 3.267157 (-0.03z)| norm 0.2215 (-0.51z)| lr 3.16e-05 | 4163.53 ms | 32.4% bf16 MFU | 124922 tok/s step 17198/19560 | loss 3.256993 (-0.31z)| norm 0.2111 (-1.63z)| lr 3.16e-05 | 4186.52 ms | 32.3% bf16 MFU | 124938 tok/s step 17199/19560 | loss 3.338540 (+2.17z)| norm 0.2403 (+1.55z)| lr 3.15e-05 | 4215.45 ms | 32.0% bf16 MFU | 124909 tok/s step 17200/19560 | loss 3.321502 (+1.62z)| norm 0.2286 (+0.29z)| lr 3.15e-05 | 4202.99 ms | 32.1% bf16 MFU | 124901 tok/s step 17201/19560 | loss 3.257699 (-0.30z)| norm 0.2236 (-0.27z)| lr 3.15e-05 | 4176.77 ms | 32.3% bf16 MFU | 124932 tok/s step 17202/19560 | loss 3.268996 (+0.03z)| norm 0.2224 (-0.39z)| lr 3.15e-05 | 4228.29 ms | 31.9% bf16 MFU | 124885 tok/s step 
17203/19560 | loss 3.277671 (+0.30z)| norm 0.2240 (-0.21z)| lr 3.14e-05 | 4189.37 ms | 32.2% bf16 MFU | 124898 tok/s step 17204/19560 | loss 3.344639 (+2.26z)| norm 0.2477 (+2.35z)| lr 3.14e-05 | 4182.17 ms | 32.3% bf16 MFU | 124922 tok/s step 17205/19560 | loss 3.226393 (-1.23z)| norm 0.2155 (-1.14z)| lr 3.14e-05 | 4164.48 ms | 32.4% bf16 MFU | 124970 tok/s step 17206/19560 | loss 3.231103 (-1.07z)| norm 0.2236 (-0.25z)| lr 3.14e-05 | 4331.13 ms | 31.2% bf16 MFU | 124774 tok/s step 17207/19560 | loss 3.200454 (-1.94z)| norm 0.2202 (-0.61z)| lr 3.13e-05 | 4182.46 ms | 32.3% bf16 MFU | 124803 tok/s step 17208/19560 | loss 3.253390 (-0.37z)| norm 0.2195 (-0.68z)| lr 3.13e-05 | 4260.75 ms | 31.7% bf16 MFU | 124716 tok/s step 17209/19560 | loss 3.311883 (+1.36z)| norm 0.2229 (-0.30z)| lr 3.13e-05 | 4177.93 ms | 32.3% bf16 MFU | 124754 tok/s step 17210/19560 | loss 3.272697 (+0.19z)| norm 0.2200 (-0.61z)| lr 3.13e-05 | 4245.99 ms | 31.8% bf16 MFU | 124691 tok/s step 17211/19560 | loss 3.254055 (-0.37z)| norm 0.2376 (+1.35z)| lr 3.12e-05 | 4231.37 ms | 31.9% bf16 MFU | 124651 tok/s step 17212/19560 | loss 3.296473 (+0.89z)| norm 0.2118 (-1.51z)| lr 3.12e-05 | 4168.87 ms | 32.4% bf16 MFU | 124707 tok/s step 17213/19560 | loss 3.265529 (-0.04z)| norm 0.2303 (+0.53z)| lr 3.12e-05 | 4176.52 ms | 32.3% bf16 MFU | 124748 tok/s step 17214/19560 | loss 3.245599 (-0.62z)| norm 0.2226 (-0.31z)| lr 3.12e-05 | 4207.84 ms | 32.1% bf16 MFU | 124741 tok/s step 17215/19560 | loss 3.281784 (+0.46z)| norm 0.2186 (-0.75z)| lr 3.11e-05 | 4219.00 ms | 32.0% bf16 MFU | 124717 tok/s step 17216/19560 | loss 3.212711 (-1.58z)| norm 0.2223 (-0.33z)| lr 3.11e-05 | 4183.76 ms | 32.3% bf16 MFU | 124747 tok/s step 17217/19560 | loss 3.342727 (+2.23z)| norm 0.2240 (-0.13z)| lr 3.11e-05 | 4241.69 ms | 31.8% bf16 MFU | 124690 tok/s step 17218/19560 | loss 3.229309 (-1.06z)| norm 0.2255 (+0.05z)| lr 3.10e-05 | 4260.40 ms | 31.7% bf16 MFU | 124608 tok/s step 17219/19560 | loss 3.258362 (-0.21z)| norm 
0.2195 (-0.63z)| lr 3.10e-05 | 4167.47 ms | 32.4% bf16 MFU | 124668 tok/s step 17220/19560 | loss 3.256029 (-0.28z)| norm 0.2307 (+0.64z)| lr 3.10e-05 | 4209.64 ms | 32.1% bf16 MFU | 124662 tok/s step 17221/19560 | loss 3.251370 (-0.41z)| norm 0.2244 (-0.07z)| lr 3.10e-05 | 4209.23 ms | 32.1% bf16 MFU | 124657 tok/s step 17222/19560 | loss 3.255229 (-0.29z)| norm 0.2262 (+0.14z)| lr 3.09e-05 | 4168.51 ms | 32.4% bf16 MFU | 124713 tok/s step 17223/19560 | loss 3.204165 (-1.78z)| norm 0.2223 (-0.31z)| lr 3.09e-05 | 4171.46 ms | 32.4% bf16 MFU | 124761 tok/s step 17224/19560 | loss 3.269451 (+0.13z)| norm 0.2122 (-1.45z)| lr 3.09e-05 | 4172.61 ms | 32.4% bf16 MFU | 124806 tok/s step 17225/19560 | loss 3.274621 (+0.27z)| norm 0.2321 (+0.81z)| lr 3.09e-05 | 4221.10 ms | 32.0% bf16 MFU | 124776 tok/s step 17226/19560 | loss 3.207936 (-1.65z)| norm 0.2260 (+0.13z)| lr 3.08e-05 | 4169.19 ms | 32.4% bf16 MFU | 124824 tok/s step 17227/19560 | loss 3.235544 (-0.84z)| norm 0.2177 (-0.82z)| lr 3.08e-05 | 4171.71 ms | 32.4% bf16 MFU | 124867 tok/s step 17228/19560 | loss 3.308531 (+1.31z)| norm 0.2261 (+0.14z)| lr 3.08e-05 | 4187.20 ms | 32.2% bf16 MFU | 124884 tok/s step 17229/19560 | loss 3.196077 (-1.98z)| norm 0.2272 (+0.27z)| lr 3.08e-05 | 4240.43 ms | 31.8% bf16 MFU | 124822 tok/s step 17230/19560 | loss 3.230421 (-0.96z)| norm 0.2176 (-0.82z)| lr 3.07e-05 | 4181.40 ms | 32.3% bf16 MFU | 124850 tok/s step 17231/19560 | loss 3.264433 (+0.03z)| norm 0.2222 (-0.30z)| lr 3.07e-05 | 4289.82 ms | 31.5% bf16 MFU | 124719 tok/s step 17232/19560 | loss 3.228033 (-1.02z)| norm 0.2281 (+0.37z)| lr 3.07e-05 | 4237.84 ms | 31.9% bf16 MFU | 124669 tok/s step 17233/19560 | loss 3.207632 (-1.58z)| norm 0.2278 (+0.33z)| lr 3.07e-05 | 4179.04 ms | 32.3% bf16 MFU | 124708 tok/s step 17234/19560 | loss 3.288527 (+0.75z)| norm 0.2208 (-0.47z)| lr 3.06e-05 | 4169.91 ms | 32.4% bf16 MFU | 124759 tok/s step 17235/19560 | loss 3.269298 (+0.18z)| norm 0.2258 (+0.11z)| lr 3.06e-05 | 4280.04 ms | 
31.5% bf16 MFU | 124646 tok/s step 17236/19560 | loss 3.337699 (+2.14z)| norm 0.2276 (+0.31z)| lr 3.06e-05 | 4168.76 ms | 32.4% bf16 MFU | 124702 tok/s step 17237/19560 | loss 3.231769 (-0.92z)| norm 0.2366 (+1.39z)| lr 3.06e-05 | 4220.71 ms | 32.0% bf16 MFU | 124678 tok/s step 17238/19560 | loss 3.254171 (-0.26z)| norm 0.2217 (-0.36z)| lr 3.05e-05 | 4168.44 ms | 32.4% bf16 MFU | 124733 tok/s step 17239/19560 | loss 3.250776 (-0.36z)| norm 0.2300 (+0.60z)| lr 3.05e-05 | 4172.58 ms | 32.4% bf16 MFU | 124779 tok/s step 17240/19560 | loss 3.242361 (-0.61z)| norm 0.2185 (-0.74z)| lr 3.05e-05 | 4223.80 ms | 32.0% bf16 MFU | 124746 tok/s step 17241/19560 | loss 3.293815 (+0.88z)| norm 0.2398 (+1.73z)| lr 3.04e-05 | 4206.02 ms | 32.1% bf16 MFU | 124741 tok/s step 17242/19560 | loss 3.274117 (+0.29z)| norm 0.2304 (+0.73z)| lr 3.04e-05 | 4163.93 ms | 32.4% bf16 MFU | 124800 tok/s step 17243/19560 | loss 3.240064 (-0.74z)| norm 0.2136 (-1.38z)| lr 3.04e-05 | 4170.27 ms | 32.4% bf16 MFU | 124846 tok/s step 17244/19560 | loss 3.222763 (-1.27z)| norm 0.2197 (-0.60z)| lr 3.04e-05 | 4193.66 ms | 32.2% bf16 MFU | 124854 tok/s step 17245/19560 | loss 3.239002 (-0.77z)| norm 0.2187 (-0.72z)| lr 3.03e-05 | 4176.68 ms | 32.3% bf16 MFU | 124888 tok/s step 17246/19560 | loss 3.288246 (+0.72z)| norm 0.2186 (-0.73z)| lr 3.03e-05 | 4165.29 ms | 32.4% bf16 MFU | 124937 tok/s step 17247/19560 | loss 3.288604 (+0.72z)| norm 0.2268 (+0.34z)| lr 3.03e-05 | 4172.88 ms | 32.4% bf16 MFU | 124972 tok/s step 17248/19560 | loss 3.272365 (+0.22z)| norm 0.2159 (-1.09z)| lr 3.03e-05 | 4256.74 ms | 31.7% bf16 MFU | 124882 tok/s step 17249/19560 | loss 3.230714 (-1.07z)| norm 0.2131 (-1.44z)| lr 3.02e-05 | 4206.64 ms | 32.1% bf16 MFU | 124870 tok/s step 17250/19560 | loss 3.296754 (+0.95z)| norm 0.2191 (-0.64z)| lr 3.02e-05 | 4169.32 ms | 32.4% bf16 MFU | 124914 tok/s val loss 3.258652 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 
evaluating HellaSwag: 1250/1256 HellaSwag: 3049/10042 = 0.303625 step 17251/19560 | loss 3.240633 (-0.77z)| norm 0.2171 (-0.89z)| lr 3.02e-05 | 4203.11 ms | 32.1% bf16 MFU | 124905 tok/s step 17252/19560 | loss 3.234885 (-0.93z)| 
norm 0.2168 (-0.93z)| lr 3.02e-05 | 4224.58 ms | 32.0% bf16 MFU | 124865 tok/s step 17253/19560 | loss 3.334846 (+2.07z)| norm 0.2267 (+0.36z)| lr 3.01e-05 | 4181.33 ms | 32.3% bf16 MFU | 124891 tok/s step 17254/19560 | loss 3.239353 (-0.78z)| norm 0.2105 (-1.73z)| lr 3.01e-05 | 4178.34 ms | 32.3% bf16 MFU | 124920 tok/s step 17255/19560 | loss 3.227349 (-1.13z)| norm 0.2338 (+1.27z)| lr 3.01e-05 | 4178.62 ms | 32.3% bf16 MFU | 124948 tok/s step 17256/19560 | loss 3.284827 (+0.59z)| norm 0.2202 (-0.48z)| lr 3.01e-05 | 4174.07 ms | 32.3% bf16 MFU | 124981 tok/s step 17257/19560 | loss 3.191399 (-2.16z)| norm 0.2231 (-0.11z)| lr 3.00e-05 | 4176.51 ms | 32.3% bf16 MFU | 125008 tok/s step 17258/19560 | loss 3.271504 (+0.19z)| norm 0.2246 (+0.09z)| lr 3.00e-05 | 4175.90 ms | 32.3% bf16 MFU | 125035 tok/s step 17259/19560 | loss 3.284070 (+0.55z)| norm 0.2229 (-0.10z)| lr 3.00e-05 | 4177.97 ms | 32.3% bf16 MFU | 125058 tok/s step 17260/19560 | loss 3.290569 (+0.74z)| norm 0.2227 (-0.13z)| lr 3.00e-05 | 4169.67 ms | 32.4% bf16 MFU | 125092 tok/s step 17261/19560 | loss 3.286730 (+0.63z)| norm 0.2212 (-0.33z)| lr 2.99e-05 | 4187.30 ms | 32.2% bf16 MFU | 125098 tok/s step 17262/19560 | loss 3.289787 (+0.71z)| norm 0.2583 (+4.41z)| lr 2.99e-05 | 4166.05 ms | 32.4% bf16 MFU | 125135 tok/s step 17263/19560 | loss 3.200995 (-1.85z)| norm 0.2234 (-0.04z)| lr 2.99e-05 | 4199.42 ms | 32.2% bf16 MFU | 125121 tok/s step 17264/19560 | loss 3.198863 (-1.87z)| norm 0.2240 (+0.03z)| lr 2.99e-05 | 4177.95 ms | 32.3% bf16 MFU | 125139 tok/s step 17265/19560 | loss 3.308769 (+1.26z)| norm 0.2177 (-0.76z)| lr 2.98e-05 | 4170.45 ms | 32.4% bf16 MFU | 125168 tok/s step 17266/19560 | loss 3.259141 (-0.17z)| norm 0.2133 (-1.30z)| lr 2.98e-05 | 4587.39 ms | 29.4% bf16 MFU | 124624 tok/s step 17267/19560 | loss 3.310388 (+1.29z)| norm 0.2142 (-1.17z)| lr 2.98e-05 | 4280.71 ms | 31.5% bf16 MFU | 124517 tok/s step 17268/19560 | loss 3.230196 (-1.00z)| norm 0.2289 (+0.74z)| lr 2.98e-05 | 5103.68 ms 
| 26.5% bf16 MFU | 123427 tok/s step 17269/19560 | loss 3.298866 (+0.95z)| norm 0.2277 (+0.61z)| lr 2.97e-05 | 4907.53 ms | 27.5% bf16 MFU | 122598 tok/s step 17270/19560 | loss 3.270519 (+0.14z)| norm 0.2258 (+0.36z)| lr 2.97e-05 | 4534.03 ms | 29.8% bf16 MFU | 122250 tok/s step 17271/19560 | loss 3.289896 (+0.68z)| norm 0.2283 (+0.69z)| lr 2.97e-05 | 4573.29 ms | 29.5% bf16 MFU | 121869 tok/s step 17272/19560 | loss 3.246561 (-0.54z)| norm 0.2268 (+0.48z)| lr 2.97e-05 | 4520.21 ms | 29.9% bf16 MFU | 121575 tok/s step 17273/19560 | loss 3.275384 (+0.28z)| norm 0.2379 (+1.91z)| lr 2.96e-05 | 4471.43 ms | 30.2% bf16 MFU | 121359 tok/s step 17274/19560 | loss 3.211209 (-1.52z)| norm 0.2209 (-0.32z)| lr 2.96e-05 | 4220.12 ms | 32.0% bf16 MFU | 121503 tok/s step 17275/19560 | loss 3.279811 (+0.42z)| norm 0.2163 (-0.91z)| lr 2.96e-05 | 4200.49 ms | 32.1% bf16 MFU | 121668 tok/s step 17276/19560 | loss 3.244917 (-0.57z)| norm 0.2079 (-1.97z)| lr 2.95e-05 | 4214.28 ms | 32.0% bf16 MFU | 121805 tok/s step 17277/19560 | loss 3.204888 (-1.67z)| norm 0.2126 (-1.34z)| lr 2.95e-05 | 4317.01 ms | 31.3% bf16 MFU | 121787 tok/s step 17278/19560 | loss 3.270969 (+0.19z)| norm 0.2223 (-0.09z)| lr 2.95e-05 | 4202.75 ms | 32.1% bf16 MFU | 121935 tok/s step 17279/19560 | loss 3.222998 (-1.15z)| norm 0.2241 (+0.13z)| lr 2.95e-05 | 4227.07 ms | 31.9% bf16 MFU | 122040 tok/s step 17280/19560 | loss 3.222712 (-1.15z)| norm 0.2188 (-0.55z)| lr 2.94e-05 | 4277.76 ms | 31.6% bf16 MFU | 122066 tok/s step 17281/19560 | loss 3.310869 (+1.29z)| norm 0.2240 (+0.13z)| lr 2.94e-05 | 4217.18 ms | 32.0% bf16 MFU | 122179 tok/s step 17282/19560 | loss 3.267309 (+0.07z)| norm 0.2131 (-1.30z)| lr 2.94e-05 | 4170.48 ms | 32.4% bf16 MFU | 122356 tok/s step 17283/19560 | loss 3.293614 (+0.80z)| norm 0.2222 (-0.10z)| lr 2.94e-05 | 4211.85 ms | 32.1% bf16 MFU | 122462 tok/s step 17284/19560 | loss 3.279466 (+0.40z)| norm 0.2204 (-0.33z)| lr 2.93e-05 | 4198.36 ms | 32.2% bf16 MFU | 122583 tok/s step 
17285/19560 | loss 3.344397 (+2.16z)| norm 0.2233 (+0.04z)| lr 2.93e-05 | 4190.08 ms | 32.2% bf16 MFU | 122710 tok/s step 17286/19560 | loss 3.272862 (+0.19z)| norm 0.2130 (-1.32z)| lr 2.93e-05 | 4189.66 ms | 32.2% bf16 MFU | 122831 tok/s step 17287/19560 | loss 3.225471 (-1.11z)| norm 0.2203 (-0.35z)| lr 2.93e-05 | 4178.11 ms | 32.3% bf16 MFU | 122964 tok/s step 17288/19560 | loss 3.331124 (+1.76z)| norm 0.2153 (-1.02z)| lr 2.92e-05 | 4179.59 ms | 32.3% bf16 MFU | 123088 tok/s step 17289/19560 | loss 3.283812 (+0.47z)| norm 0.2358 (+1.67z)| lr 2.92e-05 | 4178.02 ms | 32.3% bf16 MFU | 123208 tok/s step 17290/19560 | loss 3.242952 (-0.65z)| norm 0.2208 (-0.31z)| lr 2.92e-05 | 4219.93 ms | 32.0% bf16 MFU | 123260 tok/s step 17291/19560 | loss 3.271821 (+0.14z)| norm 0.2194 (-0.49z)| lr 2.92e-05 | 4207.61 ms | 32.1% bf16 MFU | 123327 tok/s step 17292/19560 | loss 3.274649 (+0.21z)| norm 0.2188 (-0.57z)| lr 2.91e-05 | 4175.07 ms | 32.3% bf16 MFU | 123439 tok/s step 17293/19560 | loss 3.325210 (+1.57z)| norm 0.2332 (+1.33z)| lr 2.91e-05 | 4195.18 ms | 32.2% bf16 MFU | 123516 tok/s step 17294/19560 | loss 3.317472 (+1.34z)| norm 0.2204 (-0.36z)| lr 2.91e-05 | 4209.64 ms | 32.1% bf16 MFU | 123567 tok/s step 17295/19560 | loss 3.277193 (+0.27z)| norm 0.2127 (-1.35z)| lr 2.91e-05 | 4192.32 ms | 32.2% bf16 MFU | 123642 tok/s step 17296/19560 | loss 3.303633 (+0.98z)| norm 0.2170 (-0.78z)| lr 2.90e-05 | 4186.12 ms | 32.3% bf16 MFU | 123722 tok/s step 17297/19560 | loss 3.257875 (-0.26z)| norm 0.2283 (+0.69z)| lr 2.90e-05 | 4350.83 ms | 31.0% bf16 MFU | 123561 tok/s step 17298/19560 | loss 3.255638 (-0.32z)| norm 0.2167 (-0.83z)| lr 2.90e-05 | 4183.91 ms | 32.3% bf16 MFU | 123649 tok/s step 17299/19560 | loss 3.213496 (-1.44z)| norm 0.2146 (-1.11z)| lr 2.90e-05 | 4177.66 ms | 32.3% bf16 MFU | 123741 tok/s step 17300/19560 | loss 3.178954 (-2.31z)| norm 0.2218 (-0.13z)| lr 2.89e-05 | 4290.71 ms | 31.5% bf16 MFU | 123664 tok/s step 17301/19560 | loss 3.184464 (-2.11z)| norm 
0.2235 (+0.12z)| lr 2.89e-05 | 4212.16 ms | 32.1% bf16 MFU | 123704 tok/s step 17302/19560 | loss 3.380071 (+2.88z)| norm 0.2242 (+0.21z)| lr 2.89e-05 | 4179.18 ms | 32.3% bf16 MFU | 123791 tok/s step 17303/19560 | loss 3.224895 (-1.02z)| norm 0.2133 (-1.26z)| lr 2.89e-05 | 4191.50 ms | 32.2% bf16 MFU | 123856 tok/s step 17304/19560 | loss 3.284785 (+0.50z)| norm 0.2289 (+0.85z)| lr 2.88e-05 | 4178.39 ms | 32.3% bf16 MFU | 123937 tok/s step 17305/19560 | loss 3.260348 (-0.11z)| norm 0.2311 (+1.13z)| lr 2.88e-05 | 4193.29 ms | 32.2% bf16 MFU | 123992 tok/s step 17306/19560 | loss 3.214627 (-1.28z)| norm 0.2298 (+0.94z)| lr 2.88e-05 | 4177.22 ms | 32.3% bf16 MFU | 124068 tok/s step 17307/19560 | loss 3.224007 (-1.02z)| norm 0.2177 (-0.69z)| lr 2.88e-05 | 4206.26 ms | 32.1% bf16 MFU | 124096 tok/s step 17308/19560 | loss 3.287850 (+0.64z)| norm 0.2177 (-0.69z)| lr 2.87e-05 | 4199.15 ms | 32.2% bf16 MFU | 124134 tok/s step 17309/19560 | loss 3.257767 (-0.14z)| norm 0.2160 (-0.91z)| lr 2.87e-05 | 4187.92 ms | 32.2% bf16 MFU | 124187 tok/s step 17310/19560 | loss 3.216098 (-1.21z)| norm 0.2124 (-1.37z)| lr 2.87e-05 | 4185.13 ms | 32.3% bf16 MFU | 124242 tok/s step 17311/19560 | loss 3.254347 (-0.21z)| norm 0.2184 (-0.56z)| lr 2.87e-05 | 4206.44 ms | 32.1% bf16 MFU | 124261 tok/s step 17312/19560 | loss 3.269139 (+0.17z)| norm 0.2126 (-1.33z)| lr 2.86e-05 | 4175.81 ms | 32.3% bf16 MFU | 124326 tok/s step 17313/19560 | loss 3.209795 (-1.35z)| norm 0.2184 (-0.54z)| lr 2.86e-05 | 4186.29 ms | 32.3% bf16 MFU | 124372 tok/s step 17314/19560 | loss 3.269623 (+0.20z)| norm 0.2211 (-0.17z)| lr 2.86e-05 | 4179.83 ms | 32.3% bf16 MFU | 124425 tok/s step 17315/19560 | loss 3.238629 (-0.60z)| norm 0.2095 (-1.71z)| lr 2.86e-05 | 4194.18 ms | 32.2% bf16 MFU | 124454 tok/s step 17316/19560 | loss 3.251422 (-0.26z)| norm 0.2150 (-0.97z)| lr 2.85e-05 | 4181.21 ms | 32.3% bf16 MFU | 124501 tok/s step 17317/19560 | loss 3.223969 (-0.97z)| norm 0.2288 (+0.84z)| lr 2.85e-05 | 4184.42 ms | 
32.3% bf16 MFU | 124540 tok/s step 17318/19560 | loss 3.223425 (-0.98z)| norm 0.2244 (+0.25z)| lr 2.85e-05 | 4195.92 ms | 32.2% bf16 MFU | 124561 tok/s step 17319/19560 | loss 3.227496 (-0.86z)| norm 0.2129 (-1.25z)| lr 2.85e-05 | 4190.53 ms | 32.2% bf16 MFU | 124588 tok/s step 17320/19560 | loss 3.225955 (-0.89z)| norm 0.2244 (+0.25z)| lr 2.84e-05 | 4190.50 ms | 32.2% bf16 MFU | 124615 tok/s step 17321/19560 | loss 3.234566 (-0.66z)| norm 0.2158 (-0.87z)| lr 2.84e-05 | 4178.35 ms | 32.3% bf16 MFU | 124658 tok/s step 17322/19560 | loss 3.284478 (+0.63z)| norm 0.2166 (-0.75z)| lr 2.84e-05 | 4217.83 ms | 32.0% bf16 MFU | 124640 tok/s step 17323/19560 | loss 3.210218 (-1.29z)| norm 0.2047 (-2.25z)| lr 2.84e-05 | 4248.56 ms | 31.8% bf16 MFU | 124578 tok/s step 17324/19560 | loss 3.269207 (+0.23z)| norm 0.2277 (+0.69z)| lr 2.83e-05 | 4204.47 ms | 32.1% bf16 MFU | 124584 tok/s step 17325/19560 | loss 3.253051 (-0.18z)| norm 0.2259 (+0.45z)| lr 2.83e-05 | 4188.47 ms | 32.2% bf16 MFU | 124614 tok/s step 17326/19560 | loss 3.287904 (+0.71z)| norm 0.2198 (-0.33z)| lr 2.83e-05 | 4355.73 ms | 31.0% bf16 MFU | 124401 tok/s step 17327/19560 | loss 3.235779 (-0.62z)| norm 0.2129 (-1.22z)| lr 2.83e-05 | 4181.59 ms | 32.3% bf16 MFU | 124450 tok/s step 17328/19560 | loss 3.200479 (-1.53z)| norm 0.2168 (-0.70z)| lr 2.82e-05 | 4214.37 ms | 32.0% bf16 MFU | 124448 tok/s step 17329/19560 | loss 3.349593 (+2.32z)| norm 0.2181 (-0.52z)| lr 2.82e-05 | 4185.09 ms | 32.3% bf16 MFU | 124489 tok/s step 17330/19560 | loss 3.204139 (-1.39z)| norm 0.2280 (+0.78z)| lr 2.82e-05 | 4170.83 ms | 32.4% bf16 MFU | 124550 tok/s step 17331/19560 | loss 3.206863 (-1.30z)| norm 0.2038 (-2.33z)| lr 2.82e-05 | 4182.82 ms | 32.3% bf16 MFU | 124590 tok/s step 17332/19560 | loss 3.212152 (-1.16z)| norm 0.2156 (-0.81z)| lr 2.81e-05 | 4196.03 ms | 32.2% bf16 MFU | 124608 tok/s step 17333/19560 | loss 3.317650 (+1.53z)| norm 0.2396 (+2.33z)| lr 2.81e-05 | 4190.22 ms | 32.2% bf16 MFU | 124633 tok/s step 17334/19560 
| loss 3.267769 (+0.25z)| norm 0.2138 (-1.05z)| lr 2.81e-05 | 4186.10 ms | 32.3% bf16 MFU | 124664 tok/s step 17335/19560 | loss 3.243121 (-0.40z)| norm 0.2216 (-0.02z)| lr 2.81e-05 | 4171.24 ms | 32.4% bf16 MFU | 124715 tok/s step 17336/19560 | loss 3.245515 (-0.33z)| norm 0.2197 (-0.28z)| lr 2.80e-05 | 4173.27 ms | 32.4% bf16 MFU | 124761 tok/s step 17337/19560 | loss 3.269865 (+0.30z)| norm 0.2242 (+0.30z)| lr 2.80e-05 | 4187.22 ms | 32.2% bf16 MFU | 124784 tok/s step 17338/19560 | loss 3.247711 (-0.27z)| norm 0.2318 (+1.28z)| lr 2.80e-05 | 4168.17 ms | 32.4% bf16 MFU | 124834 tok/s step 17339/19560 | loss 3.290811 (+0.84z)| norm 0.2089 (-1.67z)| lr 2.80e-05 | 4185.49 ms | 32.3% bf16 MFU | 124855 tok/s step 17340/19560 | loss 3.248799 (-0.24z)| norm 0.2238 (+0.26z)| lr 2.79e-05 | 4187.49 ms | 32.2% bf16 MFU | 124872 tok/s step 17341/19560 | loss 3.300766 (+1.10z)| norm 0.2369 (+1.96z)| lr 2.79e-05 | 4183.53 ms | 32.3% bf16 MFU | 124895 tok/s step 17342/19560 | loss 3.313748 (+1.42z)| norm 0.2271 (+0.68z)| lr 2.79e-05 | 4222.92 ms | 32.0% bf16 MFU | 124858 tok/s step 17343/19560 | loss 3.271527 (+0.33z)| norm 0.2340 (+1.54z)| lr 2.79e-05 | 4170.10 ms | 32.4% bf16 MFU | 124901 tok/s step 17344/19560 | loss 3.229493 (-0.75z)| norm 0.2354 (+1.69z)| lr 2.78e-05 | 4183.65 ms | 32.3% bf16 MFU | 124922 tok/s step 17345/19560 | loss 3.300220 (+1.10z)| norm 0.2415 (+2.39z)| lr 2.78e-05 | 4185.60 ms | 32.3% bf16 MFU | 124939 tok/s step 17346/19560 | loss 3.235883 (-0.59z)| norm 0.2269 (+0.58z)| lr 2.78e-05 | 4195.36 ms | 32.2% bf16 MFU | 124940 tok/s step 17347/19560 | loss 3.235797 (-0.59z)| norm 0.2307 (+1.03z)| lr 2.78e-05 | 4289.28 ms | 31.5% bf16 MFU | 124805 tok/s step 17348/19560 | loss 3.285341 (+0.70z)| norm 0.2342 (+1.46z)| lr 2.77e-05 | 4172.47 ms | 32.4% bf16 MFU | 124847 tok/s step 17349/19560 | loss 3.273873 (+0.40z)| norm 0.2177 (-0.56z)| lr 2.77e-05 | 4189.28 ms | 32.2% bf16 MFU | 124863 tok/s step 17350/19560 | loss 3.176797 (-2.08z)| norm 0.2266 (+0.53z)| 
lr 2.77e-05 | 4174.99 ms | 32.3% bf16 MFU | 124898 tok/s step 17351/19560 | loss 3.249932 (-0.22z)| norm 0.2242 (+0.24z)| lr 2.77e-05 | 4174.27 ms | 32.3% bf16 MFU | 124933 tok/s step 17352/19560 | loss 3.211785 (-1.19z)| norm 0.2308 (+1.02z)| lr 2.76e-05 | 4168.07 ms | 32.4% bf16 MFU | 124976 tok/s step 17353/19560 | loss 3.296096 (+0.97z)| norm 0.2163 (-0.75z)| lr 2.76e-05 | 4188.70 ms | 32.2% bf16 MFU | 124986 tok/s step 17354/19560 | loss 3.270161 (+0.30z)| norm 0.2285 (+0.75z)| lr 2.76e-05 | 4180.79 ms | 32.3% bf16 MFU | 125007 tok/s step 17355/19560 | loss 3.317144 (+1.48z)| norm 0.2368 (+1.74z)| lr 2.76e-05 | 4198.93 ms | 32.2% bf16 MFU | 124999 tok/s step 17356/19560 | loss 3.205559 (-1.35z)| norm 0.2093 (-1.58z)| lr 2.75e-05 | 4190.64 ms | 32.2% bf16 MFU | 125005 tok/s step 17357/19560 | loss 3.307156 (+1.23z)| norm 0.2302 (+0.93z)| lr 2.75e-05 | 4179.06 ms | 32.3% bf16 MFU | 125027 tok/s step 17358/19560 | loss 3.224148 (-0.90z)| norm 0.2151 (-0.88z)| lr 2.75e-05 | 4187.73 ms | 32.2% bf16 MFU | 125036 tok/s step 17359/19560 | loss 3.211676 (-1.20z)| norm 0.2232 (+0.10z)| lr 2.75e-05 | 4197.98 ms | 32.2% bf16 MFU | 125029 tok/s step 17360/19560 | loss 3.241267 (-0.45z)| norm 0.2145 (-0.94z)| lr 2.74e-05 | 4309.57 ms | 31.3% bf16 MFU | 124860 tok/s step 17361/19560 | loss 3.288158 (+0.73z)| norm 0.2180 (-0.51z)| lr 2.74e-05 | 4192.45 ms | 32.2% bf16 MFU | 124870 tok/s step 17362/19560 | loss 3.268536 (+0.23z)| norm 0.2233 (+0.12z)| lr 2.74e-05 | 4172.58 ms | 32.4% bf16 MFU | 124909 tok/s step 17363/19560 | loss 3.223562 (-0.91z)| norm 0.2079 (-1.70z)| lr 2.74e-05 | 4179.57 ms | 32.3% bf16 MFU | 124935 tok/s step 17364/19560 | loss 3.258180 (-0.01z)| norm 0.2168 (-0.63z)| lr 2.73e-05 | 4181.60 ms | 32.3% bf16 MFU | 124958 tok/s step 17365/19560 | loss 3.282210 (+0.61z)| norm 0.2210 (-0.10z)| lr 2.73e-05 | 4190.39 ms | 32.2% bf16 MFU | 124966 tok/s step 17366/19560 | loss 3.214369 (-1.15z)| norm 0.2152 (-0.80z)| lr 2.73e-05 | 4180.80 ms | 32.3% bf16 MFU | 
124987 tok/s step 17367/19560 | loss 3.214809 (-1.12z)| norm 0.2188 (-0.36z)| lr 2.73e-05 | 4192.07 ms | 32.2% bf16 MFU | 124991 tok/s step 17368/19560 | loss 3.283705 (+0.65z)| norm 0.2145 (-0.86z)| lr 2.72e-05 | 4186.90 ms | 32.2% bf16 MFU | 125003 tok/s step 17369/19560 | loss 3.316603 (+1.48z)| norm 0.2201 (-0.18z)| lr 2.72e-05 | 4183.37 ms | 32.3% bf16 MFU | 125019 tok/s step 17370/19560 | loss 3.229141 (-0.75z)| norm 0.2228 (+0.15z)| lr 2.72e-05 | 4181.55 ms | 32.3% bf16 MFU | 125037 tok/s step 17371/19560 | loss 3.214252 (-1.12z)| norm 0.2138 (-0.95z)| lr 2.72e-05 | 4171.34 ms | 32.4% bf16 MFU | 125070 tok/s step 17372/19560 | loss 3.249423 (-0.23z)| norm 0.2273 (+0.71z)| lr 2.71e-05 | 4172.77 ms | 32.4% bf16 MFU | 125099 tok/s step 17373/19560 | loss 3.262200 (+0.09z)| norm 0.2272 (+0.68z)| lr 2.71e-05 | 4184.63 ms | 32.3% bf16 MFU | 125108 tok/s step 17374/19560 | loss 3.242349 (-0.41z)| norm 0.2192 (-0.31z)| lr 2.71e-05 | 4178.98 ms | 32.3% bf16 MFU | 125126 tok/s step 17375/19560 | loss 3.236552 (-0.55z)| norm 0.2236 (+0.25z)| lr 2.71e-05 | 4186.64 ms | 32.2% bf16 MFU | 125131 tok/s step 17376/19560 | loss 3.247799 (-0.25z)| norm 0.2179 (-0.46z)| lr 2.70e-05 | 4175.42 ms | 32.3% bf16 MFU | 125152 tok/s step 17377/19560 | loss 3.283748 (+0.66z)| norm 0.2160 (-0.70z)| lr 2.70e-05 | 4178.34 ms | 32.3% bf16 MFU | 125169 tok/s step 17378/19560 | loss 3.254210 (-0.09z)| norm 0.2215 (-0.02z)| lr 2.70e-05 | 4178.84 ms | 32.3% bf16 MFU | 125183 tok/s step 17379/19560 | loss 3.221997 (-0.92z)| norm 0.2578 (+4.13z)| lr 2.70e-05 | 4192.02 ms | 32.2% bf16 MFU | 125178 tok/s step 17380/19560 | loss 3.210267 (-1.21z)| norm 0.2209 (-0.13z)| lr 2.69e-05 | 4191.71 ms | 32.2% bf16 MFU | 125173 tok/s step 17381/19560 | loss 3.241676 (-0.39z)| norm 0.2193 (-0.31z)| lr 2.69e-05 | 4189.95 ms | 32.2% bf16 MFU | 125170 tok/s step 17382/19560 | loss 3.238202 (-0.48z)| norm 0.2101 (-1.37z)| lr 2.69e-05 | 4186.14 ms | 32.3% bf16 MFU | 125174 tok/s step 17383/19560 | loss 3.309509 
(+1.35z)| norm 0.2310 (+1.05z)| lr 2.69e-05 | 4176.78 ms | 32.3% bf16 MFU | 125192 tok/s step 17384/19560 | loss 3.296259 (+1.00z)| norm 0.2187 (-0.37z)| lr 2.69e-05 | 4176.83 ms | 32.3% bf16 MFU | 125208 tok/s step 17385/19560 | loss 3.245136 (-0.33z)| norm 0.2246 (+0.30z)| lr 2.68e-05 | 4189.13 ms | 32.2% bf16 MFU | 125205 tok/s step 17386/19560 | loss 3.299666 (+1.08z)| norm 0.2191 (-0.33z)| lr 2.68e-05 | 4175.77 ms | 32.3% bf16 MFU | 125223 tok/s step 17387/19560 | loss 3.248820 (-0.23z)| norm 0.1986 (-2.60z)| lr 2.68e-05 | 4186.17 ms | 32.3% bf16 MFU | 125224 tok/s step 17388/19560 | loss 3.286136 (+0.74z)| norm 0.2188 (-0.33z)| lr 2.68e-05 | 4179.76 ms | 32.3% bf16 MFU | 125234 tok/s step 17389/19560 | loss 3.286444 (+0.75z)| norm 0.2093 (-1.37z)| lr 2.67e-05 | 4195.93 ms | 32.2% bf16 MFU | 125220 tok/s step 17390/19560 | loss 3.191468 (-1.69z)| norm 0.2199 (-0.17z)| lr 2.67e-05 | 4191.61 ms | 32.2% bf16 MFU | 125213 tok/s step 17391/19560 | loss 3.255920 (-0.04z)| norm 0.2105 (-1.28z)| lr 2.67e-05 | 4197.34 ms | 32.2% bf16 MFU | 125198 tok/s step 17392/19560 | loss 3.246021 (-0.31z)| norm 0.2137 (-0.88z)| lr 2.67e-05 | 4193.25 ms | 32.2% bf16 MFU | 125190 tok/s step 17393/19560 | loss 3.281734 (+0.64z)| norm 0.2124 (-1.03z)| lr 2.66e-05 | 4193.22 ms | 32.2% bf16 MFU | 125182 tok/s step 17394/19560 | loss 3.210428 (-1.23z)| norm 0.2160 (-0.61z)| lr 2.66e-05 | 4176.14 ms | 32.3% bf16 MFU | 125200 tok/s step 17395/19560 | loss 3.327832 (+1.84z)| norm 0.2229 (+0.21z)| lr 2.66e-05 | 4178.86 ms | 32.3% bf16 MFU | 125213 tok/s step 17396/19560 | loss 3.252899 (-0.12z)| norm 0.2121 (-1.06z)| lr 2.66e-05 | 4175.07 ms | 32.3% bf16 MFU | 125231 tok/s step 17397/19560 | loss 3.250302 (-0.18z)| norm 0.2233 (+0.27z)| lr 2.65e-05 | 4228.94 ms | 31.9% bf16 MFU | 125168 tok/s step 17398/19560 | loss 3.248176 (-0.23z)| norm 0.2121 (-1.04z)| lr 2.65e-05 | 4173.17 ms | 32.4% bf16 MFU | 125192 tok/s step 17399/19560 | loss 3.182372 (-1.92z)| norm 0.2126 (-0.97z)| lr 2.65e-05 | 
4202.15 ms | 32.1% bf16 MFU | 125170 tok/s step 17400/19560 | loss 3.286593 (+0.78z)| norm 0.2161 (-0.54z)| lr 2.65e-05 | 4193.01 ms | 32.2% bf16 MFU | 125164 tok/s step 17401/19560 | loss 3.268804 (+0.32z)| norm 0.2219 (+0.17z)| lr 2.64e-05 | 4184.62 ms | 32.3% bf16 MFU | 125170 tok/s step 17402/19560 | loss 3.252763 (-0.10z)| norm 0.2134 (-0.86z)| lr 2.64e-05 | 4182.02 ms | 32.3% bf16 MFU | 125180 tok/s step 17403/19560 | loss 3.303097 (+1.20z)| norm 0.2084 (-1.44z)| lr 2.64e-05 | 4199.14 ms | 32.2% bf16 MFU | 125164 tok/s step 17404/19560 | loss 3.265001 (+0.21z)| norm 0.2103 (-1.21z)| lr 2.64e-05 | 4177.78 ms | 32.3% bf16 MFU | 125180 tok/s step 17405/19560 | loss 3.287149 (+0.77z)| norm 0.2283 (+0.93z)| lr 2.63e-05 | 4182.83 ms | 32.3% bf16 MFU | 125188 tok/s step 17406/19560 | loss 3.266917 (+0.24z)| norm 0.2276 (+0.84z)| lr 2.63e-05 | 4182.23 ms | 32.3% bf16 MFU | 125197 tok/s step 17407/19560 | loss 3.238484 (-0.50z)| norm 0.2112 (-1.11z)| lr 2.63e-05 | 4211.23 ms | 32.1% bf16 MFU | 125162 tok/s step 17408/19560 | loss 3.290780 (+0.85z)| norm 0.2191 (-0.16z)| lr 2.63e-05 | 4170.59 ms | 32.4% bf16 MFU | 125190 tok/s step 17409/19560 | loss 3.265333 (+0.20z)| norm 0.2198 (-0.08z)| lr 2.62e-05 | 4165.15 ms | 32.4% bf16 MFU | 125224 tok/s step 17410/19560 | loss 3.280696 (+0.60z)| norm 0.2219 (+0.17z)| lr 2.62e-05 | 4181.93 ms | 32.3% bf16 MFU | 125231 tok/s step 17411/19560 | loss 3.274190 (+0.43z)| norm 0.2209 (+0.04z)| lr 2.62e-05 | 4172.87 ms | 32.4% bf16 MFU | 125252 tok/s step 17412/19560 | loss 3.219049 (-1.01z)| norm 0.2212 (+0.07z)| lr 2.62e-05 | 4170.20 ms | 32.4% bf16 MFU | 125275 tok/s step 17413/19560 | loss 3.230671 (-0.69z)| norm 0.2164 (-0.48z)| lr 2.61e-05 | 4193.53 ms | 32.2% bf16 MFU | 125263 tok/s step 17414/19560 | loss 3.241467 (-0.40z)| norm 0.2200 (-0.06z)| lr 2.61e-05 | 4190.75 ms | 32.2% bf16 MFU | 125255 tok/s step 17415/19560 | loss 3.251810 (-0.12z)| norm 0.2215 (+0.11z)| lr 2.61e-05 | 4170.13 ms | 32.4% bf16 MFU | 125278 tok/s step 
17416/19560 | loss 3.311402 (+1.50z)| norm 0.2444 (+2.75z)| lr 2.61e-05 | 4185.13 ms | 32.3% bf16 MFU | 125278 tok/s step 17417/19560 | loss 3.289856 (+0.91z)| norm 0.2196 (-0.12z)| lr 2.61e-05 | 4167.28 ms | 32.4% bf16 MFU | 125305 tok/s step 17418/19560 | loss 3.231569 (-0.67z)| norm 0.2170 (-0.43z)| lr 2.60e-05 | 4176.04 ms | 32.3% bf16 MFU | 125317 tok/s step 17419/19560 | loss 3.303073 (+1.25z)| norm 0.2081 (-1.46z)| lr 2.60e-05 | 4169.94 ms | 32.4% bf16 MFU | 125337 tok/s step 17420/19560 | loss 3.272963 (+0.44z)| norm 0.2158 (-0.55z)| lr 2.60e-05 | 4175.66 ms | 32.3% bf16 MFU | 125348 tok/s step 17421/19560 | loss 3.221993 (-0.92z)| norm 0.2140 (-0.75z)| lr 2.60e-05 | 4181.32 ms | 32.3% bf16 MFU | 125350 tok/s step 17422/19560 | loss 3.278379 (+0.63z)| norm 0.2283 (+0.92z)| lr 2.59e-05 | 4192.58 ms | 32.2% bf16 MFU | 125335 tok/s step 17423/19560 | loss 3.261345 (+0.17z)| norm 0.2144 (-0.70z)| lr 2.59e-05 | 4166.29 ms | 32.4% bf16 MFU | 125361 tok/s step 17424/19560 | loss 3.261471 (+0.18z)| norm 0.2219 (+0.17z)| lr 2.59e-05 | 4187.21 ms | 32.2% bf16 MFU | 125353 tok/s step 17425/19560 | loss 3.225058 (-0.82z)| norm 0.2090 (-1.32z)| lr 2.59e-05 | 4188.06 ms | 32.2% bf16 MFU | 125345 tok/s step 17426/19560 | loss 3.306983 (+1.42z)| norm 0.2205 (+0.01z)| lr 2.58e-05 | 4182.05 ms | 32.3% bf16 MFU | 125346 tok/s step 17427/19560 | loss 3.311545 (+1.52z)| norm 0.2228 (+0.28z)| lr 2.58e-05 | 4170.76 ms | 32.4% bf16 MFU | 125364 tok/s step 17428/19560 | loss 3.237782 (-0.51z)| norm 0.2137 (-0.78z)| lr 2.58e-05 | 4180.93 ms | 32.3% bf16 MFU | 125366 tok/s step 17429/19560 | loss 3.410016 (+4.01z)| norm 0.2231 (+0.32z)| lr 2.58e-05 | 4188.74 ms | 32.2% bf16 MFU | 125356 tok/s step 17430/19560 | loss 3.216179 (-1.11z)| norm 0.2098 (-1.21z)| lr 2.57e-05 | 4176.22 ms | 32.3% bf16 MFU | 125365 tok/s step 17431/19560 | loss 3.205438 (-1.39z)| norm 0.2141 (-0.71z)| lr 2.57e-05 | 4172.14 ms | 32.4% bf16 MFU | 125380 tok/s step 17432/19560 | loss 3.260375 (+0.11z)| norm 
0.2227 (+0.30z)| lr 2.57e-05 | 4191.73 ms | 32.2% bf16 MFU | 125365 tok/s step 17433/19560 | loss 3.283755 (+0.74z)| norm 0.2154 (-0.55z)| lr 2.57e-05 | 4177.97 ms | 32.3% bf16 MFU | 125371 tok/s step 17434/19560 | loss 3.251285 (-0.15z)| norm 0.2187 (-0.15z)| lr 2.56e-05 | 4177.34 ms | 32.3% bf16 MFU | 125378 tok/s step 17435/19560 | loss 3.292811 (+0.97z)| norm 0.2193 (-0.08z)| lr 2.56e-05 | 4176.72 ms | 32.3% bf16 MFU | 125385 tok/s step 17436/19560 | loss 3.259102 (+0.05z)| norm 0.2243 (+0.50z)| lr 2.56e-05 | 4184.61 ms | 32.3% bf16 MFU | 125380 tok/s step 17437/19560 | loss 3.269272 (+0.33z)| norm 0.2124 (-0.89z)| lr 2.56e-05 | 4178.81 ms | 32.3% bf16 MFU | 125385 tok/s step 17438/19560 | loss 3.250391 (-0.20z)| norm 0.2158 (-0.50z)| lr 2.56e-05 | 4188.37 ms | 32.2% bf16 MFU | 125374 tok/s step 17439/19560 | loss 3.243109 (-0.40z)| norm 0.2191 (-0.11z)| lr 2.55e-05 | 4186.42 ms | 32.3% bf16 MFU | 125367 tok/s step 17440/19560 | loss 3.247586 (-0.27z)| norm 0.2114 (-1.02z)| lr 2.55e-05 | 4179.77 ms | 32.3% bf16 MFU | 125371 tok/s step 17441/19560 | loss 3.258336 (+0.02z)| norm 0.2367 (+1.92z)| lr 2.55e-05 | 4192.54 ms | 32.2% bf16 MFU | 125355 tok/s step 17442/19560 | loss 3.240817 (-0.46z)| norm 0.2107 (-1.09z)| lr 2.55e-05 | 4185.89 ms | 32.3% bf16 MFU | 125350 tok/s step 17443/19560 | loss 3.233510 (-0.66z)| norm 0.2154 (-0.56z)| lr 2.54e-05 | 4197.74 ms | 32.2% bf16 MFU | 125327 tok/s step 17444/19560 | loss 3.239325 (-0.50z)| norm 0.2197 (-0.06z)| lr 2.54e-05 | 4183.61 ms | 32.3% bf16 MFU | 125327 tok/s step 17445/19560 | loss 3.297566 (+1.10z)| norm 0.2214 (+0.15z)| lr 2.54e-05 | 4168.91 ms | 32.4% bf16 MFU | 125348 tok/s step 17446/19560 | loss 3.248298 (-0.27z)| norm 0.2183 (-0.20z)| lr 2.54e-05 | 4172.59 ms | 32.4% bf16 MFU | 125363 tok/s step 17447/19560 | loss 3.182975 (-2.05z)| norm 0.2194 (-0.09z)| lr 2.53e-05 | 4185.67 ms | 32.3% bf16 MFU | 125358 tok/s step 17448/19560 | loss 3.329321 (+1.91z)| norm 0.2280 (+0.92z)| lr 2.53e-05 | 4173.52 ms | 
32.4% bf16 MFU | 125371 tok/s step 17449/19560 | loss 3.226889 (-0.85z)| norm 0.2261 (+0.68z)| lr 2.53e-05 | 4188.00 ms | 32.2% bf16 MFU | 125362 tok/s step 17450/19560 | loss 3.334426 (+2.01z)| norm 0.2202 (-0.01z)| lr 2.53e-05 | 4177.65 ms | 32.3% bf16 MFU | 125369 tok/s step 17451/19560 | loss 3.241688 (-0.47z)| norm 0.2167 (-0.44z)| lr 2.52e-05 | 4178.49 ms | 32.3% bf16 MFU | 125374 tok/s step 17452/19560 | loss 3.253896 (-0.14z)| norm 0.2260 (+0.67z)| lr 2.52e-05 | 4198.38 ms | 32.2% bf16 MFU | 125349 tok/s step 17453/19560 | loss 3.281420 (+0.59z)| norm 0.2188 (-0.17z)| lr 2.52e-05 | 4181.94 ms | 32.3% bf16 MFU | 125350 tok/s step 17454/19560 | loss 3.290977 (+0.85z)| norm 0.2243 (+0.48z)| lr 2.52e-05 | 4184.79 ms | 32.3% bf16 MFU | 125347 tok/s step 17455/19560 | loss 3.339329 (+2.09z)| norm 0.2139 (-0.76z)| lr 2.51e-05 | 4183.94 ms | 32.3% bf16 MFU | 125345 tok/s step 17456/19560 | loss 3.291063 (+0.80z)| norm 0.2297 (+1.10z)| lr 2.51e-05 | 4181.98 ms | 32.3% bf16 MFU | 125346 tok/s step 17457/19560 | loss 3.231954 (-0.75z)| norm 0.2318 (+1.32z)| lr 2.51e-05 | 4182.70 ms | 32.3% bf16 MFU | 125346 tok/s step 17458/19560 | loss 3.249745 (-0.29z)| norm 0.2210 (+0.06z)| lr 2.51e-05 | 4204.31 ms | 32.1% bf16 MFU | 125314 tok/s step 17459/19560 | loss 3.280133 (+0.53z)| norm 0.2179 (-0.32z)| lr 2.51e-05 | 4887.96 ms | 27.6% bf16 MFU | 124412 tok/s step 17460/19560 | loss 3.329434 (+1.85z)| norm 0.2199 (-0.09z)| lr 2.50e-05 | 4639.44 ms | 29.1% bf16 MFU | 123841 tok/s step 17461/19560 | loss 3.311287 (+1.36z)| norm 0.2311 (+1.28z)| lr 2.50e-05 | 4685.83 ms | 28.8% bf16 MFU | 123244 tok/s step 17462/19560 | loss 3.177962 (-2.23z)| norm 0.2177 (-0.35z)| lr 2.50e-05 | 4466.18 ms | 30.2% bf16 MFU | 122951 tok/s step 17463/19560 | loss 3.253198 (-0.21z)| norm 0.2188 (-0.21z)| lr 2.50e-05 | 4494.07 ms | 30.0% bf16 MFU | 122637 tok/s step 17464/19560 | loss 3.258122 (-0.08z)| norm 0.2174 (-0.38z)| lr 2.49e-05 | 4208.13 ms | 32.1% bf16 MFU | 122734 tok/s step 17465/19560 
| loss 3.332906 (+1.89z)| norm 0.2176 (-0.35z)| lr 2.49e-05 | 4354.73 ms | 31.0% bf16 MFU | 122617 tok/s step 17466/19560 | loss 3.258284 (-0.09z)| norm 0.2300 (+1.16z)| lr 2.49e-05 | 4305.48 ms | 31.4% bf16 MFU | 122575 tok/s step 17467/19560 | loss 3.353971 (+2.38z)| norm 0.2280 (+0.90z)| lr 2.49e-05 | 4259.23 ms | 31.7% bf16 MFU | 122601 tok/s step 17468/19560 | loss 3.280330 (+0.46z)| norm 0.2192 (-0.17z)| lr 2.48e-05 | 4288.28 ms | 31.5% bf16 MFU | 122584 tok/s step 17469/19560 | loss 3.305710 (+1.12z)| norm 0.2251 (+0.58z)| lr 2.48e-05 | 4175.55 ms | 32.3% bf16 MFU | 122733 tok/s step 17470/19560 | loss 3.234767 (-0.71z)| norm 0.2300 (+1.17z)| lr 2.48e-05 | 4175.96 ms | 32.3% bf16 MFU | 122874 tok/s step 17471/19560 | loss 3.266640 (+0.13z)| norm 0.2515 (+3.66z)| lr 2.48e-05 | 4230.31 ms | 31.9% bf16 MFU | 122927 tok/s step 17472/19560 | loss 3.249866 (-0.32z)| norm 0.2289 (+0.99z)| lr 2.47e-05 | 4230.45 ms | 31.9% bf16 MFU | 122977 tok/s step 17473/19560 | loss 3.305606 (+1.14z)| norm 0.2813 (+6.18z)| lr 2.47e-05 | 4411.33 ms | 30.6% bf16 MFU | 122771 tok/s step 17474/19560 | loss 3.242523 (-0.51z)| norm 0.2180 (-0.30z)| lr 2.47e-05 | 4161.07 ms | 32.4% bf16 MFU | 122932 tok/s step 17475/19560 | loss 3.262298 (+0.00z)| norm 0.2191 (-0.18z)| lr 2.47e-05 | 4234.48 ms | 31.9% bf16 MFU | 122976 tok/s step 17476/19560 | loss 3.257179 (-0.13z)| norm 0.2264 (+0.59z)| lr 2.47e-05 | 4429.07 ms | 30.5% bf16 MFU | 122746 tok/s step 17477/19560 | loss 3.330498 (+1.76z)| norm 0.2537 (+3.24z)| lr 2.46e-05 | 4199.60 ms | 32.2% bf16 MFU | 122851 tok/s step 17478/19560 | loss 3.283524 (+0.53z)| norm 0.2239 (+0.29z)| lr 2.46e-05 | 4170.44 ms | 32.4% bf16 MFU | 122994 tok/s step 17479/19560 | loss 3.241300 (-0.58z)| norm 0.2190 (-0.20z)| lr 2.46e-05 | 4208.37 ms | 32.1% bf16 MFU | 123074 tok/s step 17480/19560 | loss 3.283931 (+0.53z)| norm 0.2198 (-0.10z)| lr 2.46e-05 | 4175.25 ms | 32.3% bf16 MFU | 123198 tok/s step 17481/19560 | loss 3.256302 (-0.19z)| norm 0.2287 (+0.77z)| 
lr 2.45e-05 | 4291.62 ms | 31.5% bf16 MFU | 123147 tok/s step 17482/19560 | loss 3.386092 (+3.11z)| norm 0.2291 (+0.81z)| lr 2.45e-05 | 4260.51 ms | 31.7% bf16 MFU | 123142 tok/s step 17483/19560 | loss 3.243086 (-0.53z)| norm 0.2165 (-0.43z)| lr 2.45e-05 | 4172.91 ms | 32.4% bf16 MFU | 123267 tok/s step 17484/19560 | loss 3.277089 (+0.33z)| norm 0.2191 (-0.18z)| lr 2.45e-05 | 4159.02 ms | 32.5% bf16 MFU | 123407 tok/s step 17485/19560 | loss 3.320477 (+1.45z)| norm 0.2232 (+0.24z)| lr 2.44e-05 | 4183.38 ms | 32.3% bf16 MFU | 123503 tok/s step 17486/19560 | loss 3.270876 (+0.16z)| norm 0.2139 (-0.70z)| lr 2.44e-05 | 4176.17 ms | 32.3% bf16 MFU | 123605 tok/s step 17487/19560 | loss 3.247885 (-0.45z)| norm 0.2320 (+1.12z)| lr 2.44e-05 | 4311.09 ms | 31.3% bf16 MFU | 123505 tok/s step 17488/19560 | loss 3.275266 (+0.26z)| norm 0.2184 (-0.25z)| lr 2.44e-05 | 4172.23 ms | 32.4% bf16 MFU | 123613 tok/s step 17489/19560 | loss 3.272187 (+0.18z)| norm 0.2257 (+0.48z)| lr 2.43e-05 | 4170.05 ms | 32.4% bf16 MFU | 123719 tok/s step 17490/19560 | loss 3.341138 (+1.94z)| norm 0.2338 (+1.28z)| lr 2.43e-05 | 4173.25 ms | 32.4% bf16 MFU | 123814 tok/s step 17491/19560 | loss 3.295554 (+0.75z)| norm 0.2219 (+0.07z)| lr 2.43e-05 | 4231.62 ms | 31.9% bf16 MFU | 123819 tok/s step 17492/19560 | loss 3.259150 (-0.19z)| norm 0.2144 (-0.68z)| lr 2.43e-05 | 4229.92 ms | 31.9% bf16 MFU | 123825 tok/s step 17493/19560 | loss 3.297655 (+0.80z)| norm 0.2298 (+0.86z)| lr 2.43e-05 | 4213.80 ms | 32.0% bf16 MFU | 123855 tok/s step 17494/19560 | loss 3.255380 (-0.30z)| norm 0.2298 (+0.85z)| lr 2.42e-05 | 4301.33 ms | 31.4% bf16 MFU | 123757 tok/s step 17495/19560 | loss 3.295280 (+0.73z)| norm 0.2238 (+0.24z)| lr 2.42e-05 | 4237.85 ms | 31.9% bf16 MFU | 123754 tok/s step 17496/19560 | loss 3.268528 (+0.03z)| norm 0.2185 (-0.29z)| lr 2.42e-05 | 4178.13 ms | 32.3% bf16 MFU | 123841 tok/s step 17497/19560 | loss 3.291631 (+0.64z)| norm 0.2184 (-0.30z)| lr 2.42e-05 | 4206.97 ms | 32.1% bf16 MFU | 
123880 tok/s step 17498/19560 | loss 3.247238 (-0.53z)| norm 0.2157 (-0.56z)| lr 2.41e-05 | 4378.98 ms | 30.8% bf16 MFU | 123672 tok/s step 17499/19560 | loss 3.266842 (-0.02z)| norm 0.2184 (-0.30z)| lr 2.41e-05 | 4207.18 ms | 32.1% bf16 MFU | 123720 tok/s step 17500/19560 | loss 3.255356 (-0.33z)| norm 0.2345 (+1.31z)| lr 2.41e-05 | 4335.56 ms | 31.1% bf16 MFU | 123580 tok/s val loss 3.257124 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 
evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 
evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3054/10042 = 0.304123 step 17501/19560 | loss 3.271155 (+0.09z)| norm 0.2161 (-0.53z)| lr 2.41e-05 | 4508.20 ms | 29.9% bf16 MFU | 123216 tok/s step 17502/19560 | loss 3.240775 (-0.72z)| norm 0.2207 (-0.07z)| lr 2.40e-05 | 4214.38 ms | 32.0% bf16 MFU | 123275 tok/s step 17503/19560 | loss 3.282556 (+0.38z)| norm 0.2102 (-1.10z)| lr 2.40e-05 | 4178.30 ms | 32.3% bf16 MFU | 123386 tok/s step 17504/19560 | loss 3.236536 (-0.84z)| norm 0.2313 (+0.99z)| lr 2.40e-05 | 4178.74 ms | 32.3% bf16 MFU | 123490 tok/s step 17505/19560 | loss 3.241509 (-0.70z)| norm 0.2253 (+0.39z)| lr 2.40e-05 | 4238.60 ms | 31.9% bf16 MFU | 123500 tok/s step 17506/19560 | loss 3.297862 (+0.79z)| norm 0.2167 (-0.46z)| lr 2.40e-05 | 4250.05 ms | 31.8% bf16 MFU | 123493 tok/s step 17507/19560 | loss 3.286020 (+0.46z)| norm 0.2185 (-0.27z)| lr 2.39e-05 | 4181.62 ms | 32.3% bf16 MFU | 123587 tok/s step 17508/19560 | loss 3.252110 (-0.45z)| norm 0.2180 (-0.31z)| lr 2.39e-05 | 4236.98 ms | 31.9% bf16 MFU | 123595 tok/s step 17509/19560 | loss 3.293883 (+0.66z)| norm 0.2135 (-0.78z)| lr 2.39e-05 | 4247.36 ms | 31.8% bf16 MFU | 123587 tok/s step 17510/19560 | loss 3.270713 (+0.03z)| norm 0.2123 (-0.91z)| lr 2.39e-05 | 4181.97 ms | 32.3% bf16 MFU | 123676 tok/s step 17511/19560 | loss 3.250638 (-0.50z)| norm 0.2136 (-0.77z)| lr 2.38e-05 | 4278.01 ms | 31.6% bf16 MFU | 123620 tok/s step 17512/19560 | loss 3.267168 (-0.05z)| norm 0.2261 (+0.54z)| lr 2.38e-05 | 4184.76 ms | 32.3% bf16 MFU | 123703 tok/s step 17513/19560 | loss 3.294846 (+0.69z)| norm 0.2140 (-0.72z)| lr 2.38e-05 | 4185.29 ms | 32.3% bf16 MFU | 123782 tok/s step 17514/19560 | loss 3.273656 (+0.12z)| norm 0.2138 
(-0.73z)| lr 2.38e-05 | 4213.85 ms | 32.0% bf16 MFU | 123813 tok/s step 17515/19560 | loss 3.318423 (+1.32z)| norm 0.2154 (-0.59z)| lr 2.37e-05 | 4187.99 ms | 32.2% bf16 MFU | 123882 tok/s step 17516/19560 | loss 3.328762 (+1.57z)| norm 0.2329 (+1.26z)| lr 2.37e-05 | 4175.68 ms | 32.3% bf16 MFU | 123966 tok/s step 17517/19560 | loss 3.293793 (+0.64z)| norm 0.2202 (-0.10z)| lr 2.37e-05 | 4165.60 ms | 32.4% bf16 MFU | 124061 tok/s step 17518/19560 | loss 3.303887 (+0.89z)| norm 0.2161 (-0.54z)| lr 2.37e-05 | 5546.77 ms | 24.3% bf16 MFU | 122584 tok/s step 17519/19560 | loss 3.224796 (-1.24z)| norm 0.2213 (+0.01z)| lr 2.37e-05 | 4174.45 ms | 32.3% bf16 MFU | 122734 tok/s step 17520/19560 | loss 3.244993 (-0.69z)| norm 0.2188 (-0.27z)| lr 2.36e-05 | 4170.59 ms | 32.4% bf16 MFU | 122883 tok/s step 17521/19560 | loss 3.222089 (-1.29z)| norm 0.2229 (+0.17z)| lr 2.36e-05 | 4161.49 ms | 32.4% bf16 MFU | 123038 tok/s step 17522/19560 | loss 3.220802 (-1.33z)| norm 0.2268 (+0.58z)| lr 2.36e-05 | 4181.00 ms | 32.3% bf16 MFU | 123156 tok/s step 17523/19560 | loss 3.303036 (+0.89z)| norm 0.2180 (-0.36z)| lr 2.36e-05 | 4219.28 ms | 32.0% bf16 MFU | 123211 tok/s step 17524/19560 | loss 3.276591 (+0.17z)| norm 0.2188 (-0.29z)| lr 2.35e-05 | 4189.21 ms | 32.2% bf16 MFU | 123308 tok/s step 17525/19560 | loss 3.271850 (+0.04z)| norm 0.2216 (+0.02z)| lr 2.35e-05 | 4206.57 ms | 32.1% bf16 MFU | 123375 tok/s step 17526/19560 | loss 3.277492 (+0.18z)| norm 0.2162 (-0.56z)| lr 2.35e-05 | 4178.28 ms | 32.3% bf16 MFU | 123480 tok/s step 17527/19560 | loss 3.241325 (-0.83z)| norm 0.2244 (+0.31z)| lr 2.35e-05 | 4197.90 ms | 32.2% bf16 MFU | 123551 tok/s step 17528/19560 | loss 3.233517 (-1.03z)| norm 0.2172 (-0.47z)| lr 2.35e-05 | 4173.31 ms | 32.4% bf16 MFU | 123655 tok/s step 17529/19560 | loss 3.318854 (+1.31z)| norm 0.2204 (-0.12z)| lr 2.34e-05 | 4173.24 ms | 32.4% bf16 MFU | 123753 tok/s step 17530/19560 | loss 3.227306 (-1.19z)| norm 0.2391 (+1.87z)| lr 2.34e-05 | 4203.91 ms | 32.1% bf16 
MFU | 123801 tok/s step 17531/19560 | loss 3.282543 (+0.32z)| norm 0.2139 (-0.85z)| lr 2.34e-05 | 4187.71 ms | 32.2% bf16 MFU | 123871 tok/s step 17532/19560 | loss 3.270781 (-0.00z)| norm 0.2110 (-1.16z)| lr 2.34e-05 | 4227.89 ms | 31.9% bf16 MFU | 123878 tok/s step 17533/19560 | loss 3.239707 (-0.84z)| norm 0.2105 (-1.19z)| lr 2.33e-05 | 4180.18 ms | 32.3% bf16 MFU | 123955 tok/s step 17534/19560 | loss 3.256921 (-0.37z)| norm 0.2138 (-0.83z)| lr 2.33e-05 | 4173.20 ms | 32.4% bf16 MFU | 124039 tok/s step 17535/19560 | loss 3.271240 (+0.02z)| norm 0.2231 (+0.16z)| lr 2.33e-05 | 4195.89 ms | 32.2% bf16 MFU | 124085 tok/s step 17536/19560 | loss 3.296335 (+0.70z)| norm 0.2177 (-0.42z)| lr 2.33e-05 | 4215.20 ms | 32.0% bf16 MFU | 124100 tok/s step 17537/19560 | loss 3.279549 (+0.24z)| norm 0.2128 (-0.94z)| lr 2.32e-05 | 4176.24 ms | 32.3% bf16 MFU | 124172 tok/s step 17538/19560 | loss 3.261747 (-0.24z)| norm 0.2106 (-1.17z)| lr 2.32e-05 | 4176.33 ms | 32.3% bf16 MFU | 124240 tok/s step 17539/19560 | loss 3.213408 (-1.54z)| norm 0.2248 (+0.36z)| lr 2.32e-05 | 4175.55 ms | 32.3% bf16 MFU | 124306 tok/s step 17540/19560 | loss 3.237491 (-0.89z)| norm 0.2111 (-1.10z)| lr 2.32e-05 | 4209.92 ms | 32.1% bf16 MFU | 124318 tok/s step 17541/19560 | loss 3.222846 (-1.29z)| norm 0.2075 (-1.46z)| lr 2.32e-05 | 4168.65 ms | 32.4% bf16 MFU | 124390 tok/s step 17542/19560 | loss 3.270430 (-0.00z)| norm 0.2151 (-0.66z)| lr 2.31e-05 | 4176.85 ms | 32.3% bf16 MFU | 124447 tok/s step 17543/19560 | loss 3.275332 (+0.13z)| norm 0.2216 (+0.03z)| lr 2.31e-05 | 4184.74 ms | 32.3% bf16 MFU | 124489 tok/s step 17544/19560 | loss 3.247010 (-0.63z)| norm 0.2193 (-0.19z)| lr 2.31e-05 | 4252.05 ms | 31.8% bf16 MFU | 124429 tok/s step 17545/19560 | loss 3.228017 (-1.14z)| norm 0.2078 (-1.42z)| lr 2.31e-05 | 4173.61 ms | 32.4% bf16 MFU | 124489 tok/s step 17546/19560 | loss 3.207273 (-1.68z)| norm 0.2116 (-1.00z)| lr 2.30e-05 | 4181.12 ms | 32.3% bf16 MFU | 124534 tok/s step 17547/19560 | loss 
3.273328 (+0.11z)| norm 0.2155 (-0.60z)| lr 2.30e-05 | 4172.62 ms | 32.4% bf16 MFU | 124590 tok/s step 17548/19560 | loss 3.245994 (-0.62z)| norm 0.2244 (+0.36z)| lr 2.30e-05 | 4173.64 ms | 32.4% bf16 MFU | 124641 tok/s step 17549/19560 | loss 3.248704 (-0.56z)| norm 0.2176 (-0.38z)| lr 2.30e-05 | 4188.01 ms | 32.2% bf16 MFU | 124669 tok/s step 17550/19560 | loss 3.314579 (+1.22z)| norm 0.2195 (-0.17z)| lr 2.30e-05 | 4213.30 ms | 32.0% bf16 MFU | 124657 tok/s step 17551/19560 | loss 3.306753 (+1.00z)| norm 0.2174 (-0.40z)| lr 2.29e-05 | 4171.81 ms | 32.4% bf16 MFU | 124708 tok/s step 17552/19560 | loss 3.252379 (-0.47z)| norm 0.2179 (-0.34z)| lr 2.29e-05 | 4207.27 ms | 32.1% bf16 MFU | 124703 tok/s step 17553/19560 | loss 3.293601 (+0.63z)| norm 0.2283 (+0.77z)| lr 2.29e-05 | 4201.58 ms | 32.1% bf16 MFU | 124707 tok/s step 17554/19560 | loss 3.252376 (-0.48z)| norm 0.2196 (-0.17z)| lr 2.29e-05 | 4180.86 ms | 32.3% bf16 MFU | 124742 tok/s step 17555/19560 | loss 3.273728 (+0.11z)| norm 0.2179 (-0.35z)| lr 2.28e-05 | 4167.14 ms | 32.4% bf16 MFU | 124796 tok/s step 17556/19560 | loss 3.245860 (-0.65z)| norm 0.2186 (-0.29z)| lr 2.28e-05 | 4179.23 ms | 32.3% bf16 MFU | 124828 tok/s step 17557/19560 | loss 3.185873 (-2.34z)| norm 0.2193 (-0.21z)| lr 2.28e-05 | 4165.40 ms | 32.4% bf16 MFU | 124880 tok/s step 17558/19560 | loss 3.251975 (-0.47z)| norm 0.2102 (-1.20z)| lr 2.28e-05 | 4194.89 ms | 32.2% bf16 MFU | 124885 tok/s step 17559/19560 | loss 3.236752 (-0.92z)| norm 0.2227 (+0.16z)| lr 2.27e-05 | 4170.30 ms | 32.4% bf16 MFU | 124927 tok/s step 17560/19560 | loss 3.232936 (-1.02z)| norm 0.2125 (-0.94z)| lr 2.27e-05 | 4187.76 ms | 32.2% bf16 MFU | 124941 tok/s step 17561/19560 | loss 3.300520 (+0.93z)| norm 0.2238 (+0.28z)| lr 2.27e-05 | 4187.91 ms | 32.2% bf16 MFU | 124953 tok/s step 17562/19560 | loss 3.239401 (-0.83z)| norm 0.2041 (-1.83z)| lr 2.27e-05 | 4188.87 ms | 32.2% bf16 MFU | 124964 tok/s step 17563/19560 | loss 3.240137 (-0.80z)| norm 0.2140 (-0.76z)| lr 
2.27e-05 | 4167.24 ms | 32.4% bf16 MFU | 125006 tok/s step 17564/19560 | loss 3.268893 (+0.03z)| norm 0.2472 (+2.70z)| lr 2.26e-05 | 4177.21 ms | 32.3% bf16 MFU | 125031 tok/s step 17565/19560 | loss 3.279282 (+0.32z)| norm 0.2054 (-1.63z)| lr 2.26e-05 | 4169.69 ms | 32.4% bf16 MFU | 125067 tok/s step 17566/19560 | loss 3.230509 (-1.07z)| norm 0.2106 (-1.09z)| lr 2.26e-05 | 4183.79 ms | 32.3% bf16 MFU | 125079 tok/s step 17567/19560 | loss 3.231657 (-1.03z)| norm 0.2142 (-0.71z)| lr 2.26e-05 | 4190.13 ms | 32.2% bf16 MFU | 125081 tok/s step 17568/19560 | loss 3.298876 (+0.88z)| norm 0.2349 (+1.40z)| lr 2.25e-05 | 4183.53 ms | 32.3% bf16 MFU | 125093 tok/s step 17569/19560 | loss 3.343928 (+2.10z)| norm 0.2280 (+0.70z)| lr 2.25e-05 | 4179.54 ms | 32.3% bf16 MFU | 125111 tok/s step 17570/19560 | loss 3.274964 (+0.16z)| norm 0.2156 (-0.58z)| lr 2.25e-05 | 4290.52 ms | 31.5% bf16 MFU | 124965 tok/s step 17571/19560 | loss 3.350422 (+2.22z)| norm 0.2181 (-0.33z)| lr 2.25e-05 | 4190.89 ms | 32.2% bf16 MFU | 124972 tok/s step 17572/19560 | loss 3.255520 (-0.41z)| norm 0.2134 (-0.81z)| lr 2.25e-05 | 4182.10 ms | 32.3% bf16 MFU | 124991 tok/s step 17573/19560 | loss 3.308117 (+1.05z)| norm 0.2209 (-0.03z)| lr 2.24e-05 | 4173.43 ms | 32.4% bf16 MFU | 125023 tok/s step 17574/19560 | loss 3.314167 (+1.20z)| norm 0.2171 (-0.42z)| lr 2.24e-05 | 4179.59 ms | 32.3% bf16 MFU | 125044 tok/s step 17575/19560 | loss 3.303582 (+0.90z)| norm 0.2231 (+0.19z)| lr 2.24e-05 | 4168.67 ms | 32.4% bf16 MFU | 125080 tok/s step 17576/19560 | loss 3.255480 (-0.44z)| norm 0.2133 (-0.81z)| lr 2.24e-05 | 4179.73 ms | 32.3% bf16 MFU | 125098 tok/s step 17577/19560 | loss 3.288700 (+0.49z)| norm 0.2093 (-1.20z)| lr 2.23e-05 | 4174.43 ms | 32.3% bf16 MFU | 125123 tok/s step 17578/19560 | loss 3.269136 (-0.06z)| norm 0.2042 (-1.70z)| lr 2.23e-05 | 4193.63 ms | 32.2% bf16 MFU | 125118 tok/s step 17579/19560 | loss 3.196403 (-2.11z)| norm 0.2187 (-0.23z)| lr 2.23e-05 | 4194.67 ms | 32.2% bf16 MFU | 125111 
tok/s step 17580/19560 | loss 3.271024 (+0.00z)| norm 0.2251 (+0.43z)| lr 2.23e-05 | 4176.25 ms | 32.3% bf16 MFU | 125133 tok/s step 17581/19560 | loss 3.243346 (-0.77z)| norm 0.2118 (-0.92z)| lr 2.23e-05 | 4176.68 ms | 32.3% bf16 MFU | 125152 tok/s step 17582/19560 | loss 3.255783 (-0.41z)| norm 0.2143 (-0.66z)| lr 2.22e-05 | 4181.74 ms | 32.3% bf16 MFU | 125164 tok/s step 17583/19560 | loss 3.274343 (+0.13z)| norm 0.2213 (+0.05z)| lr 2.22e-05 | 4174.39 ms | 32.3% bf16 MFU | 125185 tok/s step 17584/19560 | loss 3.282211 (+0.36z)| norm 0.2225 (+0.18z)| lr 2.22e-05 | 4194.66 ms | 32.2% bf16 MFU | 125175 tok/s step 17585/19560 | loss 3.297024 (+0.78z)| norm 0.2126 (-0.82z)| lr 2.22e-05 | 4168.32 ms | 32.4% bf16 MFU | 125206 tok/s step 17586/19560 | loss 3.294301 (+0.69z)| norm 0.2068 (-1.39z)| lr 2.21e-05 | 4176.61 ms | 32.3% bf16 MFU | 125222 tok/s step 17587/19560 | loss 3.280836 (+0.30z)| norm 0.2570 (+3.50z)| lr 2.21e-05 | 4180.52 ms | 32.3% bf16 MFU | 125231 tok/s step 17588/19560 | loss 3.237894 (-0.93z)| norm 0.2180 (-0.27z)| lr 2.21e-05 | 4191.94 ms | 32.2% bf16 MFU | 125223 tok/s step 17589/19560 | loss 3.299067 (+0.86z)| norm 0.2402 (+1.84z)| lr 2.21e-05 | 4169.70 ms | 32.4% bf16 MFU | 125249 tok/s step 17590/19560 | loss 3.268780 (-0.05z)| norm 0.2266 (+0.54z)| lr 2.21e-05 | 4184.76 ms | 32.3% bf16 MFU | 125251 tok/s step 17591/19560 | loss 3.212822 (-1.71z)| norm 0.2184 (-0.24z)| lr 2.20e-05 | 4206.07 ms | 32.1% bf16 MFU | 125221 tok/s step 17592/19560 | loss 3.306657 (+1.07z)| norm 0.2235 (+0.24z)| lr 2.20e-05 | 4164.12 ms | 32.4% bf16 MFU | 125255 tok/s step 17593/19560 | loss 3.218631 (-1.52z)| norm 0.2244 (+0.32z)| lr 2.20e-05 | 4172.38 ms | 32.4% bf16 MFU | 125275 tok/s step 17594/19560 | loss 3.308708 (+1.15z)| norm 0.2215 (+0.05z)| lr 2.20e-05 | 4191.99 ms | 32.2% bf16 MFU | 125265 tok/s step 17595/19560 | loss 3.315583 (+1.39z)| norm 0.2206 (-0.02z)| lr 2.19e-05 | 4187.42 ms | 32.2% bf16 MFU | 125262 tok/s step 17596/19560 | loss 3.292539 
(+0.69z)| norm 0.2151 (-0.55z)| lr 2.19e-05 | 4183.86 ms | 32.3% bf16 MFU | 125264 tok/s step 17597/19560 | loss 3.300942 (+0.94z)| norm 0.2215 (+0.06z)| lr 2.19e-05 | 4200.88 ms | 32.1% bf16 MFU | 125241 tok/s step 17598/19560 | loss 3.298049 (+0.84z)| norm 0.2105 (-0.98z)| lr 2.19e-05 | 4184.41 ms | 32.3% bf16 MFU | 125244 tok/s step 17599/19560 | loss 3.269912 (-0.01z)| norm 0.2209 (+0.04z)| lr 2.19e-05 | 4184.36 ms | 32.3% bf16 MFU | 125247 tok/s step 17600/19560 | loss 3.293933 (+0.70z)| norm 0.2076 (-1.25z)| lr 2.18e-05 | 4314.48 ms | 31.3% bf16 MFU | 125060 tok/s step 17601/19560 | loss 3.270372 (-0.00z)| norm 0.2162 (-0.41z)| lr 2.18e-05 | 4179.60 ms | 32.3% bf16 MFU | 125079 tok/s step 17602/19560 | loss 3.265714 (-0.15z)| norm 0.2113 (-0.98z)| lr 2.18e-05 | 4165.59 ms | 32.4% bf16 MFU | 125118 tok/s step 17603/19560 | loss 3.252665 (-0.54z)| norm 0.2216 (+0.22z)| lr 2.18e-05 | 4198.25 ms | 32.2% bf16 MFU | 125107 tok/s step 17604/19560 | loss 3.244438 (-0.79z)| norm 0.2099 (-1.13z)| lr 2.17e-05 | 4177.36 ms | 32.3% bf16 MFU | 125127 tok/s step 17605/19560 | loss 3.255560 (-0.44z)| norm 0.2325 (+1.60z)| lr 2.17e-05 | 4179.36 ms | 32.3% bf16 MFU | 125143 tok/s step 17606/19560 | loss 3.323085 (+1.61z)| norm 0.2255 (+0.74z)| lr 2.17e-05 | 4173.28 ms | 32.4% bf16 MFU | 125167 tok/s step 17607/19560 | loss 3.295194 (+0.75z)| norm 0.2101 (-1.14z)| lr 2.17e-05 | 4185.81 ms | 32.3% bf16 MFU | 125171 tok/s step 17608/19560 | loss 3.340012 (+2.07z)| norm 0.2337 (+1.71z)| lr 2.17e-05 | 4183.75 ms | 32.3% bf16 MFU | 125179 tok/s step 17609/19560 | loss 3.311064 (+1.18z)| norm 0.2181 (-0.16z)| lr 2.16e-05 | 4169.34 ms | 32.4% bf16 MFU | 125207 tok/s step 17610/19560 | loss 3.310988 (+1.25z)| norm 0.2130 (-0.76z)| lr 2.16e-05 | 4175.96 ms | 32.3% bf16 MFU | 125224 tok/s step 17611/19560 | loss 3.233118 (-1.17z)| norm 0.2108 (-1.03z)| lr 2.16e-05 | 4178.90 ms | 32.3% bf16 MFU | 125236 tok/s step 17612/19560 | loss 3.290161 (+0.60z)| norm 0.2189 (-0.04z)| lr 2.16e-05 | 
4163.83 ms | 32.4% bf16 MFU | 125270 tok/s step 17613/19560 | loss 3.226369 (-1.36z)| norm 0.2130 (-0.74z)| lr 2.15e-05 | 4191.87 ms | 32.2% bf16 MFU | 125260 tok/s step 17614/19560 | loss 3.297874 (+0.85z)| norm 0.2289 (+1.16z)| lr 2.15e-05 | 4181.71 ms | 32.3% bf16 MFU | 125266 tok/s step 17615/19560 | loss 3.290059 (+0.60z)| norm 0.2188 (-0.04z)| lr 2.15e-05 | 4179.77 ms | 32.3% bf16 MFU | 125274 tok/s step 17616/19560 | loss 3.289106 (+0.57z)| norm 0.2203 (+0.13z)| lr 2.15e-05 | 4192.24 ms | 32.2% bf16 MFU | 125264 tok/s step 17617/19560 | loss 3.254159 (-0.51z)| norm 0.2174 (-0.21z)| lr 2.15e-05 | 4189.65 ms | 32.2% bf16 MFU | 125257 tok/s step 17618/19560 | loss 3.215000 (-1.70z)| norm 0.2112 (-0.95z)| lr 2.14e-05 | 4172.21 ms | 32.4% bf16 MFU | 125278 tok/s step 17619/19560 | loss 3.250010 (-0.60z)| norm 0.2185 (-0.05z)| lr 2.14e-05 | 4169.25 ms | 32.4% bf16 MFU | 125301 tok/s step 17620/19560 | loss 3.264523 (-0.15z)| norm 0.2192 (+0.03z)| lr 2.14e-05 | 4183.86 ms | 32.3% bf16 MFU | 125302 tok/s step 17621/19560 | loss 3.240707 (-0.88z)| norm 0.2171 (-0.22z)| lr 2.14e-05 | 4237.82 ms | 31.9% bf16 MFU | 125223 tok/s step 17622/19560 | loss 3.247832 (-0.65z)| norm 0.2103 (-1.06z)| lr 2.14e-05 | 4192.50 ms | 32.2% bf16 MFU | 125214 tok/s step 17623/19560 | loss 3.276187 (+0.24z)| norm 0.2073 (-1.39z)| lr 2.13e-05 | 4167.53 ms | 32.4% bf16 MFU | 125244 tok/s step 17624/19560 | loss 3.415465 (+4.22z)| norm 0.2307 (+1.48z)| lr 2.13e-05 | 4174.36 ms | 32.3% bf16 MFU | 125261 tok/s step 17625/19560 | loss 3.293251 (+0.68z)| norm 0.2169 (-0.21z)| lr 2.13e-05 | 4206.99 ms | 32.1% bf16 MFU | 125229 tok/s step 17626/19560 | loss 3.306436 (+1.05z)| norm 0.2187 (-0.00z)| lr 2.13e-05 | 4176.85 ms | 32.3% bf16 MFU | 125244 tok/s step 17627/19560 | loss 3.301533 (+0.89z)| norm 0.2302 (+1.39z)| lr 2.12e-05 | 4212.45 ms | 32.1% bf16 MFU | 125205 tok/s step 17628/19560 | loss 3.253688 (-0.49z)| norm 0.2156 (-0.38z)| lr 2.12e-05 | 4196.40 ms | 32.2% bf16 MFU | 125192 tok/s step 
17629/19560 | loss 3.285816 (+0.44z)| norm 0.2100 (-1.06z)| lr 2.12e-05 | 4175.04 ms | 32.3% bf16 MFU | 125211 tok/s step 17630/19560 | loss 3.291162 (+0.58z)| norm 0.2135 (-0.63z)| lr 2.12e-05 | 4230.62 ms | 31.9% bf16 MFU | 125147 tok/s step 17631/19560 | loss 3.294142 (+0.66z)| norm 0.2250 (+0.79z)| lr 2.12e-05 | 4183.45 ms | 32.3% bf16 MFU | 125155 tok/s step 17632/19560 | loss 3.160884 (-3.05z)| norm 0.2184 (-0.02z)| lr 2.11e-05 | 4172.94 ms | 32.4% bf16 MFU | 125180 tok/s step 17633/19560 | loss 3.299035 (+0.78z)| norm 0.2155 (-0.37z)| lr 2.11e-05 | 4178.89 ms | 32.3% bf16 MFU | 125194 tok/s step 17634/19560 | loss 3.268227 (-0.07z)| norm 0.2154 (-0.38z)| lr 2.11e-05 | 4174.67 ms | 32.3% bf16 MFU | 125213 tok/s step 17635/19560 | loss 3.259273 (-0.31z)| norm 0.2089 (-1.17z)| lr 2.11e-05 | 4197.69 ms | 32.2% bf16 MFU | 125198 tok/s step 17636/19560 | loss 3.279356 (+0.24z)| norm 0.2098 (-1.06z)| lr 2.10e-05 | 4173.46 ms | 32.4% bf16 MFU | 125219 tok/s step 17637/19560 | loss 3.255040 (-0.43z)| norm 0.2187 (+0.04z)| lr 2.10e-05 | 4182.47 ms | 32.3% bf16 MFU | 125226 tok/s step 17638/19560 | loss 3.290120 (+0.54z)| norm 0.2136 (-0.59z)| lr 2.10e-05 | 4234.45 ms | 31.9% bf16 MFU | 125155 tok/s step 17639/19560 | loss 3.273968 (+0.09z)| norm 0.2201 (+0.20z)| lr 2.10e-05 | 4193.51 ms | 32.2% bf16 MFU | 125149 tok/s step 17640/19560 | loss 3.277682 (+0.19z)| norm 0.2181 (-0.04z)| lr 2.10e-05 | 4202.54 ms | 32.1% bf16 MFU | 125129 tok/s step 17641/19560 | loss 3.236855 (-0.94z)| norm 0.2117 (-0.83z)| lr 2.09e-05 | 4178.22 ms | 32.3% bf16 MFU | 125147 tok/s step 17642/19560 | loss 3.249393 (-0.58z)| norm 0.2153 (-0.38z)| lr 2.09e-05 | 4184.90 ms | 32.3% bf16 MFU | 125153 tok/s step 17643/19560 | loss 3.226864 (-1.19z)| norm 0.2106 (-0.96z)| lr 2.09e-05 | 4180.76 ms | 32.3% bf16 MFU | 125166 tok/s step 17644/19560 | loss 3.255189 (-0.39z)| norm 0.2071 (-1.37z)| lr 2.09e-05 | 4187.73 ms | 32.2% bf16 MFU | 125167 tok/s step 17645/19560 | loss 3.269251 (+0.01z)| norm 
0.2127 (-0.66z)| lr 2.09e-05 | 4178.68 ms | 32.3% bf16 MFU | 125182 tok/s step 17646/19560 | loss 3.269076 (+0.02z)| norm 0.2213 (+0.40z)| lr 2.08e-05 | 4183.75 ms | 32.3% bf16 MFU | 125189 tok/s step 17647/19560 | loss 3.213329 (-1.56z)| norm 0.2120 (-0.74z)| lr 2.08e-05 | 4182.60 ms | 32.3% bf16 MFU | 125197 tok/s step 17648/19560 | loss 3.301781 (+0.93z)| norm 0.2083 (-1.19z)| lr 2.08e-05 | 4175.09 ms | 32.3% bf16 MFU | 125216 tok/s step 17649/19560 | loss 3.243879 (-0.71z)| norm 0.2159 (-0.25z)| lr 2.08e-05 | 4451.24 ms | 30.3% bf16 MFU | 124844 tok/s step 17650/19560 | loss 3.299818 (+0.86z)| norm 0.2145 (-0.41z)| lr 2.07e-05 | 4672.60 ms | 28.9% bf16 MFU | 124212 tok/s step 17651/19560 | loss 3.311500 (+1.19z)| norm 0.2125 (-0.65z)| lr 2.07e-05 | 4587.91 ms | 29.4% bf16 MFU | 123716 tok/s step 17652/19560 | loss 3.288373 (+0.53z)| norm 0.2150 (-0.34z)| lr 2.07e-05 | 4383.62 ms | 30.8% bf16 MFU | 123510 tok/s step 17653/19560 | loss 3.245278 (-0.69z)| norm 0.2148 (-0.36z)| lr 2.07e-05 | 4527.86 ms | 29.8% bf16 MFU | 123124 tok/s step 17654/19560 | loss 3.279048 (+0.27z)| norm 0.2185 (+0.11z)| lr 2.07e-05 | 4366.89 ms | 30.9% bf16 MFU | 122971 tok/s step 17655/19560 | loss 3.291111 (+0.60z)| norm 0.2102 (-0.91z)| lr 2.06e-05 | 4240.55 ms | 31.8% bf16 MFU | 123004 tok/s step 17656/19560 | loss 3.241053 (-0.82z)| norm 0.2057 (-1.45z)| lr 2.06e-05 | 4302.41 ms | 31.4% bf16 MFU | 122947 tok/s step 17657/19560 | loss 3.319077 (+1.39z)| norm 0.2143 (-0.38z)| lr 2.06e-05 | 4266.60 ms | 31.6% bf16 MFU | 122944 tok/s step 17658/19560 | loss 3.253598 (-0.48z)| norm 0.2174 (+0.02z)| lr 2.06e-05 | 4300.15 ms | 31.4% bf16 MFU | 122893 tok/s step 17659/19560 | loss 3.298318 (+0.80z)| norm 0.2390 (+2.64z)| lr 2.06e-05 | 4224.46 ms | 32.0% bf16 MFU | 122953 tok/s step 17660/19560 | loss 3.344377 (+2.06z)| norm 0.2242 (+0.82z)| lr 2.05e-05 | 4173.09 ms | 32.4% bf16 MFU | 123087 tok/s step 17661/19560 | loss 3.334716 (+1.75z)| norm 0.2174 (-0.02z)| lr 2.05e-05 | 4162.12 ms | 
32.4% bf16 MFU | 123231 tok/s step 17662/19560 | loss 3.253284 (-0.51z)| norm 0.2228 (+0.63z)| lr 2.05e-05 | 4259.42 ms | 31.7% bf16 MFU | 123224 tok/s step 17663/19560 | loss 3.292649 (+0.58z)| norm 0.2168 (-0.10z)| lr 2.05e-05 | 4156.89 ms | 32.5% bf16 MFU | 123369 tok/s step 17664/19560 | loss 3.268404 (-0.09z)| norm 0.2199 (+0.28z)| lr 2.04e-05 | 4162.07 ms | 32.4% bf16 MFU | 123499 tok/s step 17665/19560 | loss 3.275956 (+0.12z)| norm 0.2061 (-1.41z)| lr 2.04e-05 | 4160.23 ms | 32.5% bf16 MFU | 123626 tok/s step 17666/19560 | loss 3.345988 (+2.02z)| norm 0.2150 (-0.33z)| lr 2.04e-05 | 4187.50 ms | 32.2% bf16 MFU | 123704 tok/s step 17667/19560 | loss 3.253654 (-0.52z)| norm 0.2184 (+0.09z)| lr 2.04e-05 | 4231.86 ms | 31.9% bf16 MFU | 123714 tok/s step 17668/19560 | loss 3.312146 (+1.08z)| norm 0.2434 (+3.05z)| lr 2.04e-05 | 4186.07 ms | 32.3% bf16 MFU | 123790 tok/s step 17669/19560 | loss 3.371190 (+2.62z)| norm 0.2157 (-0.27z)| lr 2.03e-05 | 4166.02 ms | 32.4% bf16 MFU | 123893 tok/s step 17670/19560 | loss 3.286494 (+0.33z)| norm 0.2050 (-1.53z)| lr 2.03e-05 | 4181.31 ms | 32.3% bf16 MFU | 123968 tok/s step 17671/19560 | loss 3.282519 (+0.22z)| norm 0.2179 (+0.01z)| lr 2.03e-05 | 4167.28 ms | 32.4% bf16 MFU | 124060 tok/s step 17672/19560 | loss 3.282573 (+0.21z)| norm 0.2140 (-0.45z)| lr 2.03e-05 | 4185.42 ms | 32.3% bf16 MFU | 124120 tok/s step 17673/19560 | loss 3.383770 (+2.84z)| norm 0.2138 (-0.48z)| lr 2.03e-05 | 4163.21 ms | 32.4% bf16 MFU | 124211 tok/s step 17674/19560 | loss 3.262784 (-0.36z)| norm 0.2176 (-0.03z)| lr 2.02e-05 | 4208.29 ms | 32.1% bf16 MFU | 124230 tok/s step 17675/19560 | loss 3.244762 (-0.84z)| norm 0.2147 (-0.37z)| lr 2.02e-05 | 4157.82 ms | 32.5% bf16 MFU | 124323 tok/s step 17676/19560 | loss 3.239749 (-0.97z)| norm 0.2130 (-0.56z)| lr 2.02e-05 | 4179.67 ms | 32.3% bf16 MFU | 124379 tok/s step 17677/19560 | loss 3.279284 (+0.08z)| norm 0.2146 (-0.37z)| lr 2.02e-05 | 4160.61 ms | 32.5% bf16 MFU | 124460 tok/s step 17678/19560 
| loss 3.259736 (-0.43z)| norm 0.2130 (-0.56z)| lr 2.01e-05 | 4169.20 ms | 32.4% bf16 MFU | 124525 tok/s step 17679/19560 | loss 3.277262 (+0.04z)| norm 0.2189 (+0.14z)| lr 2.01e-05 | 4159.38 ms | 32.5% bf16 MFU | 124601 tok/s step 17680/19560 | loss 3.312388 (+0.97z)| norm 0.2195 (+0.21z)| lr 2.01e-05 | 4168.25 ms | 32.4% bf16 MFU | 124660 tok/s step 17681/19560 | loss 3.255301 (-0.55z)| norm 0.2105 (-0.85z)| lr 2.01e-05 | 4176.76 ms | 32.3% bf16 MFU | 124704 tok/s step 17682/19560 | loss 3.202671 (-1.92z)| norm 0.2140 (-0.42z)| lr 2.01e-05 | 4164.21 ms | 32.4% bf16 MFU | 124764 tok/s step 17683/19560 | loss 3.311585 (+0.94z)| norm 0.2203 (+0.33z)| lr 2.00e-05 | 4174.79 ms | 32.3% bf16 MFU | 124805 tok/s step 17684/19560 | loss 3.351680 (+1.95z)| norm 0.2293 (+1.39z)| lr 2.00e-05 | 4167.54 ms | 32.4% bf16 MFU | 124854 tok/s step 17685/19560 | loss 3.277807 (+0.01z)| norm 0.2168 (-0.09z)| lr 2.00e-05 | 4161.09 ms | 32.4% bf16 MFU | 124912 tok/s step 17686/19560 | loss 3.330137 (+1.38z)| norm 0.2229 (+0.61z)| lr 2.00e-05 | 4158.12 ms | 32.5% bf16 MFU | 124970 tok/s step 17687/19560 | loss 3.235762 (-1.11z)| norm 0.2199 (+0.26z)| lr 2.00e-05 | 4164.16 ms | 32.4% bf16 MFU | 125017 tok/s step 17688/19560 | loss 3.286873 (+0.23z)| norm 0.2171 (-0.08z)| lr 1.99e-05 | 4172.14 ms | 32.4% bf16 MFU | 125049 tok/s step 17689/19560 | loss 3.237316 (-1.07z)| norm 0.2055 (-1.43z)| lr 1.99e-05 | 4157.22 ms | 32.5% bf16 MFU | 125103 tok/s step 17690/19560 | loss 3.255053 (-0.61z)| norm 0.2038 (-1.63z)| lr 1.99e-05 | 4162.69 ms | 32.4% bf16 MFU | 125145 tok/s step 17691/19560 | loss 3.382782 (+2.68z)| norm 0.2268 (+1.07z)| lr 1.99e-05 | 4168.00 ms | 32.4% bf16 MFU | 125177 tok/s step 17692/19560 | loss 3.285909 (+0.17z)| norm 0.2280 (+1.30z)| lr 1.99e-05 | 4172.12 ms | 32.4% bf16 MFU | 125202 tok/s step 17693/19560 | loss 3.286305 (+0.18z)| norm 0.2233 (+0.70z)| lr 1.98e-05 | 4164.44 ms | 32.4% bf16 MFU | 125236 tok/s step 17694/19560 | loss 3.299294 (+0.51z)| norm 0.2303 (+1.54z)| 
lr 1.98e-05 | 4158.15 ms | 32.5% bf16 MFU | 125279 tok/s step 17695/19560 | loss 3.233710 (-1.20z)| norm 0.2167 (-0.14z)| lr 1.98e-05 | 4171.95 ms | 32.4% bf16 MFU | 125298 tok/s step 17696/19560 | loss 3.314193 (+0.89z)| norm 0.2242 (+0.81z)| lr 1.98e-05 | 4163.09 ms | 32.4% bf16 MFU | 125330 tok/s step 17697/19560 | loss 3.320874 (+1.07z)| norm 0.2092 (-1.05z)| lr 1.97e-05 | 4168.16 ms | 32.4% bf16 MFU | 125353 tok/s step 17698/19560 | loss 3.260911 (-0.49z)| norm 0.2186 (+0.13z)| lr 1.97e-05 | 4172.62 ms | 32.4% bf16 MFU | 125368 tok/s step 17699/19560 | loss 3.286782 (+0.20z)| norm 0.2168 (-0.11z)| lr 1.97e-05 | 4162.43 ms | 32.4% bf16 MFU | 125397 tok/s step 17700/19560 | loss 3.274780 (-0.12z)| norm 0.2168 (-0.11z)| lr 1.97e-05 | 4160.32 ms | 32.5% bf16 MFU | 125429 tok/s step 17701/19560 | loss 3.259393 (-0.52z)| norm 0.2224 (+0.59z)| lr 1.97e-05 | 4165.66 ms | 32.4% bf16 MFU | 125450 tok/s step 17702/19560 | loss 3.324839 (+1.21z)| norm 0.2389 (+2.57z)| lr 1.96e-05 | 4170.42 ms | 32.4% bf16 MFU | 125463 tok/s step 17703/19560 | loss 3.293011 (+0.37z)| norm 0.2140 (-0.46z)| lr 1.96e-05 | 4158.10 ms | 32.5% bf16 MFU | 125495 tok/s step 17704/19560 | loss 3.243676 (-0.93z)| norm 0.2182 (+0.04z)| lr 1.96e-05 | 4156.68 ms | 32.5% bf16 MFU | 125526 tok/s step 17705/19560 | loss 3.276139 (-0.07z)| norm 0.2109 (-0.85z)| lr 1.96e-05 | 4176.17 ms | 32.3% bf16 MFU | 125527 tok/s step 17706/19560 | loss 3.262281 (-0.43z)| norm 0.2065 (-1.39z)| lr 1.96e-05 | 4160.45 ms | 32.5% bf16 MFU | 125552 tok/s step 17707/19560 | loss 3.251716 (-0.74z)| norm 0.2163 (-0.19z)| lr 1.95e-05 | 4162.34 ms | 32.4% bf16 MFU | 125572 tok/s step 17708/19560 | loss 3.257618 (-0.57z)| norm 0.2037 (-1.70z)| lr 1.95e-05 | 4174.42 ms | 32.3% bf16 MFU | 125573 tok/s step 17709/19560 | loss 3.304453 (+0.67z)| norm 0.2188 (+0.13z)| lr 1.95e-05 | 4168.42 ms | 32.4% bf16 MFU | 125583 tok/s step 17710/19560 | loss 3.333189 (+1.42z)| norm 0.2195 (+0.22z)| lr 1.95e-05 | 4169.96 ms | 32.4% bf16 MFU | 
125591 tok/s step 17711/19560 | loss 3.290865 (+0.28z)| norm 0.2132 (-0.54z)| lr 1.95e-05 | 4167.21 ms | 32.4% bf16 MFU | 125602 tok/s step 17712/19560 | loss 3.209148 (-1.86z)| norm 0.2116 (-0.73z)| lr 1.94e-05 | 4180.29 ms | 32.3% bf16 MFU | 125593 tok/s step 17713/19560 | loss 3.264963 (-0.38z)| norm 0.2187 (+0.13z)| lr 1.94e-05 | 4172.85 ms | 32.4% bf16 MFU | 125595 tok/s step 17714/19560 | loss 3.264407 (-0.39z)| norm 0.2174 (-0.04z)| lr 1.94e-05 | 4165.39 ms | 32.4% bf16 MFU | 125609 tok/s step 17715/19560 | loss 3.338661 (+1.54z)| norm 0.2171 (-0.05z)| lr 1.94e-05 | 4350.79 ms | 31.0% bf16 MFU | 125354 tok/s step 17716/19560 | loss 3.282097 (+0.06z)| norm 0.2196 (+0.29z)| lr 1.93e-05 | 4153.98 ms | 32.5% bf16 MFU | 125397 tok/s step 17717/19560 | loss 3.219644 (-1.55z)| norm 0.2146 (-0.38z)| lr 1.93e-05 | 4133.71 ms | 32.7% bf16 MFU | 125468 tok/s step 17718/19560 | loss 3.256702 (-0.59z)| norm 0.2105 (-0.94z)| lr 1.93e-05 | 4171.95 ms | 32.4% bf16 MFU | 125478 tok/s step 17719/19560 | loss 3.281003 (+0.03z)| norm 0.2137 (-0.47z)| lr 1.93e-05 | 4148.86 ms | 32.5% bf16 MFU | 125523 tok/s step 17720/19560 | loss 3.345264 (+1.70z)| norm 0.2315 (+2.02z)| lr 1.93e-05 | 4140.98 ms | 32.6% bf16 MFU | 125577 tok/s step 17721/19560 | loss 3.245753 (-0.91z)| norm 0.2173 (+0.04z)| lr 1.92e-05 | 4141.20 ms | 32.6% bf16 MFU | 125629 tok/s step 17722/19560 | loss 3.315997 (+0.93z)| norm 0.2137 (-0.46z)| lr 1.92e-05 | 4144.61 ms | 32.6% bf16 MFU | 125672 tok/s step 17723/19560 | loss 3.340192 (+1.55z)| norm 0.2229 (+0.83z)| lr 1.92e-05 | 4160.33 ms | 32.5% bf16 MFU | 125690 tok/s step 17724/19560 | loss 3.286402 (+0.15z)| norm 0.2089 (-1.13z)| lr 1.92e-05 | 4145.75 ms | 32.6% bf16 MFU | 125728 tok/s step 17725/19560 | loss 3.383754 (+2.60z)| norm 0.2260 (+1.26z)| lr 1.92e-05 | 4154.70 ms | 32.5% bf16 MFU | 125751 tok/s step 17726/19560 | loss 3.304306 (+0.59z)| norm 0.2151 (-0.27z)| lr 1.91e-05 | 4152.89 ms | 32.5% bf16 MFU | 125776 tok/s step 17727/19560 | loss 3.272869 
(-0.21z)| norm 0.2177 (+0.09z)| lr 1.91e-05 | 4152.03 ms | 32.5% bf16 MFU | 125801 tok/s step 17728/19560 | loss 3.272999 (-0.20z)| norm 0.2228 (+0.80z)| lr 1.91e-05 | 4156.83 ms | 32.5% bf16 MFU | 125817 tok/s step 17729/19560 | loss 3.262804 (-0.46z)| norm 0.2087 (-1.17z)| lr 1.91e-05 | 4156.30 ms | 32.5% bf16 MFU | 125834 tok/s step 17730/19560 | loss 3.280286 (-0.02z)| norm 0.2149 (-0.32z)| lr 1.91e-05 | 4161.27 ms | 32.4% bf16 MFU | 125841 tok/s step 17731/19560 | loss 3.351347 (+1.74z)| norm 0.2390 (+2.95z)| lr 1.90e-05 | 4157.88 ms | 32.5% bf16 MFU | 125854 tok/s step 17732/19560 | loss 3.240779 (-1.03z)| norm 0.2081 (-1.24z)| lr 1.90e-05 | 4157.25 ms | 32.5% bf16 MFU | 125867 tok/s step 17733/19560 | loss 3.278847 (-0.08z)| norm 0.2073 (-1.33z)| lr 1.90e-05 | 4173.31 ms | 32.4% bf16 MFU | 125855 tok/s step 17734/19560 | loss 3.264563 (-0.43z)| norm 0.2136 (-0.46z)| lr 1.90e-05 | 4158.69 ms | 32.5% bf16 MFU | 125866 tok/s step 17735/19560 | loss 3.279767 (-0.04z)| norm 0.2210 (+0.55z)| lr 1.90e-05 | 4157.69 ms | 32.5% bf16 MFU | 125878 tok/s step 17736/19560 | loss 3.273412 (-0.19z)| norm 0.2333 (+2.25z)| lr 1.89e-05 | 4165.01 ms | 32.4% bf16 MFU | 125878 tok/s step 17737/19560 | loss 3.272742 (-0.20z)| norm 0.2283 (+1.52z)| lr 1.89e-05 | 4151.84 ms | 32.5% bf16 MFU | 125898 tok/s step 17738/19560 | loss 3.256114 (-0.62z)| norm 0.2191 (+0.27z)| lr 1.89e-05 | 4162.34 ms | 32.4% bf16 MFU | 125901 tok/s step 17739/19560 | loss 3.266830 (-0.35z)| norm 0.2153 (-0.26z)| lr 1.89e-05 | 4168.02 ms | 32.4% bf16 MFU | 125895 tok/s step 17740/19560 | loss 3.322388 (+1.06z)| norm 0.2317 (+1.95z)| lr 1.89e-05 | 4170.11 ms | 32.4% bf16 MFU | 125887 tok/s step 17741/19560 | loss 3.203956 (-1.94z)| norm 0.2333 (+2.10z)| lr 1.88e-05 | 4171.27 ms | 32.4% bf16 MFU | 125877 tok/s step 17742/19560 | loss 3.232944 (-1.19z)| norm 0.2152 (-0.28z)| lr 1.88e-05 | 4169.31 ms | 32.4% bf16 MFU | 125871 tok/s step 17743/19560 | loss 3.253744 (-0.66z)| norm 0.2176 (+0.04z)| lr 1.88e-05 | 
4161.83 ms | 32.4% bf16 MFU | 125876 tok/s step 17744/19560 | loss 3.287959 (+0.21z)| norm 0.2216 (+0.57z)| lr 1.88e-05 | 4170.26 ms | 32.4% bf16 MFU | 125868 tok/s step 17745/19560 | loss 3.271094 (-0.22z)| norm 0.2074 (-1.31z)| lr 1.87e-05 | 4167.17 ms | 32.4% bf16 MFU | 125865 tok/s step 17746/19560 | loss 3.299202 (+0.48z)| norm 0.2361 (+2.43z)| lr 1.87e-05 | 4168.29 ms | 32.4% bf16 MFU | 125861 tok/s step 17747/19560 | loss 3.292846 (+0.31z)| norm 0.2101 (-0.94z)| lr 1.87e-05 | 4160.70 ms | 32.5% bf16 MFU | 125868 tok/s step 17748/19560 | loss 3.248030 (-0.83z)| norm 0.2185 (+0.15z)| lr 1.87e-05 | 4166.09 ms | 32.4% bf16 MFU | 125867 tok/s step 17749/19560 | loss 3.361671 (+2.01z)| norm 0.2285 (+1.42z)| lr 1.87e-05 | 4157.89 ms | 32.5% bf16 MFU | 125879 tok/s step 17750/19560 | loss 3.257466 (-0.61z)| norm 0.2099 (-0.98z)| lr 1.86e-05 | 4162.82 ms | 32.4% bf16 MFU | 125882 tok/s val loss 3.255516 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating 
HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 
990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3065/10042 = 0.305218 step 17751/19560 | loss 3.218581 (-1.57z)| norm 0.2128 (-0.60z)| lr 1.86e-05 | 4164.88 ms | 32.4% bf16 MFU | 125882 tok/s step 17752/19560 | loss 3.269527 (-0.28z)| norm 0.2147 (-0.35z)| lr 1.86e-05 | 4159.45 ms | 32.5% bf16 MFU | 125890 tok/s step 17753/19560 | loss 3.268971 (-0.29z)| norm 0.2058 (-1.49z)| lr 1.86e-05 | 4162.81 ms | 32.4% bf16 MFU | 125893 tok/s step 17754/19560 | loss 3.232507 (-1.22z)| norm 0.2195 (+0.28z)| lr 1.86e-05 | 4166.37 ms | 32.4% bf16 MFU | 125890 tok/s step 17755/19560 | loss 3.295331 (+0.42z)| norm 0.2146 (-0.34z)| lr 1.85e-05 | 4163.07 ms | 32.4% bf16 MFU | 125893 tok/s step 17756/19560 | loss 3.223673 (-1.44z)| norm 0.2131 (-0.54z)| lr 1.85e-05 | 4174.45 ms | 32.3% bf16 MFU | 125878 tok/s step 17757/19560 | loss 3.226617 (-1.34z)| norm 0.2147 (-0.33z)| lr 1.85e-05 | 4224.60 ms | 32.0% bf16 MFU | 125789 tok/s step 17758/19560 | loss 3.251558 (-0.69z)| norm 0.2180 (+0.10z)| lr 1.85e-05 | 4163.41 ms | 32.4% bf16 MFU | 125796 tok/s step 17759/19560 | loss 3.274191 (-0.10z)| norm 0.2162 (-0.12z)| lr 1.85e-05 | 4166.76 ms | 32.4% bf16 MFU | 125798 tok/s step 17760/19560 | loss 
3.278034 (-0.03z)| norm 0.2249 (+1.01z)| lr 1.84e-05 | 4167.62 ms | 32.4% bf16 MFU | 125798 tok/s step 17761/19560 | loss 3.281779 (+0.07z)| norm 0.2123 (-0.64z)| lr 1.84e-05 | 4161.56 ms | 32.4% bf16 MFU | 125807 tok/s step 17762/19560 | loss 3.243762 (-0.93z)| norm 0.2103 (-0.90z)| lr 1.84e-05 | 4171.80 ms | 32.4% bf16 MFU | 125800 tok/s step 17763/19560 | loss 3.277397 (-0.04z)| norm 0.2224 (+0.68z)| lr 1.84e-05 | 4230.82 ms | 31.9% bf16 MFU | 125706 tok/s step 17764/19560 | loss 3.267752 (-0.30z)| norm 0.2079 (-1.22z)| lr 1.84e-05 | 4155.59 ms | 32.5% bf16 MFU | 125729 tok/s step 17765/19560 | loss 3.230974 (-1.26z)| norm 0.2170 (-0.03z)| lr 1.83e-05 | 4162.53 ms | 32.4% bf16 MFU | 125741 tok/s step 17766/19560 | loss 3.215820 (-1.63z)| norm 0.2065 (-1.39z)| lr 1.83e-05 | 4171.43 ms | 32.4% bf16 MFU | 125738 tok/s step 17767/19560 | loss 3.271001 (-0.19z)| norm 0.2142 (-0.38z)| lr 1.83e-05 | 4161.89 ms | 32.4% bf16 MFU | 125750 tok/s step 17768/19560 | loss 3.240643 (-0.97z)| norm 0.2036 (-1.73z)| lr 1.83e-05 | 4176.09 ms | 32.3% bf16 MFU | 125739 tok/s step 17769/19560 | loss 3.221526 (-1.46z)| norm 0.2105 (-0.84z)| lr 1.83e-05 | 4169.05 ms | 32.4% bf16 MFU | 125740 tok/s step 17770/19560 | loss 3.171003 (-2.69z)| norm 0.2059 (-1.40z)| lr 1.82e-05 | 4160.38 ms | 32.5% bf16 MFU | 125754 tok/s step 17771/19560 | loss 3.357773 (+2.00z)| norm 0.2123 (-0.60z)| lr 1.82e-05 | 4194.45 ms | 32.2% bf16 MFU | 125716 tok/s step 17772/19560 | loss 3.279991 (+0.04z)| norm 0.2121 (-0.63z)| lr 1.82e-05 | 4170.72 ms | 32.4% bf16 MFU | 125716 tok/s step 17773/19560 | loss 3.260742 (-0.44z)| norm 0.2076 (-1.19z)| lr 1.82e-05 | 4169.27 ms | 32.4% bf16 MFU | 125718 tok/s step 17774/19560 | loss 3.269152 (-0.23z)| norm 0.2262 (+1.17z)| lr 1.82e-05 | 4168.26 ms | 32.4% bf16 MFU | 125721 tok/s step 17775/19560 | loss 3.200796 (-1.93z)| norm 0.2211 (+0.51z)| lr 1.81e-05 | 4158.64 ms | 32.5% bf16 MFU | 125738 tok/s step 17776/19560 | loss 3.260486 (-0.43z)| norm 0.2190 (+0.24z)| lr 
1.81e-05 | 4152.42 ms | 32.5% bf16 MFU | 125764 tok/s step 17777/19560 | loss 3.195750 (-2.01z)| norm 0.2276 (+1.32z)| lr 1.81e-05 | 4163.32 ms | 32.4% bf16 MFU | 125773 tok/s step 17778/19560 | loss 3.258181 (-0.47z)| norm 0.2184 (+0.14z)| lr 1.81e-05 | 4161.79 ms | 32.4% bf16 MFU | 125783 tok/s step 17779/19560 | loss 3.280797 (+0.10z)| norm 0.2138 (-0.44z)| lr 1.81e-05 | 4157.37 ms | 32.5% bf16 MFU | 125799 tok/s step 17780/19560 | loss 3.360356 (+2.02z)| norm 0.2312 (+1.74z)| lr 1.80e-05 | 4162.22 ms | 32.4% bf16 MFU | 125807 tok/s step 17781/19560 | loss 3.269531 (-0.20z)| norm 0.2290 (+1.44z)| lr 1.80e-05 | 4173.19 ms | 32.4% bf16 MFU | 125799 tok/s step 17782/19560 | loss 3.290211 (+0.31z)| norm 0.2271 (+1.18z)| lr 1.80e-05 | 4165.24 ms | 32.4% bf16 MFU | 125802 tok/s step 17783/19560 | loss 3.274577 (-0.07z)| norm 0.2142 (-0.42z)| lr 1.80e-05 | 4167.82 ms | 32.4% bf16 MFU | 125802 tok/s step 17784/19560 | loss 3.279528 (+0.04z)| norm 0.2269 (+1.14z)| lr 1.80e-05 | 4160.98 ms | 32.4% bf16 MFU | 125812 tok/s step 17785/19560 | loss 3.318400 (+0.99z)| norm 0.2155 (-0.29z)| lr 1.79e-05 | 4163.42 ms | 32.4% bf16 MFU | 125818 tok/s step 17786/19560 | loss 3.231107 (-1.14z)| norm 0.2129 (-0.61z)| lr 1.79e-05 | 4165.24 ms | 32.4% bf16 MFU | 125820 tok/s step 17787/19560 | loss 3.296094 (+0.45z)| norm 0.2132 (-0.55z)| lr 1.79e-05 | 4175.13 ms | 32.3% bf16 MFU | 125808 tok/s step 17788/19560 | loss 3.276051 (-0.03z)| norm 0.2127 (-0.61z)| lr 1.79e-05 | 4159.57 ms | 32.5% bf16 MFU | 125820 tok/s step 17789/19560 | loss 3.285455 (+0.22z)| norm 0.2193 (+0.23z)| lr 1.79e-05 | 4165.31 ms | 32.4% bf16 MFU | 125822 tok/s step 17790/19560 | loss 3.274601 (-0.06z)| norm 0.2252 (+0.99z)| lr 1.78e-05 | 4168.67 ms | 32.4% bf16 MFU | 125820 tok/s step 17791/19560 | loss 3.382493 (+2.55z)| norm 0.2184 (+0.12z)| lr 1.78e-05 | 4162.02 ms | 32.4% bf16 MFU | 125827 tok/s step 17792/19560 | loss 3.259604 (-0.43z)| norm 0.2130 (-0.57z)| lr 1.78e-05 | 4170.28 ms | 32.4% bf16 MFU | 125822 
tok/s step 17793/19560 | loss 3.308855 (+0.75z)| norm 0.2153 (-0.29z)| lr 1.78e-05 | 4166.71 ms | 32.4% bf16 MFU | 125822 tok/s step 17794/19560 | loss 3.316185 (+0.94z)| norm 0.2228 (+0.68z)| lr 1.78e-05 | 4172.81 ms | 32.4% bf16 MFU | 125813 tok/s step 17795/19560 | loss 3.262964 (-0.36z)| norm 0.2109 (-0.85z)| lr 1.77e-05 | 4168.84 ms | 32.4% bf16 MFU | 125811 tok/s step 17796/19560 | loss 3.293824 (+0.40z)| norm 0.2179 (+0.07z)| lr 1.77e-05 | 4158.55 ms | 32.5% bf16 MFU | 125824 tok/s step 17797/19560 | loss 3.272795 (-0.10z)| norm 0.2158 (-0.21z)| lr 1.77e-05 | 4162.46 ms | 32.4% bf16 MFU | 125831 tok/s step 17798/19560 | loss 3.301683 (+0.62z)| norm 0.2193 (+0.25z)| lr 1.77e-05 | 4164.17 ms | 32.4% bf16 MFU | 125834 tok/s step 17799/19560 | loss 3.231161 (-1.12z)| norm 0.2229 (+0.74z)| lr 1.77e-05 | 4163.82 ms | 32.4% bf16 MFU | 125838 tok/s step 17800/19560 | loss 3.267042 (-0.23z)| norm 0.2098 (-1.03z)| lr 1.76e-05 | 4158.12 ms | 32.5% bf16 MFU | 125851 tok/s step 17801/19560 | loss 3.262477 (-0.33z)| norm 0.2138 (-0.50z)| lr 1.76e-05 | 4162.82 ms | 32.4% bf16 MFU | 125856 tok/s step 17802/19560 | loss 3.270442 (-0.13z)| norm 0.2129 (-0.61z)| lr 1.76e-05 | 4154.43 ms | 32.5% bf16 MFU | 125873 tok/s step 17803/19560 | loss 3.272701 (-0.08z)| norm 0.2097 (-1.03z)| lr 1.76e-05 | 4164.56 ms | 32.4% bf16 MFU | 125874 tok/s step 17804/19560 | loss 3.231769 (-1.12z)| norm 0.2130 (-0.58z)| lr 1.76e-05 | 4157.32 ms | 32.5% bf16 MFU | 125886 tok/s step 17805/19560 | loss 3.315662 (+1.01z)| norm 0.2095 (-1.04z)| lr 1.75e-05 | 4162.50 ms | 32.4% bf16 MFU | 125889 tok/s step 17806/19560 | loss 3.270739 (-0.13z)| norm 0.2174 (+0.00z)| lr 1.75e-05 | 4158.46 ms | 32.5% bf16 MFU | 125899 tok/s step 17807/19560 | loss 3.322708 (+1.18z)| norm 0.2346 (+2.25z)| lr 1.75e-05 | 4156.10 ms | 32.5% bf16 MFU | 125911 tok/s step 17808/19560 | loss 3.306834 (+0.78z)| norm 0.2148 (-0.35z)| lr 1.75e-05 | 4159.70 ms | 32.5% bf16 MFU | 125918 tok/s step 17809/19560 | loss 3.284411 
(+0.20z)| norm 0.2176 (+0.01z)| lr 1.75e-05 | 4163.92 ms | 32.4% bf16 MFU | 125917 tok/s step 17810/19560 | loss 3.293125 (+0.41z)| norm 0.2169 (-0.09z)| lr 1.74e-05 | 4170.31 ms | 32.4% bf16 MFU | 125907 tok/s step 17811/19560 | loss 3.229561 (-1.21z)| norm 0.2123 (-0.68z)| lr 1.74e-05 | 4160.13 ms | 32.5% bf16 MFU | 125913 tok/s step 17812/19560 | loss 3.291513 (+0.40z)| norm 0.2116 (-0.76z)| lr 1.74e-05 | 4164.90 ms | 32.4% bf16 MFU | 125912 tok/s step 17813/19560 | loss 3.325917 (+1.28z)| norm 0.2244 (+0.92z)| lr 1.74e-05 | 4193.56 ms | 32.2% bf16 MFU | 125867 tok/s step 17814/19560 | loss 3.316285 (+1.04z)| norm 0.2249 (+1.00z)| lr 1.74e-05 | 4129.65 ms | 32.7% bf16 MFU | 125922 tok/s step 17815/19560 | loss 3.269029 (-0.20z)| norm 0.2049 (-1.62z)| lr 1.73e-05 | 4132.55 ms | 32.7% bf16 MFU | 125969 tok/s step 17816/19560 | loss 3.341686 (+1.67z)| norm 0.2186 (+0.17z)| lr 1.73e-05 | 4142.58 ms | 32.6% bf16 MFU | 125999 tok/s step 17817/19560 | loss 3.211810 (-1.66z)| norm 0.2131 (-0.55z)| lr 1.73e-05 | 4155.16 ms | 32.5% bf16 MFU | 126008 tok/s step 17818/19560 | loss 3.300692 (+0.60z)| norm 0.2175 (+0.00z)| lr 1.73e-05 | 4151.28 ms | 32.5% bf16 MFU | 126022 tok/s step 17819/19560 | loss 3.244895 (-0.82z)| norm 0.2122 (-0.69z)| lr 1.73e-05 | 4153.50 ms | 32.5% bf16 MFU | 126032 tok/s step 17820/19560 | loss 3.280299 (+0.11z)| norm 0.2143 (-0.39z)| lr 1.72e-05 | 4163.71 ms | 32.4% bf16 MFU | 126027 tok/s step 17821/19560 | loss 3.307955 (+0.83z)| norm 0.2074 (-1.30z)| lr 1.72e-05 | 4146.41 ms | 32.6% bf16 MFU | 126047 tok/s step 17822/19560 | loss 3.238800 (-0.97z)| norm 0.2140 (-0.41z)| lr 1.72e-05 | 4149.57 ms | 32.5% bf16 MFU | 126062 tok/s step 17823/19560 | loss 3.290312 (+0.37z)| norm 0.2217 (+0.63z)| lr 1.72e-05 | 4148.85 ms | 32.5% bf16 MFU | 126078 tok/s step 17824/19560 | loss 3.238798 (-0.97z)| norm 0.2102 (-0.91z)| lr 1.72e-05 | 4148.38 ms | 32.5% bf16 MFU | 126093 tok/s step 17825/19560 | loss 3.303483 (+0.74z)| norm 0.2125 (-0.61z)| lr 1.71e-05 | 
4159.64 ms | 32.5% bf16 MFU | 126091 tok/s step 17826/19560 | loss 3.366696 (+2.33z)| norm 0.2170 (+0.01z)| lr 1.71e-05 | 4163.50 ms | 32.4% bf16 MFU | 126082 tok/s step 17827/19560 | loss 3.276979 (+0.02z)| norm 0.2050 (-1.60z)| lr 1.71e-05 | 4155.61 ms | 32.5% bf16 MFU | 126086 tok/s step 17828/19560 | loss 3.273922 (-0.06z)| norm 0.2154 (-0.20z)| lr 1.71e-05 | 4155.88 ms | 32.5% bf16 MFU | 126090 tok/s step 17829/19560 | loss 3.250120 (-0.67z)| norm 0.2106 (-0.83z)| lr 1.71e-05 | 4163.94 ms | 32.4% bf16 MFU | 126081 tok/s step 17830/19560 | loss 3.348038 (+1.84z)| norm 0.2184 (+0.25z)| lr 1.70e-05 | 4151.45 ms | 32.5% bf16 MFU | 126091 tok/s step 17831/19560 | loss 3.317127 (+1.04z)| norm 0.2175 (+0.13z)| lr 1.70e-05 | 4162.24 ms | 32.4% bf16 MFU | 126085 tok/s step 17832/19560 | loss 3.340108 (+1.59z)| norm 0.2212 (+0.64z)| lr 1.70e-05 | 4165.29 ms | 32.4% bf16 MFU | 126074 tok/s step 17833/19560 | loss 3.315002 (+0.95z)| norm 0.2149 (-0.24z)| lr 1.70e-05 | 4161.92 ms | 32.4% bf16 MFU | 126069 tok/s step 17834/19560 | loss 3.299857 (+0.55z)| norm 0.2050 (-1.62z)| lr 1.70e-05 | 4162.56 ms | 32.4% bf16 MFU | 126063 tok/s step 17835/19560 | loss 3.271162 (-0.17z)| norm 0.2197 (+0.43z)| lr 1.69e-05 | 4164.41 ms | 32.4% bf16 MFU | 126055 tok/s step 17836/19560 | loss 3.311550 (+0.84z)| norm 0.2240 (+1.02z)| lr 1.69e-05 | 4162.51 ms | 32.4% bf16 MFU | 126050 tok/s step 17837/19560 | loss 3.238649 (-0.99z)| norm 0.2133 (-0.49z)| lr 1.69e-05 | 4156.38 ms | 32.5% bf16 MFU | 126055 tok/s step 17838/19560 | loss 3.335729 (+1.45z)| norm 0.2249 (+1.14z)| lr 1.69e-05 | 4163.69 ms | 32.4% bf16 MFU | 126048 tok/s step 17839/19560 | loss 3.284570 (+0.17z)| norm 0.2140 (-0.40z)| lr 1.69e-05 | 4170.86 ms | 32.4% bf16 MFU | 126030 tok/s step 17840/19560 | loss 3.318385 (+1.01z)| norm 0.2226 (+0.79z)| lr 1.69e-05 | 5134.49 ms | 26.3% bf16 MFU | 124835 tok/s step 17841/19560 | loss 3.273159 (-0.14z)| norm 0.2158 (-0.16z)| lr 1.68e-05 | 4638.53 ms | 29.1% bf16 MFU | 124244 tok/s step 
17842/19560 | loss 3.314955 (+0.90z)| norm 0.2186 (+0.24z)| lr 1.68e-05 | 4485.89 ms | 30.1% bf16 MFU | 123876 tok/s step 17843/19560 | loss 3.296479 (+0.45z)| norm 0.2183 (+0.20z)| lr 1.68e-05 | 4523.02 ms | 29.9% bf16 MFU | 123478 tok/s step 17844/19560 | loss 3.272102 (-0.17z)| norm 0.2051 (-1.62z)| lr 1.68e-05 | 4354.09 ms | 31.0% bf16 MFU | 123324 tok/s step 17845/19560 | loss 3.342388 (+1.60z)| norm 0.2154 (-0.19z)| lr 1.68e-05 | 4347.07 ms | 31.1% bf16 MFU | 123189 tok/s step 17846/19560 | loss 3.347628 (+1.69z)| norm 0.2144 (-0.34z)| lr 1.67e-05 | 4255.58 ms | 31.7% bf16 MFU | 123189 tok/s step 17847/19560 | loss 3.336457 (+1.39z)| norm 0.2327 (+2.14z)| lr 1.67e-05 | 4175.16 ms | 32.3% bf16 MFU | 123308 tok/s step 17848/19560 | loss 3.289154 (+0.22z)| norm 0.2150 (-0.26z)| lr 1.67e-05 | 4159.29 ms | 32.5% bf16 MFU | 123446 tok/s step 17849/19560 | loss 3.281645 (+0.02z)| norm 0.2130 (-0.52z)| lr 1.67e-05 | 4290.11 ms | 31.5% bf16 MFU | 123384 tok/s step 17850/19560 | loss 3.420826 (+3.38z)| norm 0.2245 (+1.05z)| lr 1.67e-05 | 4401.71 ms | 30.7% bf16 MFU | 123170 tok/s step 17851/19560 | loss 3.354040 (+1.75z)| norm 0.2072 (-1.32z)| lr 1.66e-05 | 4258.42 ms | 31.7% bf16 MFU | 123167 tok/s step 17852/19560 | loss 3.290259 (+0.21z)| norm 0.2099 (-0.95z)| lr 1.66e-05 | 4160.21 ms | 32.5% bf16 MFU | 123310 tok/s step 17853/19560 | loss 3.289750 (+0.22z)| norm 0.2464 (+3.83z)| lr 1.66e-05 | 4406.63 ms | 30.6% bf16 MFU | 123094 tok/s step 17854/19560 | loss 3.268223 (-0.31z)| norm 0.2212 (+0.54z)| lr 1.66e-05 | 4223.49 ms | 32.0% bf16 MFU | 123146 tok/s step 17855/19560 | loss 3.307103 (+0.65z)| norm 0.2053 (-1.50z)| lr 1.66e-05 | 4164.12 ms | 32.4% bf16 MFU | 123284 tok/s step 17856/19560 | loss 3.379565 (+2.37z)| norm 0.2176 (+0.10z)| lr 1.65e-05 | 4159.30 ms | 32.5% bf16 MFU | 123422 tok/s step 17857/19560 | loss 3.314786 (+0.79z)| norm 0.2166 (-0.05z)| lr 1.65e-05 | 4175.31 ms | 32.3% bf16 MFU | 123530 tok/s step 17858/19560 | loss 3.278530 (-0.09z)| norm 
0.2091 (-1.01z)| lr 1.65e-05 | 4172.23 ms | 32.4% bf16 MFU | 123636 tok/s step 17859/19560 | loss 3.385208 (+2.45z)| norm 0.2251 (+1.10z)| lr 1.65e-05 | 4158.54 ms | 32.5% bf16 MFU | 123758 tok/s step 17860/19560 | loss 3.274907 (-0.19z)| norm 0.2156 (-0.16z)| lr 1.65e-05 | 4156.15 ms | 32.5% bf16 MFU | 123878 tok/s step 17861/19560 | loss 3.325441 (+1.01z)| norm 0.2071 (-1.30z)| lr 1.64e-05 | 4175.18 ms | 32.3% bf16 MFU | 123962 tok/s step 17862/19560 | loss 3.315976 (+0.77z)| norm 0.2237 (+0.91z)| lr 1.64e-05 | 4176.95 ms | 32.3% bf16 MFU | 124040 tok/s step 17863/19560 | loss 3.403227 (+2.74z)| norm 0.2281 (+1.47z)| lr 1.64e-05 | 4173.93 ms | 32.3% bf16 MFU | 124119 tok/s step 17864/19560 | loss 3.298018 (+0.31z)| norm 0.2155 (-0.18z)| lr 1.64e-05 | 4168.01 ms | 32.4% bf16 MFU | 124202 tok/s step 17865/19560 | loss 3.272901 (-0.27z)| norm 0.2172 (+0.06z)| lr 1.64e-05 | 4295.73 ms | 31.4% bf16 MFU | 124094 tok/s step 17866/19560 | loss 3.336924 (+1.19z)| norm 0.2184 (+0.23z)| lr 1.64e-05 | 4173.10 ms | 32.4% bf16 MFU | 124171 tok/s step 17867/19560 | loss 3.253942 (-0.72z)| norm 0.2164 (-0.05z)| lr 1.63e-05 | 4178.06 ms | 32.3% bf16 MFU | 124237 tok/s step 17868/19560 | loss 3.308679 (+0.54z)| norm 0.2128 (-0.53z)| lr 1.63e-05 | 4169.05 ms | 32.4% bf16 MFU | 124313 tok/s step 17869/19560 | loss 3.288234 (+0.06z)| norm 0.2100 (-0.91z)| lr 1.63e-05 | 4249.84 ms | 31.8% bf16 MFU | 124266 tok/s step 17870/19560 | loss 3.283020 (-0.07z)| norm 0.2021 (-1.97z)| lr 1.63e-05 | 4203.05 ms | 32.1% bf16 MFU | 124290 tok/s step 17871/19560 | loss 3.274644 (-0.27z)| norm 0.2088 (-1.02z)| lr 1.63e-05 | 4181.81 ms | 32.3% bf16 MFU | 124344 tok/s step 17872/19560 | loss 3.229239 (-1.32z)| norm 0.2113 (-0.67z)| lr 1.62e-05 | 4183.25 ms | 32.3% bf16 MFU | 124393 tok/s step 17873/19560 | loss 3.316405 (+0.71z)| norm 0.2134 (-0.39z)| lr 1.62e-05 | 5266.26 ms | 25.6% bf16 MFU | 123151 tok/s step 17874/19560 | loss 3.280055 (-0.14z)| norm 0.2100 (-0.86z)| lr 1.62e-05 | 4170.44 ms | 
32.4% bf16 MFU | 123279 tok/s step 17875/19560 | loss 3.283490 (-0.06z)| norm 0.2145 (-0.22z)| lr 1.62e-05 | 4166.04 ms | 32.4% bf16 MFU | 123408 tok/s step 17876/19560 | loss 3.306960 (+0.48z)| norm 0.2316 (+2.18z)| lr 1.62e-05 | 4219.03 ms | 32.0% bf16 MFU | 123451 tok/s step 17877/19560 | loss 3.319059 (+0.78z)| norm 0.2041 (-1.67z)| lr 1.61e-05 | 4186.17 ms | 32.3% bf16 MFU | 123540 tok/s step 17878/19560 | loss 3.303621 (+0.41z)| norm 0.2209 (+0.69z)| lr 1.61e-05 | 4167.93 ms | 32.4% bf16 MFU | 123653 tok/s step 17879/19560 | loss 3.291290 (+0.10z)| norm 0.2119 (-0.58z)| lr 1.61e-05 | 4180.98 ms | 32.3% bf16 MFU | 123740 tok/s step 17880/19560 | loss 3.268977 (-0.43z)| norm 0.2138 (-0.31z)| lr 1.61e-05 | 4220.06 ms | 32.0% bf16 MFU | 123765 tok/s step 17881/19560 | loss 3.323879 (+0.87z)| norm 0.2226 (+0.91z)| lr 1.61e-05 | 4171.09 ms | 32.4% bf16 MFU | 123862 tok/s step 17882/19560 | loss 3.274581 (-0.32z)| norm 0.2285 (+1.73z)| lr 1.60e-05 | 4172.85 ms | 32.4% bf16 MFU | 123951 tok/s step 17883/19560 | loss 3.345361 (+1.36z)| norm 0.2203 (+0.57z)| lr 1.60e-05 | 4180.26 ms | 32.3% bf16 MFU | 124024 tok/s step 17884/19560 | loss 3.279289 (-0.22z)| norm 0.2147 (-0.23z)| lr 1.60e-05 | 4222.40 ms | 32.0% bf16 MFU | 124031 tok/s step 17885/19560 | loss 3.312144 (+0.55z)| norm 0.2140 (-0.32z)| lr 1.60e-05 | 4164.80 ms | 32.4% bf16 MFU | 124124 tok/s step 17886/19560 | loss 3.315894 (+0.63z)| norm 0.2162 (-0.01z)| lr 1.60e-05 | 4163.69 ms | 32.4% bf16 MFU | 124214 tok/s step 17887/19560 | loss 3.346817 (+1.36z)| norm 0.2027 (-1.86z)| lr 1.60e-05 | 4168.81 ms | 32.4% bf16 MFU | 124291 tok/s step 17888/19560 | loss 3.274746 (-0.37z)| norm 0.2239 (+1.07z)| lr 1.59e-05 | 4178.83 ms | 32.3% bf16 MFU | 124350 tok/s step 17889/19560 | loss 3.301769 (+0.27z)| norm 0.2182 (+0.27z)| lr 1.59e-05 | 4172.35 ms | 32.4% bf16 MFU | 124415 tok/s step 17890/19560 | loss 3.253820 (-0.89z)| norm 0.2077 (-1.17z)| lr 1.59e-05 | 4173.46 ms | 32.4% bf16 MFU | 124476 tok/s step 17891/19560 
| loss 3.303781 (+0.31z)| norm 0.2067 (-1.29z)| lr 1.59e-05 | 4170.34 ms | 32.4% bf16 MFU | 124538 tok/s step 17892/19560 | loss 3.295318 (+0.11z)| norm 0.2118 (-0.59z)| lr 1.59e-05 | 4196.84 ms | 32.2% bf16 MFU | 124557 tok/s step 17893/19560 | loss 3.298476 (+0.17z)| norm 0.2125 (-0.49z)| lr 1.58e-05 | 4164.20 ms | 32.4% bf16 MFU | 124625 tok/s step 17894/19560 | loss 3.275207 (-0.41z)| norm 0.2106 (-0.76z)| lr 1.58e-05 | 4179.18 ms | 32.3% bf16 MFU | 124666 tok/s step 17895/19560 | loss 3.316823 (+0.61z)| norm 0.2122 (-0.54z)| lr 1.58e-05 | 4304.09 ms | 31.4% bf16 MFU | 124523 tok/s step 17896/19560 | loss 3.365285 (+1.77z)| norm 0.2225 (+0.88z)| lr 1.58e-05 | 4172.76 ms | 32.4% bf16 MFU | 124579 tok/s step 17897/19560 | loss 3.293339 (-0.01z)| norm 0.2123 (-0.54z)| lr 1.58e-05 | 4167.12 ms | 32.4% bf16 MFU | 124641 tok/s step 17898/19560 | loss 3.242851 (-1.32z)| norm 0.2121 (-0.58z)| lr 1.57e-05 | 4247.56 ms | 31.8% bf16 MFU | 124581 tok/s step 17899/19560 | loss 3.311279 (+0.45z)| norm 0.2170 (+0.10z)| lr 1.57e-05 | 4300.12 ms | 31.4% bf16 MFU | 124448 tok/s step 17900/19560 | loss 3.253888 (-1.03z)| norm 0.2121 (-0.59z)| lr 1.57e-05 | 4169.05 ms | 32.4% bf16 MFU | 124513 tok/s step 17901/19560 | loss 3.305507 (+0.29z)| norm 0.2131 (-0.46z)| lr 1.57e-05 | 4173.14 ms | 32.4% bf16 MFU | 124569 tok/s step 17902/19560 | loss 3.338712 (+1.13z)| norm 0.2404 (+3.26z)| lr 1.57e-05 | 4196.55 ms | 32.2% bf16 MFU | 124588 tok/s step 17903/19560 | loss 3.298091 (+0.07z)| norm 0.2051 (-1.52z)| lr 1.56e-05 | 4181.09 ms | 32.3% bf16 MFU | 124628 tok/s step 17904/19560 | loss 3.254207 (-1.09z)| norm 0.2055 (-1.45z)| lr 1.56e-05 | 4171.30 ms | 32.4% bf16 MFU | 124681 tok/s step 17905/19560 | loss 3.302291 (+0.16z)| norm 0.2067 (-1.26z)| lr 1.56e-05 | 4163.94 ms | 32.4% bf16 MFU | 124743 tok/s step 17906/19560 | loss 3.276877 (-0.53z)| norm 0.2205 (+0.59z)| lr 1.56e-05 | 4173.15 ms | 32.4% bf16 MFU | 124787 tok/s step 17907/19560 | loss 3.311480 (+0.40z)| norm 0.2025 (-1.80z)| 
lr 1.56e-05 | 4172.75 ms | 32.4% bf16 MFU | 124830 tok/s step 17908/19560 | loss 3.271194 (-0.68z)| norm 0.1989 (-2.23z)| lr 1.56e-05 | 4313.61 ms | 31.3% bf16 MFU | 124666 tok/s step 17909/19560 | loss 3.278176 (-0.49z)| norm 0.2151 (-0.07z)| lr 1.55e-05 | 4170.27 ms | 32.4% bf16 MFU | 124718 tok/s step 17910/19560 | loss 3.305562 (+0.26z)| norm 0.2080 (-1.01z)| lr 1.55e-05 | 4217.21 ms | 32.0% bf16 MFU | 124698 tok/s step 17911/19560 | loss 3.303900 (+0.21z)| norm 0.2285 (+1.72z)| lr 1.55e-05 | 4206.17 ms | 32.1% bf16 MFU | 124696 tok/s step 17912/19560 | loss 3.367880 (+1.92z)| norm 0.2235 (+1.06z)| lr 1.55e-05 | 4175.80 ms | 32.3% bf16 MFU | 124739 tok/s step 17913/19560 | loss 3.275447 (-0.58z)| norm 0.2117 (-0.52z)| lr 1.55e-05 | 4172.63 ms | 32.4% bf16 MFU | 124784 tok/s step 17914/19560 | loss 3.287872 (-0.26z)| norm 0.2085 (-0.94z)| lr 1.54e-05 | 4176.20 ms | 32.3% bf16 MFU | 124822 tok/s step 17915/19560 | loss 3.308668 (+0.31z)| norm 0.2243 (+1.16z)| lr 1.54e-05 | 4228.58 ms | 31.9% bf16 MFU | 124780 tok/s step 17916/19560 | loss 3.318711 (+0.58z)| norm 0.2188 (+0.42z)| lr 1.54e-05 | 4162.80 ms | 32.4% bf16 MFU | 124839 tok/s step 17917/19560 | loss 3.304643 (+0.19z)| norm 0.2119 (-0.49z)| lr 1.54e-05 | 4161.70 ms | 32.4% bf16 MFU | 124896 tok/s step 17918/19560 | loss 3.256938 (-1.11z)| norm 0.2030 (-1.64z)| lr 1.54e-05 | 4168.23 ms | 32.4% bf16 MFU | 124940 tok/s step 17919/19560 | loss 3.278753 (-0.50z)| norm 0.2060 (-1.23z)| lr 1.54e-05 | 4168.43 ms | 32.4% bf16 MFU | 124982 tok/s step 17920/19560 | loss 3.289094 (-0.22z)| norm 0.2132 (-0.28z)| lr 1.53e-05 | 4262.92 ms | 31.7% bf16 MFU | 124882 tok/s step 17921/19560 | loss 3.277504 (-0.54z)| norm 0.2014 (-1.80z)| lr 1.53e-05 | 4169.95 ms | 32.4% bf16 MFU | 124925 tok/s step 17922/19560 | loss 3.278814 (-0.50z)| norm 0.2130 (-0.28z)| lr 1.53e-05 | 4177.99 ms | 32.3% bf16 MFU | 124953 tok/s step 17923/19560 | loss 3.280985 (-0.44z)| norm 0.2083 (-0.89z)| lr 1.53e-05 | 4175.10 ms | 32.3% bf16 MFU | 
124984 tok/s step 17924/19560 | loss 3.290939 (-0.16z)| norm 0.2033 (-1.50z)| lr 1.53e-05 | 4177.02 ms | 32.3% bf16 MFU | 125011 tok/s step 17925/19560 | loss 3.326833 (+0.84z)| norm 0.2090 (-0.76z)| lr 1.52e-05 | 4167.92 ms | 32.4% bf16 MFU | 125050 tok/s step 17926/19560 | loss 3.290264 (-0.19z)| norm 0.2197 (+0.61z)| lr 1.52e-05 | 4176.58 ms | 32.3% bf16 MFU | 125074 tok/s step 17927/19560 | loss 3.333212 (+1.01z)| norm 0.2039 (-1.39z)| lr 1.52e-05 | 4188.70 ms | 32.2% bf16 MFU | 125078 tok/s step 17928/19560 | loss 3.269937 (-0.79z)| norm 0.2033 (-1.46z)| lr 1.52e-05 | 4171.22 ms | 32.4% bf16 MFU | 125109 tok/s step 17929/19560 | loss 3.339914 (+1.18z)| norm 0.2148 (+0.01z)| lr 1.52e-05 | 4181.14 ms | 32.3% bf16 MFU | 125123 tok/s step 17930/19560 | loss 3.310140 (+0.33z)| norm 0.2120 (-0.35z)| lr 1.51e-05 | 4167.35 ms | 32.4% bf16 MFU | 125157 tok/s step 17931/19560 | loss 3.253953 (-1.26z)| norm 0.2174 (+0.33z)| lr 1.51e-05 | 4222.76 ms | 32.0% bf16 MFU | 125107 tok/s step 17932/19560 | loss 3.301810 (+0.08z)| norm 0.2065 (-1.04z)| lr 1.51e-05 | 4181.34 ms | 32.3% bf16 MFU | 125121 tok/s step 17933/19560 | loss 3.289377 (-0.27z)| norm 0.2145 (-0.04z)| lr 1.51e-05 | 4172.10 ms | 32.4% bf16 MFU | 125149 tok/s step 17934/19560 | loss 3.304377 (+0.15z)| norm 0.2231 (+1.05z)| lr 1.51e-05 | 4167.85 ms | 32.4% bf16 MFU | 125181 tok/s step 17935/19560 | loss 3.264596 (-0.98z)| norm 0.2167 (+0.26z)| lr 1.51e-05 | 4173.16 ms | 32.4% bf16 MFU | 125204 tok/s step 17936/19560 | loss 3.283132 (-0.44z)| norm 0.2225 (+1.00z)| lr 1.50e-05 | 4173.27 ms | 32.4% bf16 MFU | 125225 tok/s step 17937/19560 | loss 3.321763 (+0.66z)| norm 0.2192 (+0.57z)| lr 1.50e-05 | 4167.52 ms | 32.4% bf16 MFU | 125254 tok/s step 17938/19560 | loss 3.311110 (+0.35z)| norm 0.2217 (+0.89z)| lr 1.50e-05 | 4176.59 ms | 32.3% bf16 MFU | 125268 tok/s step 17939/19560 | loss 3.316367 (+0.49z)| norm 0.2144 (-0.05z)| lr 1.50e-05 | 4174.51 ms | 32.3% bf16 MFU | 125284 tok/s step 17940/19560 | loss 3.251818 
(-1.37z)| norm 0.2111 (-0.47z)| lr 1.50e-05 | 4176.64 ms | 32.3% bf16 MFU | 125296 tok/s step 17941/19560 | loss 3.244004 (-1.57z)| norm 0.2183 (+0.46z)| lr 1.49e-05 | 4177.67 ms | 32.3% bf16 MFU | 125306 tok/s step 17942/19560 | loss 3.275852 (-0.64z)| norm 0.2153 (+0.08z)| lr 1.49e-05 | 4172.17 ms | 32.4% bf16 MFU | 125324 tok/s step 17943/19560 | loss 3.290088 (-0.24z)| norm 0.2131 (-0.22z)| lr 1.49e-05 | 4176.93 ms | 32.3% bf16 MFU | 125334 tok/s step 17944/19560 | loss 3.285580 (-0.36z)| norm 0.2095 (-0.67z)| lr 1.49e-05 | 4163.04 ms | 32.4% bf16 MFU | 125364 tok/s step 17945/19560 | loss 3.281588 (-0.50z)| norm 0.2200 (+0.68z)| lr 1.49e-05 | 4171.79 ms | 32.4% bf16 MFU | 125380 tok/s step 17946/19560 | loss 3.311689 (+0.38z)| norm 0.2191 (+0.57z)| lr 1.49e-05 | 4173.57 ms | 32.4% bf16 MFU | 125392 tok/s step 17947/19560 | loss 3.233777 (-1.91z)| norm 0.2256 (+1.39z)| lr 1.48e-05 | 4178.27 ms | 32.3% bf16 MFU | 125396 tok/s step 17948/19560 | loss 3.284653 (-0.41z)| norm 0.2094 (-0.71z)| lr 1.48e-05 | 4167.81 ms | 32.4% bf16 MFU | 125416 tok/s step 17949/19560 | loss 3.302096 (+0.10z)| norm 0.2068 (-1.04z)| lr 1.48e-05 | 4159.56 ms | 32.5% bf16 MFU | 125447 tok/s step 17950/19560 | loss 3.271258 (-0.82z)| norm 0.2191 (+0.56z)| lr 1.48e-05 | 4173.73 ms | 32.3% bf16 MFU | 125456 tok/s step 17951/19560 | loss 3.223174 (-2.19z)| norm 0.2039 (-1.39z)| lr 1.48e-05 | 4169.55 ms | 32.4% bf16 MFU | 125470 tok/s step 17952/19560 | loss 3.309402 (+0.31z)| norm 0.2293 (+1.83z)| lr 1.47e-05 | 4175.93 ms | 32.3% bf16 MFU | 125474 tok/s step 17953/19560 | loss 3.341519 (+1.24z)| norm 0.2209 (+0.76z)| lr 1.47e-05 | 4186.54 ms | 32.3% bf16 MFU | 125462 tok/s step 17954/19560 | loss 3.230821 (-1.97z)| norm 0.2238 (+1.11z)| lr 1.47e-05 | 4171.69 ms | 32.4% bf16 MFU | 125473 tok/s step 17955/19560 | loss 3.268254 (-0.87z)| norm 0.2108 (-0.54z)| lr 1.47e-05 | 4176.23 ms | 32.3% bf16 MFU | 125476 tok/s step 17956/19560 | loss 3.339173 (+1.18z)| norm 0.2112 (-0.49z)| lr 1.47e-05 | 
4174.78 ms | 32.3% bf16 MFU | 125482 tok/s step 17957/19560 | loss 3.302002 (+0.09z)| norm 0.2156 (+0.07z)| lr 1.47e-05 | 4169.41 ms | 32.4% bf16 MFU | 125495 tok/s step 17958/19560 | loss 3.269474 (-0.85z)| norm 0.2063 (-1.10z)| lr 1.46e-05 | 4169.96 ms | 32.4% bf16 MFU | 125507 tok/s step 17959/19560 | loss 3.204449 (-2.67z)| norm 0.2134 (-0.19z)| lr 1.46e-05 | 4168.91 ms | 32.4% bf16 MFU | 125519 tok/s step 17960/19560 | loss 3.322107 (+0.72z)| norm 0.2138 (-0.13z)| lr 1.46e-05 | 4179.33 ms | 32.3% bf16 MFU | 125516 tok/s step 17961/19560 | loss 3.251630 (-1.30z)| norm 0.2089 (-0.74z)| lr 1.46e-05 | 4175.04 ms | 32.3% bf16 MFU | 125519 tok/s step 17962/19560 | loss 3.272166 (-0.70z)| norm 0.2132 (-0.21z)| lr 1.46e-05 | 4236.96 ms | 31.9% bf16 MFU | 125430 tok/s step 17963/19560 | loss 3.322611 (+0.73z)| norm 0.2147 (-0.02z)| lr 1.45e-05 | 4168.68 ms | 32.4% bf16 MFU | 125447 tok/s step 17964/19560 | loss 3.298122 (+0.03z)| norm 0.2138 (-0.12z)| lr 1.45e-05 | 4194.93 ms | 32.2% bf16 MFU | 125424 tok/s step 17965/19560 | loss 3.308187 (+0.31z)| norm 0.2081 (-0.85z)| lr 1.45e-05 | 4159.67 ms | 32.5% bf16 MFU | 125454 tok/s step 17966/19560 | loss 3.220043 (-2.19z)| norm 0.2494 (+4.14z)| lr 1.45e-05 | 4171.90 ms | 32.4% bf16 MFU | 125465 tok/s step 17967/19560 | loss 3.290439 (-0.18z)| norm 0.2174 (+0.29z)| lr 1.45e-05 | 4174.58 ms | 32.3% bf16 MFU | 125472 tok/s step 17968/19560 | loss 3.268166 (-0.80z)| norm 0.2162 (+0.16z)| lr 1.45e-05 | 4176.71 ms | 32.3% bf16 MFU | 125474 tok/s step 17969/19560 | loss 3.246010 (-1.42z)| norm 0.2173 (+0.29z)| lr 1.44e-05 | 4184.14 ms | 32.3% bf16 MFU | 125466 tok/s step 17970/19560 | loss 3.229467 (-1.85z)| norm 0.2129 (-0.23z)| lr 1.44e-05 | 4172.09 ms | 32.4% bf16 MFU | 125476 tok/s step 17971/19560 | loss 3.256005 (-1.09z)| norm 0.2109 (-0.46z)| lr 1.44e-05 | 4169.92 ms | 32.4% bf16 MFU | 125489 tok/s step 17972/19560 | loss 3.282536 (-0.35z)| norm 0.2094 (-0.65z)| lr 1.44e-05 | 4170.30 ms | 32.4% bf16 MFU | 125500 tok/s step 
17973/19560 | loss 3.350602 (+1.54z)| norm 0.2133 (-0.18z)| lr 1.44e-05 | 4173.77 ms | 32.3% bf16 MFU | 125506 tok/s step 17974/19560 | loss 3.260939 (-0.94z)| norm 0.2168 (+0.24z)| lr 1.43e-05 | 4175.68 ms | 32.3% bf16 MFU | 125508 tok/s step 17975/19560 | loss 3.276115 (-0.50z)| norm 0.2104 (-0.52z)| lr 1.43e-05 | 4168.01 ms | 32.4% bf16 MFU | 125522 tok/s step 17976/19560 | loss 3.286687 (-0.21z)| norm 0.2094 (-0.64z)| lr 1.43e-05 | 4166.13 ms | 32.4% bf16 MFU | 125539 tok/s step 17977/19560 | loss 3.277479 (-0.46z)| norm 0.2051 (-1.16z)| lr 1.43e-05 | 4171.37 ms | 32.4% bf16 MFU | 125546 tok/s step 17978/19560 | loss 3.250166 (-1.25z)| norm 0.2003 (-1.70z)| lr 1.43e-05 | 4169.66 ms | 32.4% bf16 MFU | 125556 tok/s step 17979/19560 | loss 3.297353 (+0.15z)| norm 0.2220 (+0.91z)| lr 1.43e-05 | 4169.50 ms | 32.4% bf16 MFU | 125565 tok/s step 17980/19560 | loss 3.255776 (-1.07z)| norm 0.2077 (-0.82z)| lr 1.42e-05 | 4192.44 ms | 32.2% bf16 MFU | 125540 tok/s step 17981/19560 | loss 3.266433 (-0.75z)| norm 0.2138 (-0.05z)| lr 1.42e-05 | 4168.90 ms | 32.4% bf16 MFU | 125551 tok/s step 17982/19560 | loss 3.250577 (-1.21z)| norm 0.2123 (-0.24z)| lr 1.42e-05 | 4176.75 ms | 32.3% bf16 MFU | 125549 tok/s step 17983/19560 | loss 3.282202 (-0.27z)| norm 0.2115 (-0.35z)| lr 1.42e-05 | 4176.09 ms | 32.3% bf16 MFU | 125549 tok/s step 17984/19560 | loss 3.287480 (-0.10z)| norm 0.2126 (-0.20z)| lr 1.42e-05 | 4175.07 ms | 32.3% bf16 MFU | 125550 tok/s step 17985/19560 | loss 3.329364 (+1.17z)| norm 0.2260 (+1.52z)| lr 1.41e-05 | 4189.22 ms | 32.2% bf16 MFU | 125531 tok/s step 17986/19560 | loss 3.212320 (-2.31z)| norm 0.2151 (+0.11z)| lr 1.41e-05 | 4165.17 ms | 32.4% bf16 MFU | 125548 tok/s step 17987/19560 | loss 3.284198 (-0.16z)| norm 0.2102 (-0.51z)| lr 1.41e-05 | 4271.17 ms | 31.6% bf16 MFU | 125408 tok/s step 17988/19560 | loss 3.292888 (+0.10z)| norm 0.2033 (-1.39z)| lr 1.41e-05 | 4322.51 ms | 31.2% bf16 MFU | 125202 tok/s step 17989/19560 | loss 3.311564 (+0.68z)| norm 
0.2098 (-0.55z)| lr 1.41e-05 | 4190.83 ms | 32.2% bf16 MFU | 125197 tok/s step 17990/19560 | loss 3.248980 (-1.22z)| norm 0.2062 (-1.00z)| lr 1.41e-05 | 4271.91 ms | 31.6% bf16 MFU | 125074 tok/s step 17991/19560 | loss 3.234551 (-1.69z)| norm 0.2164 (+0.34z)| lr 1.40e-05 | 4182.59 ms | 32.3% bf16 MFU | 125088 tok/s step 17992/19560 | loss 3.335835 (+1.51z)| norm 0.2127 (-0.14z)| lr 1.40e-05 | 4174.82 ms | 32.3% bf16 MFU | 125112 tok/s step 17993/19560 | loss 3.306878 (+0.59z)| norm 0.2167 (+0.38z)| lr 1.40e-05 | 4195.54 ms | 32.2% bf16 MFU | 125105 tok/s step 17994/19560 | loss 3.248373 (-1.24z)| norm 0.2075 (-0.81z)| lr 1.40e-05 | 4169.47 ms | 32.4% bf16 MFU | 125137 tok/s step 17995/19560 | loss 3.285173 (-0.08z)| norm 0.2135 (-0.03z)| lr 1.40e-05 | 4179.54 ms | 32.3% bf16 MFU | 125152 tok/s step 17996/19560 | loss 3.302635 (+0.47z)| norm 0.2091 (-0.60z)| lr 1.40e-05 | 4187.47 ms | 32.2% bf16 MFU | 125155 tok/s step 17997/19560 | loss 3.307741 (+0.63z)| norm 0.2067 (-0.91z)| lr 1.39e-05 | 4178.20 ms | 32.3% bf16 MFU | 125171 tok/s step 17998/19560 | loss 3.311290 (+0.73z)| norm 0.2054 (-1.09z)| lr 1.39e-05 | 4178.74 ms | 32.3% bf16 MFU | 125186 tok/s step 17999/19560 | loss 3.258940 (-0.92z)| norm 0.2134 (-0.05z)| lr 1.39e-05 | 4169.64 ms | 32.4% bf16 MFU | 125213 tok/s step 18000/19560 | loss 3.287024 (-0.05z)| norm 0.2088 (-0.64z)| lr 1.39e-05 | 4176.96 ms | 32.3% bf16 MFU | 125229 tok/s val loss 3.254262 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating 
HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 
820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3078/10042 = 0.306513 step 18001/19560 | loss 3.288778 (+0.02z)| norm 0.2159 (+0.29z)| lr 1.39e-05 | 4223.02 ms | 32.0% bf16 MFU | 125175 tok/s step 18002/19560 | loss 3.344288 (+1.76z)| norm 0.2148 (+0.14z)| lr 1.38e-05 | 4172.81 ms | 32.4% bf16 MFU | 125198 tok/s step 18003/19560 | loss 3.380899 (+2.81z)| norm 0.2304 (+2.13z)| lr 1.38e-05 | 4169.21 ms | 32.4% bf16 MFU | 125226 tok/s step 18004/19560 | loss 3.356146 (+2.01z)| norm 0.2278 (+1.81z)| lr 1.38e-05 | 4170.49 ms | 32.4% bf16 MFU | 125250 tok/s step 18005/19560 | loss 3.254139 (-1.07z)| norm 0.2126 (-0.17z)| lr 1.38e-05 | 4182.76 ms | 32.3% bf16 MFU | 
125255 tok/s step 18006/19560 | loss 3.293661 (+0.13z)| norm 0.2101 (-0.48z)| lr 1.38e-05 | 4225.51 ms | 32.0% bf16 MFU | 125196 tok/s step 18007/19560 | loss 3.268155 (-0.63z)| norm 0.2133 (-0.07z)| lr 1.38e-05 | 4177.12 ms | 32.3% bf16 MFU | 125212 tok/s step 18008/19560 | loss 3.285871 (-0.10z)| norm 0.2159 (+0.26z)| lr 1.37e-05 | 4184.28 ms | 32.3% bf16 MFU | 125216 tok/s step 18009/19560 | loss 3.360825 (+2.13z)| norm 0.1976 (-2.09z)| lr 1.37e-05 | 4337.29 ms | 31.1% bf16 MFU | 125000 tok/s step 18010/19560 | loss 3.231478 (-1.70z)| norm 0.2049 (-1.12z)| lr 1.37e-05 | 4167.22 ms | 32.4% bf16 MFU | 125040 tok/s step 18011/19560 | loss 3.255077 (-0.99z)| norm 0.2299 (+2.10z)| lr 1.37e-05 | 4250.33 ms | 31.8% bf16 MFU | 124956 tok/s step 18012/19560 | loss 3.296133 (+0.22z)| norm 0.2169 (+0.43z)| lr 1.37e-05 | 4171.08 ms | 32.4% bf16 MFU | 124993 tok/s step 18013/19560 | loss 3.307699 (+0.57z)| norm 0.2187 (+0.66z)| lr 1.37e-05 | 4169.34 ms | 32.4% bf16 MFU | 125031 tok/s step 18014/19560 | loss 3.267659 (-0.61z)| norm 0.2047 (-1.12z)| lr 1.36e-05 | 4209.46 ms | 32.1% bf16 MFU | 125007 tok/s step 18015/19560 | loss 3.303582 (+0.47z)| norm 0.2058 (-0.99z)| lr 1.36e-05 | 4175.87 ms | 32.3% bf16 MFU | 125034 tok/s step 18016/19560 | loss 3.258213 (-0.89z)| norm 0.2142 (+0.09z)| lr 1.36e-05 | 4166.70 ms | 32.4% bf16 MFU | 125074 tok/s step 18017/19560 | loss 3.223655 (-1.88z)| norm 0.2409 (+3.37z)| lr 1.36e-05 | 4165.26 ms | 32.4% bf16 MFU | 125113 tok/s step 18018/19560 | loss 3.318285 (+0.91z)| norm 0.2147 (+0.13z)| lr 1.36e-05 | 4165.16 ms | 32.4% bf16 MFU | 125152 tok/s step 18019/19560 | loss 3.355175 (+1.96z)| norm 0.2079 (-0.71z)| lr 1.35e-05 | 4158.30 ms | 32.5% bf16 MFU | 125198 tok/s step 18020/19560 | loss 3.276804 (-0.32z)| norm 0.2330 (+2.33z)| lr 1.35e-05 | 4164.85 ms | 32.4% bf16 MFU | 125232 tok/s step 18021/19560 | loss 3.272662 (-0.44z)| norm 0.2019 (-1.43z)| lr 1.35e-05 | 4155.38 ms | 32.5% bf16 MFU | 125279 tok/s step 18022/19560 | loss 3.312814 
(+0.72z)| norm 0.2167 (+0.34z)| lr 1.35e-05 | 4172.53 ms | 32.4% bf16 MFU | 125298 tok/s step 18023/19560 | loss 3.288619 (+0.02z)| norm 0.2248 (+1.30z)| lr 1.35e-05 | 4295.71 ms | 31.4% bf16 MFU | 125135 tok/s step 18024/19560 | loss 3.272764 (-0.43z)| norm 0.2105 (-0.40z)| lr 1.35e-05 | 4163.03 ms | 32.4% bf16 MFU | 125176 tok/s step 18025/19560 | loss 3.360093 (+2.13z)| norm 0.2013 (-1.49z)| lr 1.34e-05 | 4169.10 ms | 32.4% bf16 MFU | 125205 tok/s step 18026/19560 | loss 3.270471 (-0.51z)| norm 0.2069 (-0.80z)| lr 1.34e-05 | 4160.80 ms | 32.4% bf16 MFU | 125245 tok/s step 18027/19560 | loss 3.264822 (-0.67z)| norm 0.2150 (+0.15z)| lr 1.34e-05 | 4173.93 ms | 32.3% bf16 MFU | 125263 tok/s step 18028/19560 | loss 3.368985 (+2.34z)| norm 0.2242 (+1.23z)| lr 1.34e-05 | 4177.53 ms | 32.3% bf16 MFU | 125275 tok/s step 18029/19560 | loss 3.391037 (+2.86z)| norm 0.2174 (+0.42z)| lr 1.34e-05 | 4165.74 ms | 32.4% bf16 MFU | 125304 tok/s step 18030/19560 | loss 3.270573 (-0.50z)| norm 0.2207 (+0.86z)| lr 1.34e-05 | 4172.43 ms | 32.4% bf16 MFU | 125322 tok/s step 18031/19560 | loss 3.249853 (-1.07z)| norm 0.2019 (-1.43z)| lr 1.33e-05 | 4847.91 ms | 27.9% bf16 MFU | 124463 tok/s step 18032/19560 | loss 3.293711 (+0.15z)| norm 0.2490 (+4.01z)| lr 1.33e-05 | 4450.14 ms | 30.3% bf16 MFU | 124130 tok/s step 18033/19560 | loss 3.279278 (-0.25z)| norm 0.2109 (-0.36z)| lr 1.33e-05 | 4528.51 ms | 29.8% bf16 MFU | 123713 tok/s step 18034/19560 | loss 3.313942 (+0.72z)| norm 0.2173 (+0.38z)| lr 1.33e-05 | 4372.38 ms | 30.9% bf16 MFU | 123522 tok/s step 18035/19560 | loss 3.260417 (-0.78z)| norm 0.2129 (-0.14z)| lr 1.33e-05 | 4717.34 ms | 28.6% bf16 MFU | 122903 tok/s step 18036/19560 | loss 3.306345 (+0.51z)| norm 0.2114 (-0.32z)| lr 1.33e-05 | 4334.57 ms | 31.1% bf16 MFU | 122806 tok/s step 18037/19560 | loss 3.288059 (-0.01z)| norm 0.2135 (-0.07z)| lr 1.32e-05 | 4306.47 ms | 31.4% bf16 MFU | 122753 tok/s step 18038/19560 | loss 3.262176 (-0.73z)| norm 0.2296 (+1.77z)| lr 1.32e-05 | 
4384.32 ms | 30.8% bf16 MFU | 122594 tok/s step 18039/19560 | loss 3.268493 (-0.54z)| norm 0.2209 (+0.77z)| lr 1.32e-05 | 4436.61 ms | 30.4% bf16 MFU | 122373 tok/s step 18040/19560 | loss 3.334445 (+1.34z)| norm 0.2066 (-0.88z)| lr 1.32e-05 | 4183.40 ms | 32.3% bf16 MFU | 122521 tok/s step 18041/19560 | loss 3.298635 (+0.31z)| norm 0.2093 (-0.56z)| lr 1.32e-05 | 4523.95 ms | 29.8% bf16 MFU | 122189 tok/s step 18042/19560 | loss 3.291573 (+0.11z)| norm 0.2134 (-0.09z)| lr 1.31e-05 | 4208.23 ms | 32.1% bf16 MFU | 122309 tok/s step 18043/19560 | loss 3.287594 (-0.00z)| norm 0.2213 (+0.84z)| lr 1.31e-05 | 4278.15 ms | 31.6% bf16 MFU | 122321 tok/s step 18044/19560 | loss 3.293391 (+0.17z)| norm 0.2046 (-1.11z)| lr 1.31e-05 | 4377.51 ms | 30.8% bf16 MFU | 122194 tok/s step 18045/19560 | loss 3.236109 (-1.44z)| norm 0.2141 (+0.00z)| lr 1.31e-05 | 4347.54 ms | 31.1% bf16 MFU | 122114 tok/s step 18046/19560 | loss 3.229197 (-1.62z)| norm 0.2104 (-0.43z)| lr 1.31e-05 | 4201.32 ms | 32.1% bf16 MFU | 122248 tok/s step 18047/19560 | loss 3.260136 (-0.74z)| norm 0.2014 (-1.49z)| lr 1.31e-05 | 4273.51 ms | 31.6% bf16 MFU | 122269 tok/s step 18048/19560 | loss 3.298295 (+0.33z)| norm 0.2360 (+2.48z)| lr 1.30e-05 | 4241.51 ms | 31.8% bf16 MFU | 122336 tok/s step 18049/19560 | loss 3.386748 (+2.71z)| norm 0.2418 (+3.03z)| lr 1.30e-05 | 4171.20 ms | 32.4% bf16 MFU | 122504 tok/s step 18050/19560 | loss 3.276038 (-0.31z)| norm 0.2074 (-0.79z)| lr 1.30e-05 | 4251.39 ms | 31.8% bf16 MFU | 122545 tok/s step 18051/19560 | loss 3.320706 (+0.90z)| norm 0.2723 (+5.54z)| lr 1.30e-05 | 4233.18 ms | 31.9% bf16 MFU | 122610 tok/s step 18052/19560 | loss 3.323542 (+0.96z)| norm 0.2178 (+0.26z)| lr 1.30e-05 | 4488.20 ms | 30.1% bf16 MFU | 122321 tok/s step 18053/19560 | loss 3.288764 (+0.03z)| norm 0.2082 (-0.67z)| lr 1.30e-05 | 4166.04 ms | 32.4% bf16 MFU | 122497 tok/s step 18054/19560 | loss 3.340301 (+1.41z)| norm 0.2059 (-0.89z)| lr 1.29e-05 | 4181.26 ms | 32.3% bf16 MFU | 122642 tok/s step 
18055/19560 | loss 3.280735 (-0.19z)| norm 0.2195 (+0.42z)| lr 1.29e-05 | 4273.12 ms | 31.6% bf16 MFU | 122644 tok/s step 18056/19560 | loss 3.288049 (+0.01z)| norm 0.2165 (+0.12z)| lr 1.29e-05 | 4166.18 ms | 32.4% bf16 MFU | 122804 tok/s step 18057/19560 | loss 3.267757 (-0.53z)| norm 0.2248 (+0.92z)| lr 1.29e-05 | 4308.17 ms | 31.3% bf16 MFU | 122749 tok/s step 18058/19560 | loss 3.377903 (+2.42z)| norm 0.2119 (-0.33z)| lr 1.29e-05 | 4165.58 ms | 32.4% bf16 MFU | 122904 tok/s step 18059/19560 | loss 3.295964 (+0.21z)| norm 0.2111 (-0.40z)| lr 1.29e-05 | 4176.79 ms | 32.3% bf16 MFU | 123035 tok/s step 18060/19560 | loss 3.327446 (+1.05z)| norm 0.2200 (+0.45z)| lr 1.28e-05 | 4170.57 ms | 32.4% bf16 MFU | 123169 tok/s step 18061/19560 | loss 3.351591 (+1.66z)| norm 0.2189 (+0.34z)| lr 1.28e-05 | 4344.71 ms | 31.1% bf16 MFU | 123044 tok/s step 18062/19560 | loss 3.299387 (+0.28z)| norm 0.2229 (+0.73z)| lr 1.28e-05 | 4163.04 ms | 32.4% bf16 MFU | 123189 tok/s step 18063/19560 | loss 3.281451 (-0.20z)| norm 0.2322 (+1.61z)| lr 1.28e-05 | 4169.15 ms | 32.4% bf16 MFU | 123317 tok/s step 18064/19560 | loss 3.283991 (-0.13z)| norm 0.2011 (-1.37z)| lr 1.28e-05 | 4173.27 ms | 32.4% bf16 MFU | 123433 tok/s step 18065/19560 | loss 3.266299 (-0.59z)| norm 0.2155 (+0.02z)| lr 1.28e-05 | 4330.44 ms | 31.2% bf16 MFU | 123315 tok/s step 18066/19560 | loss 3.382416 (+2.43z)| norm 0.2444 (+2.70z)| lr 1.27e-05 | 4177.00 ms | 32.3% bf16 MFU | 123425 tok/s step 18067/19560 | loss 3.224615 (-1.64z)| norm 0.2393 (+2.17z)| lr 1.27e-05 | 4250.74 ms | 31.8% bf16 MFU | 123421 tok/s step 18068/19560 | loss 3.237155 (-1.31z)| norm 0.2212 (+0.49z)| lr 1.27e-05 | 4222.92 ms | 32.0% bf16 MFU | 123457 tok/s step 18069/19560 | loss 3.283482 (-0.13z)| norm 0.2227 (+0.63z)| lr 1.27e-05 | 4191.21 ms | 32.2% bf16 MFU | 123539 tok/s step 18070/19560 | loss 3.347302 (+1.49z)| norm 0.2305 (+1.33z)| lr 1.27e-05 | 4334.39 ms | 31.2% bf16 MFU | 123410 tok/s step 18071/19560 | loss 3.327293 (+0.97z)| norm 
0.2273 (+1.02z)| lr 1.27e-05 | 4217.91 ms | 32.0% bf16 MFU | 123455 tok/s step 18072/19560 | loss 3.291208 (+0.05z)| norm 0.2188 (+0.24z)| lr 1.26e-05 | 4222.06 ms | 32.0% bf16 MFU | 123491 tok/s step 18073/19560 | loss 3.303660 (+0.36z)| norm 0.2319 (+1.41z)| lr 1.26e-05 | 4266.32 ms | 31.6% bf16 MFU | 123461 tok/s step 18074/19560 | loss 3.250052 (-0.99z)| norm 0.2258 (+0.86z)| lr 1.26e-05 | 4418.72 ms | 30.6% bf16 MFU | 123220 tok/s step 18075/19560 | loss 3.298244 (+0.22z)| norm 0.2085 (-0.68z)| lr 1.26e-05 | 4248.35 ms | 31.8% bf16 MFU | 123230 tok/s step 18076/19560 | loss 3.340549 (+1.29z)| norm 0.2082 (-0.71z)| lr 1.26e-05 | 4319.84 ms | 31.3% bf16 MFU | 123137 tok/s step 18077/19560 | loss 3.281335 (-0.22z)| norm 0.2015 (-1.30z)| lr 1.26e-05 | 4174.15 ms | 32.3% bf16 MFU | 123260 tok/s step 18078/19560 | loss 3.296928 (+0.18z)| norm 0.2158 (-0.03z)| lr 1.25e-05 | 4240.44 ms | 31.8% bf16 MFU | 123279 tok/s step 18079/19560 | loss 3.293934 (+0.09z)| norm 0.2092 (-0.62z)| lr 1.25e-05 | 4182.22 ms | 32.3% bf16 MFU | 123383 tok/s step 18080/19560 | loss 3.286042 (-0.11z)| norm 0.2094 (-0.59z)| lr 1.25e-05 | 4281.61 ms | 31.5% bf16 MFU | 123337 tok/s step 18081/19560 | loss 3.286360 (-0.09z)| norm 0.2066 (-0.83z)| lr 1.25e-05 | 4465.93 ms | 30.2% bf16 MFU | 123040 tok/s step 18082/19560 | loss 3.302267 (+0.31z)| norm 0.2163 (+0.05z)| lr 1.25e-05 | 4245.53 ms | 31.8% bf16 MFU | 123062 tok/s step 18083/19560 | loss 3.265231 (-0.66z)| norm 0.2091 (-0.60z)| lr 1.25e-05 | 4207.15 ms | 32.1% bf16 MFU | 123140 tok/s step 18084/19560 | loss 3.281147 (-0.24z)| norm 0.2020 (-1.23z)| lr 1.24e-05 | 4186.39 ms | 32.3% bf16 MFU | 123245 tok/s step 18085/19560 | loss 3.352157 (+1.61z)| norm 0.2128 (-0.26z)| lr 1.24e-05 | 4276.74 ms | 31.6% bf16 MFU | 123212 tok/s step 18086/19560 | loss 3.286142 (-0.12z)| norm 0.2160 (+0.02z)| lr 1.24e-05 | 4175.47 ms | 32.3% bf16 MFU | 123330 tok/s step 18087/19560 | loss 3.277629 (-0.36z)| norm 0.2065 (-0.82z)| lr 1.24e-05 | 4177.97 ms | 
32.3% bf16 MFU | 123438 tok/s step 18088/19560 | loss 3.301745 (+0.29z)| norm 0.2074 (-0.74z)| lr 1.24e-05 | 4187.49 ms | 32.2% bf16 MFU | 123526 tok/s step 18089/19560 | loss 3.281953 (-0.25z)| norm 0.2069 (-0.78z)| lr 1.24e-05 | 4307.23 ms | 31.3% bf16 MFU | 123436 tok/s step 18090/19560 | loss 3.278237 (-0.35z)| norm 0.2132 (-0.21z)| lr 1.23e-05 | 4174.96 ms | 32.3% bf16 MFU | 123543 tok/s step 18091/19560 | loss 3.343652 (+1.39z)| norm 0.2218 (+0.54z)| lr 1.23e-05 | 4173.91 ms | 32.3% bf16 MFU | 123646 tok/s step 18092/19560 | loss 3.316513 (+0.66z)| norm 0.2099 (-0.51z)| lr 1.23e-05 | 4187.32 ms | 32.2% bf16 MFU | 123724 tok/s step 18093/19560 | loss 3.327582 (+0.95z)| norm 0.2626 (+3.90z)| lr 1.23e-05 | 4175.21 ms | 32.3% bf16 MFU | 123817 tok/s step 18094/19560 | loss 3.326564 (+0.91z)| norm 0.2109 (-0.42z)| lr 1.23e-05 | 4209.41 ms | 32.1% bf16 MFU | 123854 tok/s step 18095/19560 | loss 3.265045 (-0.73z)| norm 0.2095 (-0.54z)| lr 1.23e-05 | 4189.20 ms | 32.2% bf16 MFU | 123918 tok/s step 18096/19560 | loss 3.242301 (-1.33z)| norm 0.2057 (-0.86z)| lr 1.22e-05 | 4173.12 ms | 32.4% bf16 MFU | 124004 tok/s step 18097/19560 | loss 3.266894 (-0.68z)| norm 0.2170 (+0.12z)| lr 1.22e-05 | 4177.46 ms | 32.3% bf16 MFU | 124079 tok/s step 18098/19560 | loss 3.224589 (-1.81z)| norm 0.2151 (-0.04z)| lr 1.22e-05 | 4166.77 ms | 32.4% bf16 MFU | 124167 tok/s step 18099/19560 | loss 3.331169 (+1.02z)| norm 0.2087 (-0.59z)| lr 1.22e-05 | 4196.52 ms | 32.2% bf16 MFU | 124205 tok/s step 18100/19560 | loss 3.261164 (-0.84z)| norm 0.2042 (-0.97z)| lr 1.22e-05 | 4241.27 ms | 31.8% bf16 MFU | 124175 tok/s step 18101/19560 | loss 3.234613 (-1.53z)| norm 0.2060 (-0.81z)| lr 1.22e-05 | 4175.80 ms | 32.3% bf16 MFU | 124244 tok/s step 18102/19560 | loss 3.280681 (-0.30z)| norm 0.2119 (-0.31z)| lr 1.21e-05 | 4266.40 ms | 31.6% bf16 MFU | 124177 tok/s step 18103/19560 | loss 3.284627 (-0.20z)| norm 0.2288 (+1.12z)| lr 1.21e-05 | 4323.26 ms | 31.2% bf16 MFU | 124031 tok/s step 18104/19560 
| loss 3.335614 (+1.15z)| norm 0.2025 (-1.11z)| lr 1.21e-05 | 4292.95 ms | 31.5% bf16 MFU | 123936 tok/s step 18105/19560 | loss 3.347089 (+1.43z)| norm 0.2227 (+0.60z)| lr 1.21e-05 | 4181.67 ms | 32.3% bf16 MFU | 124008 tok/s step 18106/19560 | loss 3.301452 (+0.21z)| norm 0.2029 (-1.09z)| lr 1.21e-05 | 4188.59 ms | 32.2% bf16 MFU | 124066 tok/s step 18107/19560 | loss 3.477509 (+4.46z)| norm 0.2125 (-0.27z)| lr 1.21e-05 | 4576.96 ms | 29.5% bf16 MFU | 123590 tok/s step 18108/19560 | loss 3.296740 (+0.04z)| norm 0.2096 (-0.52z)| lr 1.20e-05 | 4185.27 ms | 32.3% bf16 MFU | 123674 tok/s step 18109/19560 | loss 3.439801 (+3.36z)| norm 0.2232 (+0.63z)| lr 1.20e-05 | 4234.01 ms | 31.9% bf16 MFU | 123682 tok/s step 18110/19560 | loss 3.254495 (-0.98z)| norm 0.2116 (-0.35z)| lr 1.20e-05 | 4178.11 ms | 32.3% bf16 MFU | 123772 tok/s step 18111/19560 | loss 3.243204 (-1.23z)| norm 0.2153 (-0.04z)| lr 1.20e-05 | 4176.96 ms | 32.3% bf16 MFU | 123860 tok/s step 18112/19560 | loss 3.268920 (-0.63z)| norm 0.2125 (-0.28z)| lr 1.20e-05 | 4181.07 ms | 32.3% bf16 MFU | 123936 tok/s step 18113/19560 | loss 3.326441 (+0.71z)| norm 0.2342 (+1.55z)| lr 1.20e-05 | 4174.08 ms | 32.3% bf16 MFU | 124020 tok/s step 18114/19560 | loss 3.248064 (-1.13z)| norm 0.2023 (-1.13z)| lr 1.19e-05 | 4171.43 ms | 32.4% bf16 MFU | 124103 tok/s step 18115/19560 | loss 3.254753 (-0.97z)| norm 0.2140 (-0.15z)| lr 1.19e-05 | 4396.77 ms | 30.7% bf16 MFU | 123860 tok/s step 18116/19560 | loss 3.281404 (-0.34z)| norm 0.2167 (+0.07z)| lr 1.19e-05 | 4212.57 ms | 32.1% bf16 MFU | 123890 tok/s step 18117/19560 | loss 3.240389 (-1.28z)| norm 0.2117 (-0.35z)| lr 1.19e-05 | 4175.49 ms | 32.3% bf16 MFU | 123974 tok/s step 18118/19560 | loss 3.392439 (+2.20z)| norm 0.2099 (-0.51z)| lr 1.19e-05 | 4205.55 ms | 32.1% bf16 MFU | 124008 tok/s step 18119/19560 | loss 3.345138 (+1.10z)| norm 0.2192 (+0.28z)| lr 1.19e-05 | 4277.99 ms | 31.6% bf16 MFU | 123936 tok/s step 18120/19560 | loss 3.263944 (-0.76z)| norm 0.2059 (-0.85z)| 
lr 1.18e-05 | 4179.37 ms | 32.3% bf16 MFU | 124011 tok/s step 18121/19560 | loss 3.330043 (+0.76z)| norm 0.2252 (+0.78z)| lr 1.18e-05 | 4177.69 ms | 32.3% bf16 MFU | 124085 tok/s step 18122/19560 | loss 3.310585 (+0.30z)| norm 0.2186 (+0.22z)| lr 1.18e-05 | 4177.15 ms | 32.3% bf16 MFU | 124157 tok/s step 18123/19560 | loss 3.265134 (-0.74z)| norm 0.2113 (-0.40z)| lr 1.18e-05 | 4207.30 ms | 32.1% bf16 MFU | 124180 tok/s step 18124/19560 | loss 3.291098 (-0.14z)| norm 0.2115 (-0.38z)| lr 1.18e-05 | 4176.23 ms | 32.3% bf16 MFU | 124248 tok/s step 18125/19560 | loss 3.309006 (+0.27z)| norm 0.2066 (-0.80z)| lr 1.18e-05 | 4246.63 ms | 31.8% bf16 MFU | 124208 tok/s step 18126/19560 | loss 3.301508 (+0.10z)| norm 0.2076 (-0.71z)| lr 1.17e-05 | 4181.72 ms | 32.3% bf16 MFU | 124267 tok/s step 18127/19560 | loss 3.275829 (-0.50z)| norm 0.2102 (-0.50z)| lr 1.17e-05 | 4170.17 ms | 32.4% bf16 MFU | 124340 tok/s step 18128/19560 | loss 3.262629 (-0.80z)| norm 0.2242 (+0.69z)| lr 1.17e-05 | 4168.16 ms | 32.4% bf16 MFU | 124412 tok/s step 18129/19560 | loss 3.329812 (+0.74z)| norm 0.2133 (-0.24z)| lr 1.17e-05 | 4189.52 ms | 32.2% bf16 MFU | 124448 tok/s step 18130/19560 | loss 3.276759 (-0.47z)| norm 0.2213 (+0.43z)| lr 1.17e-05 | 5005.96 ms | 27.0% bf16 MFU | 123463 tok/s step 18131/19560 | loss 3.242281 (-1.25z)| norm 0.2113 (-0.40z)| lr 1.17e-05 | 4885.58 ms | 27.6% bf16 MFU | 122655 tok/s step 18132/19560 | loss 3.270478 (-0.58z)| norm 0.2033 (-1.07z)| lr 1.16e-05 | 4169.02 ms | 32.4% bf16 MFU | 122810 tok/s step 18133/19560 | loss 3.265682 (-0.69z)| norm 0.2060 (-0.83z)| lr 1.16e-05 | 4156.37 ms | 32.5% bf16 MFU | 122977 tok/s step 18134/19560 | loss 3.278948 (-0.38z)| norm 0.2180 (+0.18z)| lr 1.16e-05 | 4176.04 ms | 32.3% bf16 MFU | 123105 tok/s step 18135/19560 | loss 3.325800 (+0.71z)| norm 0.2195 (+0.30z)| lr 1.16e-05 | 13542.76 ms | 10.0% bf16 MFU | 118886 tok/s step 18136/19560 | loss 3.310323 (+0.34z)| norm 0.2074 (-0.71z)| lr 1.16e-05 | 9127.33 ms | 14.8% bf16 MFU | 
115814 tok/s step 18137/19560 | loss 3.354971 (+1.39z)| norm 0.2260 (+0.85z)| lr 1.16e-05 | 4697.18 ms | 28.7% bf16 MFU | 115604 tok/s step 18138/19560 | loss 3.252063 (-1.04z)| norm 0.2610 (+3.61z)| lr 1.15e-05 | 4599.95 ms | 29.4% bf16 MFU | 115522 tok/s step 18139/19560 | loss 3.293652 (-0.06z)| norm 0.2043 (-0.97z)| lr 1.15e-05 | 4153.06 ms | 32.5% bf16 MFU | 116058 tok/s step 18140/19560 | loss 3.279246 (-0.40z)| norm 0.2373 (+1.67z)| lr 1.15e-05 | 4147.12 ms | 32.6% bf16 MFU | 116577 tok/s step 18141/19560 | loss 3.249394 (-1.09z)| norm 0.2197 (+0.26z)| lr 1.15e-05 | 4142.75 ms | 32.6% bf16 MFU | 117075 tok/s step 18142/19560 | loss 3.280101 (-0.37z)| norm 0.2254 (+0.71z)| lr 1.15e-05 | 4137.76 ms | 32.6% bf16 MFU | 117557 tok/s step 18143/19560 | loss 3.319423 (+0.56z)| norm 0.2050 (-0.93z)| lr 1.15e-05 | 4226.17 ms | 31.9% bf16 MFU | 117882 tok/s step 18144/19560 | loss 3.291176 (-0.12z)| norm 0.2093 (-0.58z)| lr 1.15e-05 | 4276.79 ms | 31.6% bf16 MFU | 118117 tok/s step 18145/19560 | loss 3.257710 (-0.92z)| norm 0.2069 (-0.76z)| lr 1.14e-05 | 4185.45 ms | 32.3% bf16 MFU | 118475 tok/s step 18146/19560 | loss 3.232002 (-1.51z)| norm 0.2055 (-0.87z)| lr 1.14e-05 | 4144.47 ms | 32.6% bf16 MFU | 118876 tok/s step 18147/19560 | loss 3.273329 (-0.52z)| norm 0.2117 (-0.37z)| lr 1.14e-05 | 4200.02 ms | 32.1% bf16 MFU | 119174 tok/s step 18148/19560 | loss 3.394408 (+2.30z)| norm 0.2149 (-0.10z)| lr 1.14e-05 | 4132.76 ms | 32.7% bf16 MFU | 119558 tok/s step 18149/19560 | loss 3.276429 (-0.46z)| norm 0.2210 (+0.39z)| lr 1.14e-05 | 4248.29 ms | 31.8% bf16 MFU | 119751 tok/s step 18150/19560 | loss 3.279183 (-0.39z)| norm 0.2063 (-0.81z)| lr 1.14e-05 | 4155.81 ms | 32.5% bf16 MFU | 120071 tok/s step 18151/19560 | loss 3.286962 (-0.20z)| norm 0.2050 (-0.90z)| lr 1.13e-05 | 4166.85 ms | 32.4% bf16 MFU | 120359 tok/s step 18152/19560 | loss 3.323883 (+0.65z)| norm 0.2046 (-0.93z)| lr 1.13e-05 | 4176.40 ms | 32.3% bf16 MFU | 120618 tok/s step 18153/19560 | loss 3.312251 
(+0.39z)| norm 0.2147 (-0.12z)| lr 1.13e-05 | 4175.62 ms | 32.3% bf16 MFU | 120865 tok/s step 18154/19560 | loss 3.267982 (-0.65z)| norm 0.2184 (+0.18z)| lr 1.13e-05 | 4182.48 ms | 32.3% bf16 MFU | 121089 tok/s step 18155/19560 | loss 3.261659 (-0.80z)| norm 0.2067 (-0.77z)| lr 1.13e-05 | 4171.55 ms | 32.4% bf16 MFU | 121319 tok/s step 18156/19560 | loss 3.283030 (-0.29z)| norm 0.2090 (-0.57z)| lr 1.13e-05 | 4207.44 ms | 32.1% bf16 MFU | 121483 tok/s step 18157/19560 | loss 3.285928 (-0.20z)| norm 0.2145 (-0.12z)| lr 1.12e-05 | 4196.53 ms | 32.2% bf16 MFU | 121656 tok/s step 18158/19560 | loss 3.230150 (-1.54z)| norm 0.2134 (-0.20z)| lr 1.12e-05 | 4182.87 ms | 32.3% bf16 MFU | 121840 tok/s step 18159/19560 | loss 3.316817 (+0.54z)| norm 0.2143 (-0.14z)| lr 1.12e-05 | 4189.35 ms | 32.2% bf16 MFU | 122006 tok/s step 18160/19560 | loss 3.244321 (-1.19z)| norm 0.2012 (-1.22z)| lr 1.12e-05 | 4203.31 ms | 32.1% bf16 MFU | 122142 tok/s step 18161/19560 | loss 3.343280 (+1.17z)| norm 0.2192 (+0.29z)| lr 1.12e-05 | 4187.14 ms | 32.2% bf16 MFU | 122296 tok/s step 18162/19560 | loss 3.301198 (+0.16z)| norm 0.2096 (-0.51z)| lr 1.12e-05 | 4192.54 ms | 32.2% bf16 MFU | 122433 tok/s step 18163/19560 | loss 3.273171 (-0.51z)| norm 0.2067 (-0.75z)| lr 1.11e-05 | 4207.50 ms | 32.1% bf16 MFU | 122542 tok/s step 18164/19560 | loss 3.289265 (-0.12z)| norm 0.2079 (-0.64z)| lr 1.11e-05 | 4210.80 ms | 32.1% bf16 MFU | 122641 tok/s step 18165/19560 | loss 3.294516 (+0.00z)| norm 0.2224 (+0.57z)| lr 1.11e-05 | 4198.61 ms | 32.2% bf16 MFU | 122752 tok/s step 18166/19560 | loss 3.265398 (-0.70z)| norm 0.2135 (-0.17z)| lr 1.11e-05 | 4187.05 ms | 32.2% bf16 MFU | 122875 tok/s step 18167/19560 | loss 3.356529 (+1.46z)| norm 0.2140 (-0.12z)| lr 1.11e-05 | 4198.36 ms | 32.2% bf16 MFU | 122976 tok/s step 18168/19560 | loss 3.344127 (+1.16z)| norm 0.2136 (-0.16z)| lr 1.11e-05 | 4198.66 ms | 32.2% bf16 MFU | 123070 tok/s step 18169/19560 | loss 3.210623 (-1.97z)| norm 0.2173 (+0.14z)| lr 1.11e-05 | 
4182.68 ms | 32.3% bf16 MFU | 123184 tok/s step 18170/19560 | loss 3.347657 (+1.23z)| norm 0.2169 (+0.11z)| lr 1.10e-05 | 4194.08 ms | 32.2% bf16 MFU | 123275 tok/s step 18171/19560 | loss 3.335104 (+0.92z)| norm 0.2092 (-0.53z)| lr 1.10e-05 | 4200.26 ms | 32.1% bf16 MFU | 123353 tok/s step 18172/19560 | loss 3.271628 (-0.55z)| norm 0.2134 (-0.18z)| lr 1.10e-05 | 4198.51 ms | 32.2% bf16 MFU | 123429 tok/s step 18173/19560 | loss 3.322973 (+0.63z)| norm 0.2042 (-0.95z)| lr 1.10e-05 | 4187.98 ms | 32.2% bf16 MFU | 123517 tok/s step 18174/19560 | loss 3.252674 (-1.02z)| norm 0.2076 (-0.67z)| lr 1.10e-05 | 4191.21 ms | 32.2% bf16 MFU | 123595 tok/s step 18175/19560 | loss 3.272694 (-0.55z)| norm 0.2254 (+0.83z)| lr 1.10e-05 | 4199.51 ms | 32.2% bf16 MFU | 123658 tok/s step 18176/19560 | loss 3.270850 (-0.59z)| norm 0.2039 (-0.98z)| lr 1.09e-05 | 4191.20 ms | 32.2% bf16 MFU | 123730 tok/s step 18177/19560 | loss 3.282448 (-0.30z)| norm 0.2052 (-0.86z)| lr 1.09e-05 | 4192.76 ms | 32.2% bf16 MFU | 123796 tok/s step 18178/19560 | loss 3.319295 (+0.57z)| norm 0.2113 (-0.33z)| lr 1.09e-05 | 4181.66 ms | 32.3% bf16 MFU | 123875 tok/s step 18179/19560 | loss 3.246735 (-1.14z)| norm 0.2083 (-0.62z)| lr 1.09e-05 | 4213.38 ms | 32.0% bf16 MFU | 123903 tok/s step 18180/19560 | loss 3.289274 (-0.13z)| norm 0.2080 (-0.64z)| lr 1.09e-05 | 4194.50 ms | 32.2% bf16 MFU | 123957 tok/s step 18181/19560 | loss 3.328504 (+0.80z)| norm 0.2058 (-0.85z)| lr 1.09e-05 | 4309.29 ms | 31.3% bf16 MFU | 123843 tok/s step 18182/19560 | loss 3.216395 (-1.83z)| norm 0.2054 (-0.89z)| lr 1.08e-05 | 4218.49 ms | 32.0% bf16 MFU | 123865 tok/s step 18183/19560 | loss 3.283411 (-0.25z)| norm 0.2056 (-0.86z)| lr 1.08e-05 | 4187.07 ms | 32.2% bf16 MFU | 123932 tok/s step 18184/19560 | loss 3.251409 (-0.99z)| norm 0.2067 (-0.74z)| lr 1.08e-05 | 4171.70 ms | 32.4% bf16 MFU | 124019 tok/s step 18185/19560 | loss 3.330974 (+0.86z)| norm 0.2082 (-0.58z)| lr 1.08e-05 | 4186.14 ms | 32.3% bf16 MFU | 124081 tok/s step 
18186/19560 | loss 3.285953 (-0.18z)| norm 0.2025 (-1.13z)| lr 1.08e-05 | 4184.16 ms | 32.3% bf16 MFU | 124142 tok/s step 18187/19560 | loss 3.299680 (+0.15z)| norm 0.2080 (-0.59z)| lr 1.08e-05 | 4176.74 ms | 32.3% bf16 MFU | 124211 tok/s step 18188/19560 | loss 3.254294 (-0.92z)| norm 0.2041 (-0.95z)| lr 1.08e-05 | 4176.67 ms | 32.3% bf16 MFU | 124277 tok/s step 18189/19560 | loss 3.328426 (+0.85z)| norm 0.2194 (+0.52z)| lr 1.07e-05 | 4180.71 ms | 32.3% bf16 MFU | 124333 tok/s step 18190/19560 | loss 3.241606 (-1.20z)| norm 0.2128 (-0.11z)| lr 1.07e-05 | 4171.24 ms | 32.4% bf16 MFU | 124401 tok/s step 18191/19560 | loss 3.294457 (+0.05z)| norm 0.2157 (+0.18z)| lr 1.07e-05 | 4178.64 ms | 32.3% bf16 MFU | 124455 tok/s step 18192/19560 | loss 3.242500 (-1.17z)| norm 0.2131 (-0.08z)| lr 1.07e-05 | 4196.97 ms | 32.2% bf16 MFU | 124478 tok/s step 18193/19560 | loss 3.237205 (-1.28z)| norm 0.1989 (-1.45z)| lr 1.07e-05 | 4184.78 ms | 32.3% bf16 MFU | 124518 tok/s step 18194/19560 | loss 3.259150 (-0.76z)| norm 0.2198 (+0.62z)| lr 1.07e-05 | 4178.46 ms | 32.3% bf16 MFU | 124566 tok/s step 18195/19560 | loss 3.290637 (-0.02z)| norm 0.2096 (-0.39z)| lr 1.06e-05 | 4175.29 ms | 32.3% bf16 MFU | 124616 tok/s step 18196/19560 | loss 3.343718 (+1.24z)| norm 0.2069 (-0.65z)| lr 1.06e-05 | 4163.67 ms | 32.4% bf16 MFU | 124681 tok/s step 18197/19560 | loss 3.318340 (+0.62z)| norm 0.2222 (+0.92z)| lr 1.06e-05 | 4416.53 ms | 30.6% bf16 MFU | 124383 tok/s step 18198/19560 | loss 3.330313 (+0.92z)| norm 0.2062 (-0.72z)| lr 1.06e-05 | 4175.27 ms | 32.3% bf16 MFU | 124442 tok/s step 18199/19560 | loss 3.285357 (-0.16z)| norm 0.2137 (+0.08z)| lr 1.06e-05 | 4203.17 ms | 32.1% bf16 MFU | 124457 tok/s step 18200/19560 | loss 3.342088 (+1.20z)| norm 0.2180 (+0.53z)| lr 1.06e-05 | 4175.62 ms | 32.3% bf16 MFU | 124512 tok/s step 18201/19560 | loss 3.311361 (+0.45z)| norm 0.2060 (-0.72z)| lr 1.06e-05 | 4188.63 ms | 32.2% bf16 MFU | 124545 tok/s step 18202/19560 | loss 3.268207 (-0.59z)| norm 
0.2108 (-0.19z)| lr 1.05e-05 | 4463.41 ms | 30.2% bf16 MFU | 124191 tok/s step 18203/19560 | loss 3.329437 (+0.88z)| norm 0.2065 (-0.65z)| lr 1.05e-05 | 4173.64 ms | 32.4% bf16 MFU | 124262 tok/s step 18204/19560 | loss 3.315355 (+0.55z)| norm 0.2059 (-0.72z)| lr 1.05e-05 | 4183.03 ms | 32.3% bf16 MFU | 124316 tok/s step 18205/19560 | loss 3.321178 (+0.68z)| norm 0.2196 (+0.74z)| lr 1.05e-05 | 4328.04 ms | 31.2% bf16 MFU | 124157 tok/s step 18206/19560 | loss 3.267788 (-0.60z)| norm 0.2044 (-0.89z)| lr 1.05e-05 | 4192.50 ms | 32.2% bf16 MFU | 124202 tok/s step 18207/19560 | loss 3.233065 (-1.42z)| norm 0.2012 (-1.22z)| lr 1.05e-05 | 4173.86 ms | 32.3% bf16 MFU | 124272 tok/s step 18208/19560 | loss 3.298158 (+0.14z)| norm 0.2308 (+1.90z)| lr 1.04e-05 | 4181.04 ms | 32.3% bf16 MFU | 124329 tok/s step 18209/19560 | loss 3.258573 (-0.80z)| norm 0.2248 (+1.25z)| lr 1.04e-05 | 4165.81 ms | 32.4% bf16 MFU | 124405 tok/s step 18210/19560 | loss 3.383353 (+2.12z)| norm 0.2221 (+0.96z)| lr 1.04e-05 | 4195.42 ms | 32.2% bf16 MFU | 124433 tok/s step 18211/19560 | loss 3.266946 (-0.61z)| norm 0.2097 (-0.34z)| lr 1.04e-05 | 4164.18 ms | 32.4% bf16 MFU | 124506 tok/s step 18212/19560 | loss 3.256619 (-0.84z)| norm 0.2104 (-0.27z)| lr 1.04e-05 | 4172.89 ms | 32.4% bf16 MFU | 124563 tok/s step 18213/19560 | loss 3.240221 (-1.21z)| norm 0.2162 (+0.33z)| lr 1.04e-05 | 4313.65 ms | 31.3% bf16 MFU | 124412 tok/s step 18214/19560 | loss 3.264731 (-0.63z)| norm 0.2028 (-1.06z)| lr 1.04e-05 | 4188.90 ms | 32.2% bf16 MFU | 124450 tok/s step 18215/19560 | loss 3.360870 (+1.59z)| norm 0.2094 (-0.37z)| lr 1.03e-05 | 4179.16 ms | 32.3% bf16 MFU | 124500 tok/s step 18216/19560 | loss 3.260262 (-0.73z)| norm 0.2061 (-0.72z)| lr 1.03e-05 | 4184.28 ms | 32.3% bf16 MFU | 124540 tok/s step 18217/19560 | loss 3.298210 (+0.14z)| norm 0.2015 (-1.19z)| lr 1.03e-05 | 4179.71 ms | 32.3% bf16 MFU | 124585 tok/s step 18218/19560 | loss 3.310701 (+0.43z)| norm 0.2013 (-1.20z)| lr 1.03e-05 | 4176.89 ms | 
32.3% bf16 MFU | 124631 tok/s step 18219/19560 | loss 3.293870 (+0.05z)| norm 0.2135 (+0.07z)| lr 1.03e-05 | 4192.89 ms | 32.2% bf16 MFU | 124652 tok/s step 18220/19560 | loss 3.305928 (+0.33z)| norm 0.2230 (+1.05z)| lr 1.03e-05 | 4176.61 ms | 32.3% bf16 MFU | 124696 tok/s step 18221/19560 | loss 3.255362 (-0.84z)| norm 0.2108 (-0.19z)| lr 1.02e-05 | 4519.86 ms | 29.9% bf16 MFU | 124261 tok/s step 18222/19560 | loss 3.196914 (-2.14z)| norm 0.2016 (-1.25z)| lr 1.02e-05 | 4562.16 ms | 29.6% bf16 MFU | 123794 tok/s step 18223/19560 | loss 3.228603 (-1.40z)| norm 0.2088 (-0.41z)| lr 1.02e-05 | 4587.44 ms | 29.4% bf16 MFU | 123319 tok/s step 18224/19560 | loss 3.212047 (-1.76z)| norm 0.2150 (+0.29z)| lr 1.02e-05 | 4187.95 ms | 32.2% bf16 MFU | 123412 tok/s step 18225/19560 | loss 3.239287 (-1.13z)| norm 0.2221 (+1.11z)| lr 1.02e-05 | 4395.83 ms | 30.7% bf16 MFU | 123205 tok/s step 18226/19560 | loss 3.245613 (-1.00z)| norm 0.2085 (-0.45z)| lr 1.02e-05 | 4157.34 ms | 32.5% bf16 MFU | 123350 tok/s step 18227/19560 | loss 3.264442 (-0.56z)| norm 0.2030 (-1.08z)| lr 1.02e-05 | 4410.51 ms | 30.6% bf16 MFU | 123126 tok/s step 18228/19560 | loss 3.235821 (-1.20z)| norm 0.2217 (+1.06z)| lr 1.01e-05 | 4417.24 ms | 30.6% bf16 MFU | 122905 tok/s step 18229/19560 | loss 3.255184 (-0.77z)| norm 0.2164 (+0.44z)| lr 1.01e-05 | 4208.80 ms | 32.1% bf16 MFU | 122988 tok/s step 18230/19560 | loss 3.272660 (-0.37z)| norm 0.2091 (-0.41z)| lr 1.01e-05 | 4633.25 ms | 29.1% bf16 MFU | 122496 tok/s step 18231/19560 | loss 3.306510 (+0.39z)| norm 0.2143 (+0.21z)| lr 1.01e-05 | 4267.78 ms | 31.6% bf16 MFU | 122514 tok/s step 18232/19560 | loss 3.270762 (-0.41z)| norm 0.2106 (-0.22z)| lr 1.01e-05 | 4158.25 ms | 32.5% bf16 MFU | 122692 tok/s step 18233/19560 | loss 3.291905 (+0.08z)| norm 0.2130 (+0.06z)| lr 1.01e-05 | 4186.88 ms | 32.2% bf16 MFU | 122819 tok/s step 18234/19560 | loss 3.252003 (-0.82z)| norm 0.2128 (+0.03z)| lr 1.00e-05 | 4176.26 ms | 32.3% bf16 MFU | 122955 tok/s step 18235/19560 
| loss 3.286653 (+0.01z)| norm 0.2026 (-1.17z)| lr 1.00e-05 | 4157.59 ms | 32.5% bf16 MFU | 123112 tok/s step 18236/19560 | loss 3.293683 (+0.18z)| norm 0.2121 (-0.05z)| lr 1.00e-05 | 4405.28 ms | 30.6% bf16 MFU | 122907 tok/s step 18237/19560 | loss 3.255232 (-0.78z)| norm 0.2115 (-0.10z)| lr 1.00e-05 | 4208.03 ms | 32.1% bf16 MFU | 122992 tok/s step 18238/19560 | loss 3.270093 (-0.39z)| norm 0.2130 (+0.07z)| lr 9.99e-06 | 4305.23 ms | 31.4% bf16 MFU | 122931 tok/s step 18239/19560 | loss 3.235392 (-1.30z)| norm 0.2084 (-0.47z)| lr 9.97e-06 | 4247.46 ms | 31.8% bf16 MFU | 122956 tok/s step 18240/19560 | loss 3.248620 (-0.95z)| norm 0.2147 (+0.28z)| lr 9.96e-06 | 4277.97 ms | 31.6% bf16 MFU | 122936 tok/s step 18241/19560 | loss 3.288027 (+0.09z)| norm 0.2088 (-0.41z)| lr 9.94e-06 | 4162.84 ms | 32.4% bf16 MFU | 123087 tok/s step 18242/19560 | loss 3.337024 (+1.35z)| norm 0.2425 (+3.50z)| lr 9.93e-06 | 4163.32 ms | 32.4% bf16 MFU | 123229 tok/s step 18243/19560 | loss 3.321605 (+0.94z)| norm 0.2132 (+0.09z)| lr 9.91e-06 | 4152.13 ms | 32.5% bf16 MFU | 123381 tok/s step 18244/19560 | loss 3.232553 (-1.37z)| norm 0.1986 (-1.58z)| lr 9.90e-06 | 4231.61 ms | 31.9% bf16 MFU | 123407 tok/s step 18245/19560 | loss 3.281805 (-0.10z)| norm 0.1991 (-1.50z)| lr 9.88e-06 | 4187.91 ms | 32.2% bf16 MFU | 123496 tok/s step 18246/19560 | loss 3.295553 (+0.29z)| norm 0.2163 (+0.46z)| lr 9.87e-06 | 4159.29 ms | 32.5% bf16 MFU | 123624 tok/s step 18247/19560 | loss 3.261601 (-0.62z)| norm 0.2087 (-0.41z)| lr 9.85e-06 | 4163.11 ms | 32.4% bf16 MFU | 123739 tok/s step 18248/19560 | loss 3.234386 (-1.34z)| norm 0.2074 (-0.55z)| lr 9.84e-06 | 4197.95 ms | 32.2% bf16 MFU | 123797 tok/s step 18249/19560 | loss 3.287008 (+0.09z)| norm 0.2271 (+1.70z)| lr 9.82e-06 | 4162.00 ms | 32.4% bf16 MFU | 123906 tok/s step 18250/19560 | loss 3.318751 (+0.95z)| norm 0.2180 (+0.66z)| lr 9.81e-06 | 4303.71 ms | 31.4% bf16 MFU | 123801 tok/s val loss 3.253107 evaluating HellaSwag: 0/1256 evaluating 
HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 
evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3050/10042 = 0.303724 step 18251/19560 | loss 3.253659 (-0.81z)| norm 0.2181 (+0.67z)| lr 9.79e-06 | 
4162.32 ms | 32.4% bf16 MFU | 123909 tok/s step 18252/19560 | loss 3.278895 (-0.13z)| norm 0.2103 (-0.22z)| lr 9.78e-06 | 4207.65 ms | 32.1% bf16 MFU | 123944 tok/s step 18253/19560 | loss 3.355964 (+1.92z)| norm 0.2110 (-0.14z)| lr 9.76e-06 | 4179.95 ms | 32.3% bf16 MFU | 124018 tok/s step 18254/19560 | loss 3.205635 (-2.04z)| norm 0.2141 (+0.20z)| lr 9.75e-06 | 4161.50 ms | 32.4% bf16 MFU | 124117 tok/s step 18255/19560 | loss 3.250342 (-0.86z)| norm 0.2140 (+0.19z)| lr 9.73e-06 | 4239.80 ms | 31.8% bf16 MFU | 124094 tok/s step 18256/19560 | loss 3.240583 (-1.10z)| norm 0.2028 (-1.08z)| lr 9.72e-06 | 4217.46 ms | 32.0% bf16 MFU | 124105 tok/s step 18257/19560 | loss 3.291595 (+0.24z)| norm 0.2110 (-0.14z)| lr 9.70e-06 | 4448.67 ms | 30.4% bf16 MFU | 123792 tok/s step 18258/19560 | loss 3.281817 (-0.02z)| norm 0.2108 (-0.15z)| lr 9.69e-06 | 4268.88 ms | 31.6% bf16 MFU | 123743 tok/s step 18259/19560 | loss 3.381133 (+2.51z)| norm 0.2186 (+0.74z)| lr 9.67e-06 | 4353.73 ms | 31.0% bf16 MFU | 123577 tok/s step 18260/19560 | loss 3.295233 (+0.29z)| norm 0.2165 (+0.49z)| lr 9.66e-06 | 4228.03 ms | 31.9% bf16 MFU | 123599 tok/s step 18261/19560 | loss 3.321059 (+0.94z)| norm 0.2144 (+0.24z)| lr 9.65e-06 | 4160.51 ms | 32.5% bf16 MFU | 123720 tok/s step 18262/19560 | loss 3.283471 (-0.02z)| norm 0.2055 (-0.78z)| lr 9.63e-06 | 4164.31 ms | 32.4% bf16 MFU | 123829 tok/s step 18263/19560 | loss 3.230496 (-1.36z)| norm 0.2004 (-1.35z)| lr 9.62e-06 | 4193.24 ms | 32.2% bf16 MFU | 123889 tok/s step 18264/19560 | loss 3.249472 (-0.86z)| norm 0.2150 (+0.33z)| lr 9.60e-06 | 4332.24 ms | 31.2% bf16 MFU | 123745 tok/s step 18265/19560 | loss 3.269382 (-0.34z)| norm 0.2128 (+0.08z)| lr 9.59e-06 | 4163.65 ms | 32.4% bf16 MFU | 123854 tok/s step 18266/19560 | loss 3.264990 (-0.46z)| norm 0.2040 (-1.02z)| lr 9.57e-06 | 4269.26 ms | 31.6% bf16 MFU | 123802 tok/s step 18267/19560 | loss 3.257340 (-0.65z)| norm 0.2106 (-0.14z)| lr 9.56e-06 | 4202.12 ms | 32.1% bf16 MFU | 123850 tok/s step 
18268/19560 | loss 3.271803 (-0.27z)| norm 0.2076 (-0.53z)| lr 9.54e-06 | 4171.68 ms | 32.4% bf16 MFU | 123941 tok/s step 18269/19560 | loss 3.294975 (+0.32z)| norm 0.2144 (+0.43z)| lr 9.53e-06 | 4167.23 ms | 32.4% bf16 MFU | 124035 tok/s step 18270/19560 | loss 3.318716 (+0.93z)| norm 0.2134 (+0.31z)| lr 9.51e-06 | 4181.01 ms | 32.3% bf16 MFU | 124103 tok/s step 18271/19560 | loss 3.251854 (-0.79z)| norm 0.2108 (-0.08z)| lr 9.50e-06 | 4178.49 ms | 32.3% bf16 MFU | 124171 tok/s step 18272/19560 | loss 3.309087 (+0.69z)| norm 0.2022 (-1.31z)| lr 9.48e-06 | 4293.54 ms | 31.4% bf16 MFU | 124068 tok/s step 18273/19560 | loss 3.364075 (+2.06z)| norm 0.2341 (+3.14z)| lr 9.47e-06 | 5053.18 ms | 26.7% bf16 MFU | 123053 tok/s step 18274/19560 | loss 3.296459 (+0.32z)| norm 0.2088 (-0.37z)| lr 9.45e-06 | 4163.65 ms | 32.4% bf16 MFU | 123196 tok/s step 18275/19560 | loss 3.334613 (+1.28z)| norm 0.2194 (+1.08z)| lr 9.44e-06 | 4159.28 ms | 32.5% bf16 MFU | 123339 tok/s step 18276/19560 | loss 3.242694 (-1.06z)| norm 0.2094 (-0.30z)| lr 9.42e-06 | 4289.79 ms | 31.5% bf16 MFU | 123283 tok/s step 18277/19560 | loss 3.257059 (-0.68z)| norm 0.2018 (-1.32z)| lr 9.41e-06 | 4162.96 ms | 32.4% bf16 MFU | 123416 tok/s step 18278/19560 | loss 3.293376 (+0.27z)| norm 0.2046 (-0.93z)| lr 9.40e-06 | 4160.28 ms | 32.5% bf16 MFU | 123546 tok/s step 18279/19560 | loss 3.321241 (+0.99z)| norm 0.2069 (-0.62z)| lr 9.38e-06 | 4212.64 ms | 32.1% bf16 MFU | 123592 tok/s step 18280/19560 | loss 3.250231 (-0.85z)| norm 0.2178 (+0.88z)| lr 9.37e-06 | 4237.48 ms | 31.9% bf16 MFU | 123598 tok/s step 18281/19560 | loss 3.288978 (+0.17z)| norm 0.2059 (-0.76z)| lr 9.35e-06 | 4282.71 ms | 31.5% bf16 MFU | 123539 tok/s step 18282/19560 | loss 3.242434 (-1.04z)| norm 0.2665 (+6.29z)| lr 9.34e-06 | 4367.27 ms | 30.9% bf16 MFU | 123365 tok/s step 18283/19560 | loss 3.258132 (-0.63z)| norm 0.2120 (+0.01z)| lr 9.32e-06 | 4190.96 ms | 32.2% bf16 MFU | 123452 tok/s step 18284/19560 | loss 3.279097 (-0.08z)| norm 
0.2051 (-0.77z)| lr 9.31e-06 | 4191.59 ms | 32.2% bf16 MFU | 123533 tok/s step 18285/19560 | loss 3.264957 (-0.45z)| norm 0.2057 (-0.70z)| lr 9.29e-06 | 4209.86 ms | 32.1% bf16 MFU | 123583 tok/s step 18286/19560 | loss 3.305527 (+0.60z)| norm 0.2066 (-0.58z)| lr 9.28e-06 | 4206.12 ms | 32.1% bf16 MFU | 123637 tok/s step 18287/19560 | loss 3.224534 (-1.50z)| norm 0.2060 (-0.65z)| lr 9.26e-06 | 4155.09 ms | 32.5% bf16 MFU | 123764 tok/s step 18288/19560 | loss 3.312213 (+0.77z)| norm 0.2062 (-0.62z)| lr 9.25e-06 | 4159.91 ms | 32.5% bf16 MFU | 123877 tok/s step 18289/19560 | loss 3.286936 (+0.13z)| norm 0.2112 (-0.04z)| lr 9.24e-06 | 4181.91 ms | 32.3% bf16 MFU | 123952 tok/s step 18290/19560 | loss 3.258834 (-0.60z)| norm 0.2108 (-0.10z)| lr 9.22e-06 | 4150.22 ms | 32.5% bf16 MFU | 124071 tok/s step 18291/19560 | loss 3.265611 (-0.42z)| norm 0.2036 (-0.92z)| lr 9.21e-06 | 4514.29 ms | 29.9% bf16 MFU | 123674 tok/s step 18292/19560 | loss 3.246958 (-0.90z)| norm 0.2034 (-0.93z)| lr 9.19e-06 | 4161.60 ms | 32.4% bf16 MFU | 123790 tok/s step 18293/19560 | loss 3.184723 (-2.45z)| norm 0.2070 (-0.51z)| lr 9.18e-06 | 4151.26 ms | 32.5% bf16 MFU | 123915 tok/s step 18294/19560 | loss 3.340810 (+1.51z)| norm 0.2080 (-0.38z)| lr 9.16e-06 | 4216.59 ms | 32.0% bf16 MFU | 123936 tok/s step 18295/19560 | loss 3.318978 (+0.98z)| norm 0.2035 (-0.90z)| lr 9.15e-06 | 4152.55 ms | 32.5% bf16 MFU | 124052 tok/s step 18296/19560 | loss 3.283425 (+0.08z)| norm 0.2009 (-1.18z)| lr 9.13e-06 | 4154.62 ms | 32.5% bf16 MFU | 124159 tok/s step 18297/19560 | loss 3.280344 (-0.01z)| norm 0.1993 (-1.34z)| lr 9.12e-06 | 4163.32 ms | 32.4% bf16 MFU | 124248 tok/s step 18298/19560 | loss 3.345144 (+1.69z)| norm 0.2128 (+0.21z)| lr 9.11e-06 | 4167.67 ms | 32.4% bf16 MFU | 124325 tok/s step 18299/19560 | loss 3.260563 (-0.52z)| norm 0.2110 (-0.00z)| lr 9.09e-06 | 4152.86 ms | 32.5% bf16 MFU | 124421 tok/s step 18300/19560 | loss 3.273281 (-0.19z)| norm 0.2026 (-0.95z)| lr 9.08e-06 | 4164.94 ms | 
32.4% bf16 MFU | 124494 tok/s step 18301/19560 | loss 3.265014 (-0.39z)| norm 0.2047 (-0.71z)| lr 9.06e-06 | 4157.81 ms | 32.5% bf16 MFU | 124575 tok/s step 18302/19560 | loss 3.247309 (-0.86z)| norm 0.2054 (-0.63z)| lr 9.05e-06 | 4166.76 ms | 32.4% bf16 MFU | 124637 tok/s step 18303/19560 | loss 3.280809 (+0.02z)| norm 0.2069 (-0.45z)| lr 9.03e-06 | 4156.41 ms | 32.5% bf16 MFU | 124712 tok/s step 18304/19560 | loss 3.277627 (-0.06z)| norm 0.2014 (-1.07z)| lr 9.02e-06 | 4241.73 ms | 31.8% bf16 MFU | 124657 tok/s step 18305/19560 | loss 3.206733 (-1.90z)| norm 0.2089 (-0.22z)| lr 9.01e-06 | 4159.86 ms | 32.5% bf16 MFU | 124726 tok/s step 18306/19560 | loss 3.321834 (+1.11z)| norm 0.1984 (-1.40z)| lr 8.99e-06 | 4207.90 ms | 32.1% bf16 MFU | 124719 tok/s step 18307/19560 | loss 3.287228 (+0.20z)| norm 0.1994 (-1.27z)| lr 8.98e-06 | 4173.81 ms | 32.3% bf16 MFU | 124764 tok/s step 18308/19560 | loss 3.276305 (-0.09z)| norm 0.2103 (-0.04z)| lr 8.96e-06 | 4455.15 ms | 30.3% bf16 MFU | 124410 tok/s step 18309/19560 | loss 3.353425 (+1.92z)| norm 0.2539 (+4.46z)| lr 8.95e-06 | 4169.18 ms | 32.4% bf16 MFU | 124477 tok/s step 18310/19560 | loss 3.261081 (-0.50z)| norm 0.2034 (-0.78z)| lr 8.93e-06 | 4165.94 ms | 32.4% bf16 MFU | 124546 tok/s step 18311/19560 | loss 3.259579 (-0.53z)| norm 0.2196 (+0.87z)| lr 8.92e-06 | 4383.73 ms | 30.8% bf16 MFU | 124298 tok/s step 18312/19560 | loss 3.296584 (+0.43z)| norm 0.2036 (-0.78z)| lr 8.91e-06 | 4155.44 ms | 32.5% bf16 MFU | 124392 tok/s step 18313/19560 | loss 3.282207 (+0.06z)| norm 0.2145 (+0.35z)| lr 8.89e-06 | 4169.55 ms | 32.4% bf16 MFU | 124459 tok/s step 18314/19560 | loss 3.231549 (-1.26z)| norm 0.2068 (-0.46z)| lr 8.88e-06 | 4173.56 ms | 32.4% bf16 MFU | 124517 tok/s step 18315/19560 | loss 3.229540 (-1.29z)| norm 0.2111 (-0.01z)| lr 8.86e-06 | 4158.96 ms | 32.5% bf16 MFU | 124595 tok/s step 18316/19560 | loss 3.341424 (+1.60z)| norm 0.2086 (-0.27z)| lr 8.85e-06 | 4313.15 ms | 31.3% bf16 MFU | 124443 tok/s step 18317/19560 
| loss 3.275292 (-0.10z)| norm 0.2252 (+1.44z)| lr 8.84e-06 | 4255.66 ms | 31.7% bf16 MFU | 124381 tok/s step 18318/19560 | loss 3.282411 (+0.08z)| norm 0.2156 (+0.44z)| lr 8.82e-06 | 4224.15 ms | 32.0% bf16 MFU | 124367 tok/s step 18319/19560 | loss 3.240139 (-1.02z)| norm 0.2177 (+0.66z)| lr 8.81e-06 | 4171.96 ms | 32.4% bf16 MFU | 124432 tok/s step 18320/19560 | loss 3.234131 (-1.17z)| norm 0.2121 (+0.08z)| lr 8.79e-06 | 4175.40 ms | 32.3% bf16 MFU | 124489 tok/s step 18321/19560 | loss 3.204421 (-1.92z)| norm 0.2125 (+0.11z)| lr 8.78e-06 | 4167.04 ms | 32.4% bf16 MFU | 124556 tok/s step 18322/19560 | loss 3.261605 (-0.44z)| norm 0.2112 (-0.01z)| lr 8.76e-06 | 4183.91 ms | 32.3% bf16 MFU | 124593 tok/s step 18323/19560 | loss 3.274343 (-0.11z)| norm 0.2124 (+0.11z)| lr 8.75e-06 | 4159.02 ms | 32.5% bf16 MFU | 124667 tok/s step 18324/19560 | loss 3.236892 (-1.06z)| norm 0.2065 (-0.51z)| lr 8.74e-06 | 4169.80 ms | 32.4% bf16 MFU | 124720 tok/s step 18325/19560 | loss 3.290828 (+0.34z)| norm 0.2047 (-0.68z)| lr 8.72e-06 | 4173.73 ms | 32.3% bf16 MFU | 124765 tok/s step 18326/19560 | loss 3.253389 (-0.62z)| norm 0.2126 (+0.14z)| lr 8.71e-06 | 4158.39 ms | 32.5% bf16 MFU | 124831 tok/s step 18327/19560 | loss 3.300519 (+0.61z)| norm 0.2223 (+1.14z)| lr 8.69e-06 | 4174.80 ms | 32.3% bf16 MFU | 124868 tok/s step 18328/19560 | loss 3.250190 (-0.69z)| norm 0.2149 (+0.37z)| lr 8.68e-06 | 4174.72 ms | 32.3% bf16 MFU | 124904 tok/s step 18329/19560 | loss 3.277678 (+0.04z)| norm 0.2112 (-0.02z)| lr 8.67e-06 | 4427.49 ms | 30.5% bf16 MFU | 124580 tok/s step 18330/19560 | loss 3.266470 (-0.26z)| norm 0.5254 (+10.65z)| lr 8.65e-06 | 4173.24 ms | 32.4% bf16 MFU | 124632 tok/s step 18331/19560 | loss 3.244943 (-0.82z)| norm 0.2100 (-0.13z)| lr 8.64e-06 | 4202.62 ms | 32.1% bf16 MFU | 124638 tok/s step 18332/19560 | loss 3.224162 (-1.35z)| norm 0.2075 (-0.22z)| lr 8.62e-06 | 4175.13 ms | 32.3% bf16 MFU | 124685 tok/s step 18333/19560 | loss 3.314784 (+1.07z)| norm 0.2117 
(-0.07z)| lr 8.61e-06 | 4163.83 ms | 32.4% bf16 MFU | 124747 tok/s step 18334/19560 | loss 3.211933 (-1.65z)| norm 0.2052 (-0.29z)| lr 8.60e-06 | 4176.82 ms | 32.3% bf16 MFU | 124785 tok/s step 18335/19560 | loss 3.253901 (-0.54z)| norm 0.2150 (+0.04z)| lr 8.58e-06 | 4161.67 ms | 32.4% bf16 MFU | 124845 tok/s step 18336/19560 | loss 3.270484 (-0.10z)| norm 0.2150 (+0.04z)| lr 8.57e-06 | 4181.25 ms | 32.3% bf16 MFU | 124872 tok/s step 18337/19560 | loss 3.270851 (-0.09z)| norm 0.2199 (+0.21z)| lr 8.55e-06 | 4170.09 ms | 32.4% bf16 MFU | 124915 tok/s step 18338/19560 | loss 3.245642 (-0.76z)| norm 0.2106 (-0.11z)| lr 8.54e-06 | 4165.65 ms | 32.4% bf16 MFU | 124962 tok/s step 18339/19560 | loss 3.225009 (-1.30z)| norm 0.2151 (+0.05z)| lr 8.53e-06 | 4178.04 ms | 32.3% bf16 MFU | 124989 tok/s step 18340/19560 | loss 3.255438 (-0.48z)| norm 0.2054 (-0.28z)| lr 8.51e-06 | 4163.74 ms | 32.4% bf16 MFU | 125035 tok/s step 18341/19560 | loss 3.247815 (-0.69z)| norm 0.2042 (-0.32z)| lr 8.50e-06 | 4163.22 ms | 32.4% bf16 MFU | 125080 tok/s step 18342/19560 | loss 3.320195 (+1.27z)| norm 0.2186 (+0.17z)| lr 8.48e-06 | 4177.57 ms | 32.3% bf16 MFU | 125101 tok/s step 18343/19560 | loss 3.290805 (+0.50z)| norm 0.2040 (-0.33z)| lr 8.47e-06 | 4175.70 ms | 32.3% bf16 MFU | 125124 tok/s step 18344/19560 | loss 3.287401 (+0.40z)| norm 0.2122 (-0.05z)| lr 8.46e-06 | 4170.46 ms | 32.4% bf16 MFU | 125153 tok/s step 18345/19560 | loss 3.294586 (+0.60z)| norm 0.2007 (-0.45z)| lr 8.44e-06 | 4164.71 ms | 32.4% bf16 MFU | 125190 tok/s step 18346/19560 | loss 3.247045 (-0.71z)| norm 0.2052 (-0.29z)| lr 8.43e-06 | 4180.56 ms | 32.3% bf16 MFU | 125201 tok/s step 18347/19560 | loss 3.255764 (-0.46z)| norm 0.2052 (-0.29z)| lr 8.42e-06 | 4161.18 ms | 32.4% bf16 MFU | 125241 tok/s step 18348/19560 | loss 3.256914 (-0.42z)| norm 0.2065 (-0.24z)| lr 8.40e-06 | 4174.32 ms | 32.3% bf16 MFU | 125259 tok/s step 18349/19560 | loss 3.252444 (-0.54z)| norm 0.2171 (+0.12z)| lr 8.39e-06 | 4164.57 ms | 32.4% bf16 
MFU | 125290 tok/s step 18350/19560 | loss 3.238378 (-0.95z)| norm 0.2046 (-0.31z)| lr 8.37e-06 | 4167.10 ms | 32.4% bf16 MFU | 125317 tok/s step 18351/19560 | loss 3.300474 (+0.79z)| norm 0.2228 (+0.31z)| lr 8.36e-06 | 4172.56 ms | 32.4% bf16 MFU | 125333 tok/s step 18352/19560 | loss 3.365619 (+2.56z)| norm 0.2121 (-0.05z)| lr 8.35e-06 | 4162.56 ms | 32.4% bf16 MFU | 125364 tok/s step 18353/19560 | loss 3.242908 (-0.87z)| norm 0.2009 (-0.43z)| lr 8.33e-06 | 4186.41 ms | 32.3% bf16 MFU | 125358 tok/s step 18354/19560 | loss 3.284549 (+0.29z)| norm 0.2207 (+0.24z)| lr 8.32e-06 | 4172.05 ms | 32.4% bf16 MFU | 125373 tok/s step 18355/19560 | loss 3.397593 (+3.28z)| norm 0.2493 (+1.20z)| lr 8.31e-06 | 4169.42 ms | 32.4% bf16 MFU | 125392 tok/s step 18356/19560 | loss 3.245718 (-0.80z)| norm 0.2072 (-0.23z)| lr 8.29e-06 | 4184.30 ms | 32.3% bf16 MFU | 125387 tok/s step 18357/19560 | loss 3.293347 (+0.47z)| norm 0.2198 (+0.20z)| lr 8.28e-06 | 4225.45 ms | 32.0% bf16 MFU | 125322 tok/s step 18358/19560 | loss 3.276696 (+0.03z)| norm 0.2081 (-0.20z)| lr 8.26e-06 | 4171.30 ms | 32.4% bf16 MFU | 125340 tok/s step 18359/19560 | loss 3.291235 (+0.42z)| norm 0.2052 (-0.29z)| lr 8.25e-06 | 4161.60 ms | 32.4% bf16 MFU | 125372 tok/s step 18360/19560 | loss 3.291878 (+0.43z)| norm 0.2206 (+0.23z)| lr 8.24e-06 | 4169.70 ms | 32.4% bf16 MFU | 125391 tok/s step 18361/19560 | loss 3.268173 (-0.20z)| norm 0.2013 (-0.43z)| lr 8.22e-06 | 4171.46 ms | 32.4% bf16 MFU | 125405 tok/s step 18362/19560 | loss 3.257661 (-0.48z)| norm 0.2051 (-0.29z)| lr 8.21e-06 | 4170.29 ms | 32.4% bf16 MFU | 125421 tok/s step 18363/19560 | loss 3.265383 (-0.27z)| norm 0.2137 (-0.01z)| lr 8.20e-06 | 4177.52 ms | 32.3% bf16 MFU | 125425 tok/s step 18364/19560 | loss 3.286405 (+0.30z)| norm 0.2108 (-0.10z)| lr 8.18e-06 | 4190.45 ms | 32.2% bf16 MFU | 125410 tok/s step 18365/19560 | loss 3.252369 (-0.62z)| norm 0.2145 (+0.02z)| lr 8.17e-06 | 4169.39 ms | 32.4% bf16 MFU | 125426 tok/s step 18366/19560 | loss 
3.270265 (-0.14z)| norm 0.2105 (-0.11z)| lr 8.16e-06 | 4217.50 ms | 32.0% bf16 MFU | 125371 tok/s step 18367/19560 | loss 3.359746 (+2.21z)| norm 0.2169 (+0.10z)| lr 8.14e-06 | 4169.06 ms | 32.4% bf16 MFU | 125390 tok/s step 18368/19560 | loss 3.263221 (-0.35z)| norm 0.2045 (-0.32z)| lr 8.13e-06 | 4174.27 ms | 32.3% bf16 MFU | 125401 tok/s step 18369/19560 | loss 3.298076 (+0.57z)| norm 0.2007 (-0.45z)| lr 8.11e-06 | 4168.73 ms | 32.4% bf16 MFU | 125419 tok/s step 18370/19560 | loss 3.293411 (+0.46z)| norm 0.2239 (+0.35z)| lr 8.10e-06 | 4158.47 ms | 32.5% bf16 MFU | 125452 tok/s step 18371/19560 | loss 3.318561 (+1.14z)| norm 0.2109 (-0.09z)| lr 8.09e-06 | 4176.42 ms | 32.3% bf16 MFU | 125456 tok/s step 18372/19560 | loss 3.323164 (+1.24z)| norm 0.2054 (-0.28z)| lr 8.07e-06 | 4176.13 ms | 32.3% bf16 MFU | 125460 tok/s step 18373/19560 | loss 3.307441 (+0.81z)| norm 0.2027 (-0.38z)| lr 8.06e-06 | 4158.70 ms | 32.5% bf16 MFU | 125491 tok/s step 18374/19560 | loss 3.268777 (-0.22z)| norm 0.2158 (+0.07z)| lr 8.05e-06 | 4174.22 ms | 32.3% bf16 MFU | 125496 tok/s step 18375/19560 | loss 3.304872 (+0.74z)| norm 0.2128 (-0.03z)| lr 8.03e-06 | 4164.16 ms | 32.4% bf16 MFU | 125517 tok/s step 18376/19560 | loss 3.217901 (-1.57z)| norm 0.2083 (-0.19z)| lr 8.02e-06 | 4194.55 ms | 32.2% bf16 MFU | 125491 tok/s step 18377/19560 | loss 3.273661 (-0.09z)| norm 0.2073 (-0.21z)| lr 8.01e-06 | 4259.11 ms | 31.7% bf16 MFU | 125371 tok/s step 18378/19560 | loss 3.209683 (-1.75z)| norm 0.2215 (+0.27z)| lr 7.99e-06 | 4176.11 ms | 32.3% bf16 MFU | 125380 tok/s step 18379/19560 | loss 3.255113 (-0.55z)| norm 0.2076 (-0.20z)| lr 7.98e-06 | 4169.05 ms | 32.4% bf16 MFU | 125399 tok/s step 18380/19560 | loss 3.300565 (+0.64z)| norm 0.2073 (-0.21z)| lr 7.97e-06 | 4168.98 ms | 32.4% bf16 MFU | 125417 tok/s step 18381/19560 | loss 3.241993 (-0.89z)| norm 0.2133 (-0.01z)| lr 7.95e-06 | 4160.17 ms | 32.5% bf16 MFU | 125447 tok/s step 18382/19560 | loss 3.260008 (-0.43z)| norm 0.2067 (-0.23z)| lr 
7.94e-06 | 4161.65 ms | 32.4% bf16 MFU | 125474 tok/s step 18383/19560 | loss 3.323829 (+1.28z)| norm 0.2069 (-0.22z)| lr 7.93e-06 | 4174.06 ms | 32.3% bf16 MFU | 125480 tok/s step 18384/19560 | loss 3.277799 (+0.03z)| norm 0.2038 (-0.33z)| lr 7.91e-06 | 4160.17 ms | 32.5% bf16 MFU | 125508 tok/s step 18385/19560 | loss 3.273634 (-0.08z)| norm 0.2044 (-0.31z)| lr 7.90e-06 | 4169.94 ms | 32.4% bf16 MFU | 125519 tok/s step 18386/19560 | loss 3.200989 (-2.00z)| norm 0.2148 (+0.05z)| lr 7.89e-06 | 4178.49 ms | 32.3% bf16 MFU | 125516 tok/s step 18387/19560 | loss 3.233601 (-1.13z)| norm 0.2046 (-0.29z)| lr 7.87e-06 | 4174.96 ms | 32.3% bf16 MFU | 125520 tok/s step 18388/19560 | loss 3.292284 (+0.48z)| norm 0.2318 (+0.63z)| lr 7.86e-06 | 4165.29 ms | 32.4% bf16 MFU | 125537 tok/s step 18389/19560 | loss 3.292311 (+0.49z)| norm 0.2049 (-0.29z)| lr 7.85e-06 | 4154.16 ms | 32.5% bf16 MFU | 125571 tok/s step 18390/19560 | loss 3.230584 (-1.19z)| norm 0.2085 (-0.17z)| lr 7.83e-06 | 4168.00 ms | 32.4% bf16 MFU | 125582 tok/s step 18391/19560 | loss 3.252305 (-0.60z)| norm 0.2068 (-0.22z)| lr 7.82e-06 | 4173.54 ms | 32.4% bf16 MFU | 125584 tok/s step 18392/19560 | loss 3.243978 (-0.83z)| norm 0.2018 (-0.39z)| lr 7.81e-06 | 4171.53 ms | 32.4% bf16 MFU | 125589 tok/s step 18393/19560 | loss 3.294877 (+0.56z)| norm 0.2061 (-0.24z)| lr 7.79e-06 | 4172.65 ms | 32.4% bf16 MFU | 125592 tok/s step 18394/19560 | loss 3.400134 (+3.28z)| norm 0.2078 (-0.18z)| lr 7.78e-06 | 4164.21 ms | 32.4% bf16 MFU | 125607 tok/s step 18395/19560 | loss 3.273345 (-0.06z)| norm 0.2032 (-0.34z)| lr 7.77e-06 | 4163.71 ms | 32.4% bf16 MFU | 125623 tok/s step 18396/19560 | loss 3.232956 (-1.11z)| norm 0.2093 (-0.13z)| lr 7.75e-06 | 4171.88 ms | 32.4% bf16 MFU | 125625 tok/s step 18397/19560 | loss 3.274072 (-0.03z)| norm 0.2044 (-0.30z)| lr 7.74e-06 | 4163.44 ms | 32.4% bf16 MFU | 125640 tok/s step 18398/19560 | loss 3.292150 (+0.45z)| norm 0.2059 (-0.24z)| lr 7.73e-06 | 4165.42 ms | 32.4% bf16 MFU | 125652 
tok/s step 18399/19560 | loss 3.223289 (-1.35z)| norm 0.2015 (-0.39z)| lr 7.71e-06 | 4161.34 ms | 32.4% bf16 MFU | 125668 tok/s step 18400/19560 | loss 3.254340 (-0.52z)| norm 0.2137 (+0.02z)| lr 7.70e-06 | 4165.48 ms | 32.4% bf16 MFU | 125678 tok/s step 18401/19560 | loss 3.336693 (+1.66z)| norm 0.2204 (+0.25z)| lr 7.69e-06 | 4175.99 ms | 32.3% bf16 MFU | 125672 tok/s step 18402/19560 | loss 3.260215 (-0.36z)| norm 0.2124 (-0.02z)| lr 7.67e-06 | 4164.32 ms | 32.4% bf16 MFU | 125683 tok/s step 18403/19560 | loss 3.299403 (+0.69z)| norm 0.2003 (-0.43z)| lr 7.66e-06 | 4169.56 ms | 32.4% bf16 MFU | 125686 tok/s step 18404/19560 | loss 3.299155 (+0.68z)| norm 0.2222 (+0.32z)| lr 7.65e-06 | 4169.36 ms | 32.4% bf16 MFU | 125689 tok/s step 18405/19560 | loss 3.326178 (+1.38z)| norm 0.1977 (-0.52z)| lr 7.63e-06 | 4173.01 ms | 32.4% bf16 MFU | 125687 tok/s step 18406/19560 | loss 3.245005 (-0.77z)| norm 0.2085 (-0.15z)| lr 7.62e-06 | 4161.66 ms | 32.4% bf16 MFU | 125701 tok/s step 18407/19560 | loss 3.319542 (+1.21z)| norm 0.2080 (-0.17z)| lr 7.61e-06 | 4170.20 ms | 32.4% bf16 MFU | 125702 tok/s step 18408/19560 | loss 3.207755 (-1.74z)| norm 0.2048 (-0.27z)| lr 7.59e-06 | 4192.53 ms | 32.2% bf16 MFU | 125670 tok/s step 18409/19560 | loss 3.248797 (-0.65z)| norm 0.2080 (-0.17z)| lr 7.58e-06 | 4172.86 ms | 32.4% bf16 MFU | 125669 tok/s step 18410/19560 | loss 3.246671 (-0.71z)| norm 0.2065 (-0.20z)| lr 7.57e-06 | 4170.35 ms | 32.4% bf16 MFU | 125671 tok/s step 18411/19560 | loss 3.256644 (-0.44z)| norm 0.2209 (+0.29z)| lr 7.55e-06 | 4159.96 ms | 32.5% bf16 MFU | 125689 tok/s step 18412/19560 | loss 3.239871 (-0.87z)| norm 0.2176 (+0.17z)| lr 7.54e-06 | 4770.93 ms | 28.3% bf16 MFU | 124899 tok/s step 18413/19560 | loss 3.275077 (+0.05z)| norm 0.2032 (-0.32z)| lr 7.53e-06 | 4856.29 ms | 27.8% bf16 MFU | 124052 tok/s step 18414/19560 | loss 3.285160 (+0.32z)| norm 0.2038 (-0.30z)| lr 7.52e-06 | 4532.62 ms | 29.8% bf16 MFU | 123633 tok/s step 18415/19560 | loss 3.297944 
(+0.64z)| norm 0.2043 (-0.28z)| lr 7.50e-06 | 4710.52 ms | 28.7% bf16 MFU | 123017 tok/s step 18416/19560 | loss 3.299925 (+0.70z)| norm 0.2009 (-0.40z)| lr 7.49e-06 | 4673.97 ms | 28.9% bf16 MFU | 122474 tok/s step 18417/19560 | loss 3.296215 (+0.60z)| norm 0.2143 (+0.06z)| lr 7.48e-06 | 4309.13 ms | 31.3% bf16 MFU | 122434 tok/s step 18418/19560 | loss 3.254827 (-0.50z)| norm 0.2045 (-0.27z)| lr 7.46e-06 | 4383.62 ms | 30.8% bf16 MFU | 122292 tok/s step 18419/19560 | loss 3.305982 (+0.85z)| norm 0.2044 (-0.28z)| lr 7.45e-06 | 4357.67 ms | 31.0% bf16 MFU | 122193 tok/s step 18420/19560 | loss 3.251453 (-0.59z)| norm 0.2156 (+0.10z)| lr 7.44e-06 | 4310.52 ms | 31.3% bf16 MFU | 122165 tok/s step 18421/19560 | loss 3.258570 (-0.43z)| norm 0.2085 (-0.14z)| lr 7.42e-06 | 4199.89 ms | 32.1% bf16 MFU | 122299 tok/s step 18422/19560 | loss 3.237997 (-0.97z)| norm 0.2102 (-0.08z)| lr 7.41e-06 | 4276.83 ms | 31.6% bf16 MFU | 122313 tok/s step 18423/19560 | loss 3.283071 (+0.26z)| norm 0.2007 (-0.41z)| lr 7.40e-06 | 4251.73 ms | 31.8% bf16 MFU | 122363 tok/s step 18424/19560 | loss 3.319100 (+1.23z)| norm 0.2213 (+0.30z)| lr 7.39e-06 | 4214.00 ms | 32.0% bf16 MFU | 122466 tok/s step 18425/19560 | loss 3.272995 (-0.02z)| norm 0.2113 (-0.05z)| lr 7.37e-06 | 4211.26 ms | 32.1% bf16 MFU | 122567 tok/s step 18426/19560 | loss 3.301816 (+0.79z)| norm 0.2072 (-0.19z)| lr 7.36e-06 | 4186.07 ms | 32.3% bf16 MFU | 122701 tok/s step 18427/19560 | loss 3.264862 (-0.23z)| norm 0.2006 (-0.42z)| lr 7.35e-06 | 4286.46 ms | 31.5% bf16 MFU | 122682 tok/s step 18428/19560 | loss 3.263174 (-0.28z)| norm 0.2091 (-0.13z)| lr 7.33e-06 | 4213.50 ms | 32.0% bf16 MFU | 122769 tok/s step 18429/19560 | loss 3.257289 (-0.44z)| norm 0.2027 (-0.35z)| lr 7.32e-06 | 4225.36 ms | 32.0% bf16 MFU | 122835 tok/s step 18430/19560 | loss 3.259151 (-0.39z)| norm 0.2018 (-0.38z)| lr 7.31e-06 | 4416.41 ms | 30.6% bf16 MFU | 122629 tok/s step 18431/19560 | loss 3.336073 (+1.70z)| norm 0.2184 (+0.19z)| lr 7.29e-06 | 
4272.70 ms | 31.6% bf16 MFU | 122633 tok/s step 18432/19560 | loss 3.251019 (-0.61z)| norm 0.2026 (-0.35z)| lr 7.28e-06 | 4252.28 ms | 31.8% bf16 MFU | 122666 tok/s step 18433/19560 | loss 3.302706 (+0.78z)| norm 0.2097 (-0.11z)| lr 7.27e-06 | 4216.67 ms | 32.0% bf16 MFU | 122749 tok/s step 18434/19560 | loss 3.272384 (-0.04z)| norm 0.2034 (-0.33z)| lr 7.26e-06 | 4170.29 ms | 32.4% bf16 MFU | 122898 tok/s step 18435/19560 | loss 3.309823 (+0.99z)| norm 0.2067 (-0.21z)| lr 7.24e-06 | 4180.04 ms | 32.3% bf16 MFU | 123024 tok/s step 18436/19560 | loss 3.291616 (+0.48z)| norm 0.2080 (-0.17z)| lr 7.23e-06 | 4176.20 ms | 32.3% bf16 MFU | 123150 tok/s step 18437/19560 | loss 3.278841 (+0.15z)| norm 0.2121 (-0.02z)| lr 7.22e-06 | 4215.27 ms | 32.0% bf16 MFU | 123212 tok/s step 18438/19560 | loss 3.406503 (+3.52z)| norm 0.2190 (+0.22z)| lr 7.20e-06 | 4161.48 ms | 32.4% bf16 MFU | 123350 tok/s step 18439/19560 | loss 3.251360 (-0.62z)| norm 0.1944 (-0.63z)| lr 7.19e-06 | 4169.52 ms | 32.4% bf16 MFU | 123470 tok/s step 18440/19560 | loss 3.285423 (+0.29z)| norm 0.2038 (-0.30z)| lr 7.18e-06 | 4176.87 ms | 32.3% bf16 MFU | 123573 tok/s step 18441/19560 | loss 3.274397 (-0.00z)| norm 0.2168 (+0.15z)| lr 7.17e-06 | 4171.77 ms | 32.4% bf16 MFU | 123678 tok/s step 18442/19560 | loss 3.304512 (+0.79z)| norm 0.2037 (-0.31z)| lr 7.15e-06 | 4202.47 ms | 32.1% bf16 MFU | 123732 tok/s step 18443/19560 | loss 3.296703 (+0.57z)| norm 0.2084 (-0.14z)| lr 7.14e-06 | 4168.27 ms | 32.4% bf16 MFU | 123834 tok/s step 18444/19560 | loss 3.289144 (+0.38z)| norm 0.2050 (-0.26z)| lr 7.13e-06 | 4168.42 ms | 32.4% bf16 MFU | 123931 tok/s step 18445/19560 | loss 3.288188 (+0.35z)| norm 0.2036 (-0.30z)| lr 7.12e-06 | 4179.72 ms | 32.3% bf16 MFU | 124006 tok/s step 18446/19560 | loss 3.319140 (+1.18z)| norm 0.2199 (+0.26z)| lr 7.10e-06 | 4176.82 ms | 32.3% bf16 MFU | 124082 tok/s step 18447/19560 | loss 3.236480 (-1.06z)| norm 0.2063 (-0.20z)| lr 7.09e-06 | 4179.80 ms | 32.3% bf16 MFU | 124150 tok/s step 
18448/19560 | loss 3.259675 (-0.44z)| norm 0.2048 (-0.25z)| lr 7.08e-06 | 4227.58 ms | 31.9% bf16 MFU | 124143 tok/s step 18449/19560 | loss 3.246830 (-0.81z)| norm 0.2161 (+0.14z)| lr 7.06e-06 | 4203.69 ms | 32.1% bf16 MFU | 124172 tok/s step 18450/19560 | loss 3.287462 (+0.31z)| norm 0.2085 (-0.13z)| lr 7.05e-06 | 4178.07 ms | 32.3% bf16 MFU | 124238 tok/s step 18451/19560 | loss 3.272953 (-0.09z)| norm 0.2157 (+0.12z)| lr 7.04e-06 | 4182.10 ms | 32.3% bf16 MFU | 124294 tok/s step 18452/19560 | loss 3.380084 (+2.76z)| norm 0.2113 (-0.03z)| lr 7.03e-06 | 4222.33 ms | 32.0% bf16 MFU | 124288 tok/s step 18453/19560 | loss 3.242027 (-0.94z)| norm 0.2158 (+0.12z)| lr 7.01e-06 | 4187.52 ms | 32.2% bf16 MFU | 124334 tok/s step 18454/19560 | loss 3.341043 (+1.68z)| norm 0.1981 (-0.49z)| lr 7.00e-06 | 4167.55 ms | 32.4% bf16 MFU | 124407 tok/s step 18455/19560 | loss 3.264424 (-0.35z)| norm 0.2037 (-0.29z)| lr 6.99e-06 | 4194.65 ms | 32.2% bf16 MFU | 124436 tok/s step 18456/19560 | loss 3.265528 (-0.32z)| norm 0.2013 (-0.37z)| lr 6.98e-06 | 4187.07 ms | 32.2% bf16 MFU | 124475 tok/s step 18457/19560 | loss 3.282809 (+0.14z)| norm 0.2065 (-0.19z)| lr 6.96e-06 | 4170.09 ms | 32.4% bf16 MFU | 124538 tok/s step 18458/19560 | loss 3.274619 (-0.08z)| norm 0.2064 (-0.40z)| lr 6.95e-06 | 4176.64 ms | 32.3% bf16 MFU | 124587 tok/s step 18459/19560 | loss 3.308981 (+0.82z)| norm 0.2312 (+2.81z)| lr 6.94e-06 | 4174.18 ms | 32.3% bf16 MFU | 124638 tok/s step 18460/19560 | loss 3.336750 (+1.53z)| norm 0.2090 (-0.07z)| lr 6.93e-06 | 4165.80 ms | 32.4% bf16 MFU | 124699 tok/s step 18461/19560 | loss 3.265940 (-0.34z)| norm 0.2008 (-1.12z)| lr 6.91e-06 | 4191.39 ms | 32.2% bf16 MFU | 124718 tok/s step 18462/19560 | loss 3.271003 (-0.22z)| norm 0.2149 (+0.69z)| lr 6.90e-06 | 4173.19 ms | 32.4% bf16 MFU | 124764 tok/s step 18463/19560 | loss 3.297694 (+0.49z)| norm 0.1982 (-1.44z)| lr 6.89e-06 | 4173.37 ms | 32.4% bf16 MFU | 124807 tok/s step 18464/19560 | loss 3.374991 (+2.49z)| norm 
0.2076 (-0.23z)| lr 6.88e-06 | 4181.80 ms | 32.3% bf16 MFU | 124835 tok/s step 18465/19560 | loss 3.298679 (+0.48z)| norm 0.2030 (-0.81z)| lr 6.86e-06 | 4179.38 ms | 32.3% bf16 MFU | 124866 tok/s step 18466/19560 | loss 3.238382 (-1.11z)| norm 0.2085 (-0.10z)| lr 6.85e-06 | 4272.39 ms | 31.6% bf16 MFU | 124758 tok/s step 18467/19560 | loss 3.275759 (-0.14z)| norm 0.1925 (-2.10z)| lr 6.84e-06 | 4183.40 ms | 32.3% bf16 MFU | 124787 tok/s step 18468/19560 | loss 3.305452 (+0.64z)| norm 0.2151 (+0.76z)| lr 6.83e-06 | 4168.57 ms | 32.4% bf16 MFU | 124836 tok/s step 18469/19560 | loss 3.285653 (+0.11z)| norm 0.1995 (-1.21z)| lr 6.81e-06 | 4176.43 ms | 32.3% bf16 MFU | 124871 tok/s step 18470/19560 | loss 3.239130 (-1.11z)| norm 0.2057 (-0.42z)| lr 6.80e-06 | 4179.50 ms | 32.3% bf16 MFU | 124900 tok/s step 18471/19560 | loss 3.224385 (-1.48z)| norm 0.2194 (+1.30z)| lr 6.79e-06 | 4181.16 ms | 32.3% bf16 MFU | 124924 tok/s step 18472/19560 | loss 3.277879 (-0.06z)| norm 0.2065 (-0.33z)| lr 6.78e-06 | 4177.07 ms | 32.3% bf16 MFU | 124954 tok/s step 18473/19560 | loss 3.294003 (+0.36z)| norm 0.2079 (-0.15z)| lr 6.76e-06 | 4171.63 ms | 32.4% bf16 MFU | 124990 tok/s step 18474/19560 | loss 3.260348 (-0.53z)| norm 0.2084 (-0.10z)| lr 6.75e-06 | 4208.03 ms | 32.1% bf16 MFU | 124970 tok/s step 18475/19560 | loss 3.279455 (-0.03z)| norm 0.2042 (-0.63z)| lr 6.74e-06 | 4169.54 ms | 32.4% bf16 MFU | 125009 tok/s step 18476/19560 | loss 3.278548 (-0.06z)| norm 0.2074 (-0.22z)| lr 6.73e-06 | 4173.84 ms | 32.3% bf16 MFU | 125039 tok/s step 18477/19560 | loss 3.220225 (-1.59z)| norm 0.1980 (-1.40z)| lr 6.71e-06 | 4168.14 ms | 32.4% bf16 MFU | 125076 tok/s step 18478/19560 | loss 3.293280 (+0.33z)| norm 0.2106 (+0.19z)| lr 6.70e-06 | 4176.07 ms | 32.3% bf16 MFU | 125100 tok/s step 18479/19560 | loss 3.228832 (-1.35z)| norm 0.1976 (-1.43z)| lr 6.69e-06 | 4178.24 ms | 32.3% bf16 MFU | 125119 tok/s step 18480/19560 | loss 3.263303 (-0.44z)| norm 0.2061 (-0.35z)| lr 6.68e-06 | 4172.45 ms | 
32.4% bf16 MFU | 125146 tok/s step 18481/19560 | loss 3.345129 (+1.72z)| norm 0.2079 (-0.13z)| lr 6.67e-06 | 4177.86 ms | 32.3% bf16 MFU | 125163 tok/s step 18482/19560 | loss 3.266767 (-0.36z)| norm 0.2088 (+0.01z)| lr 6.65e-06 | 4200.48 ms | 32.1% bf16 MFU | 125146 tok/s step 18483/19560 | loss 3.254534 (-0.68z)| norm 0.2116 (+0.45z)| lr 6.64e-06 | 4161.94 ms | 32.4% bf16 MFU | 125187 tok/s step 18484/19560 | loss 3.259190 (-0.55z)| norm 0.2018 (-0.97z)| lr 6.63e-06 | 4187.06 ms | 32.2% bf16 MFU | 125188 tok/s step 18485/19560 | loss 3.237748 (-1.13z)| norm 0.2048 (-0.51z)| lr 6.62e-06 | 4163.00 ms | 32.4% bf16 MFU | 125226 tok/s step 18486/19560 | loss 3.457452 (+4.47z)| norm 0.2457 (+4.89z)| lr 6.60e-06 | 4173.18 ms | 32.4% bf16 MFU | 125246 tok/s step 18487/19560 | loss 3.290633 (+0.26z)| norm 0.2132 (+0.60z)| lr 6.59e-06 | 4386.03 ms | 30.8% bf16 MFU | 124961 tok/s step 18488/19560 | loss 3.264110 (-0.40z)| norm 0.2082 (-0.05z)| lr 6.58e-06 | 4179.14 ms | 32.3% bf16 MFU | 124985 tok/s step 18489/19560 | loss 3.292161 (+0.30z)| norm 0.2162 (+0.99z)| lr 6.57e-06 | 4175.58 ms | 32.3% bf16 MFU | 125014 tok/s step 18490/19560 | loss 3.226760 (-1.33z)| norm 0.2080 (-0.10z)| lr 6.55e-06 | 4165.22 ms | 32.4% bf16 MFU | 125057 tok/s step 18491/19560 | loss 3.211055 (-1.70z)| norm 0.2185 (+1.29z)| lr 6.54e-06 | 4180.97 ms | 32.3% bf16 MFU | 125074 tok/s step 18492/19560 | loss 3.294672 (+0.37z)| norm 0.2117 (+0.40z)| lr 6.53e-06 | 4217.53 ms | 32.0% bf16 MFU | 125036 tok/s step 18493/19560 | loss 3.251146 (-0.70z)| norm 0.2015 (-0.95z)| lr 6.52e-06 | 4228.12 ms | 31.9% bf16 MFU | 124984 tok/s step 18494/19560 | loss 3.368946 (+2.16z)| norm 0.2203 (+1.52z)| lr 6.51e-06 | 4179.04 ms | 32.3% bf16 MFU | 125008 tok/s step 18495/19560 | loss 3.235932 (-1.07z)| norm 0.2289 (+2.58z)| lr 6.49e-06 | 4251.85 ms | 31.8% bf16 MFU | 124923 tok/s step 18496/19560 | loss 3.272932 (-0.16z)| norm 0.2049 (-0.51z)| lr 6.48e-06 | 4179.68 ms | 32.3% bf16 MFU | 124949 tok/s step 18497/19560 
| loss 3.258724 (-0.50z)| norm 0.2010 (-1.00z)| lr 6.47e-06 | 4204.88 ms | 32.1% bf16 MFU | 124935 tok/s step 18498/19560 | loss 3.384261 (+2.51z)| norm 0.2167 (+1.02z)| lr 6.46e-06 | 4177.22 ms | 32.3% bf16 MFU | 124964 tok/s step 18499/19560 | loss 3.282337 (+0.07z)| norm 0.2002 (-1.10z)| lr 6.45e-06 | 4230.72 ms | 31.9% bf16 MFU | 124912 tok/s step 18500/19560 | loss 3.243496 (-0.86z)| norm 0.2024 (-0.80z)| lr 6.43e-06 | 4179.36 ms | 32.3% bf16 MFU | 124939 tok/s val loss 3.252230
HellaSwag: 3060/10042 = 0.304720 step 18501/19560 | loss 3.261404 (-0.42z)| norm 0.2055 (-0.41z)| lr 6.42e-06 | 4213.18 ms | 32.0% bf16 MFU | 124914 tok/s step 18502/19560 | loss 3.273704 (-0.12z)| norm 0.2036 (-0.64z)| lr 6.41e-06 | 4184.31 ms | 32.3% bf16 MFU | 124933 tok/s step 18503/19560 | loss 3.262195 (-0.39z)| norm 0.2021 (-0.83z)| lr 6.40e-06 | 4185.76 ms | 32.3% bf16 MFU | 124949 tok/s step 18504/19560 | loss 3.245310 (-0.81z)| norm 0.2007 (-1.00z)| lr 6.39e-06 | 4183.59 ms | 32.3% bf16 MFU | 124968 tok/s step 18505/19560 | loss 3.335519 (+1.36z)| norm 0.2011 (-0.93z)| lr 6.37e-06 | 4168.61 ms | 32.4% bf16 MFU | 125008 tok/s step 18506/19560 | loss 3.274170 (-0.13z)| norm 0.2048 (-0.45z)| lr 6.36e-06 | 4173.17 ms | 32.4% bf16 MFU | 125039 tok/s step 18507/19560 | loss 3.294118 (+0.35z)| norm 0.2094 (+0.15z)| lr 6.35e-06 | 4178.56 ms | 32.3% bf16 MFU | 125061 tok/s step 18508/19560 | loss 3.274855 (-0.12z)| norm 0.2031 (-0.67z)| lr 6.34e-06 | 4178.76 ms | 32.3% bf16 MFU | 125081 tok/s step 18509/19560 | loss 3.236535 (-1.05z)| norm 0.1955 (-1.62z)| lr 6.32e-06 | 4167.24 ms | 32.4% bf16 MFU | 125118 tok/s step 18510/19560 | loss 3.305974 (+0.64z)| norm 0.2095 (+0.17z)| lr 6.31e-06 | 4229.55 ms | 31.9% bf16 MFU | 125060 tok/s step 18511/19560 | loss 3.278207 (-0.03z)| norm 0.2099 (+0.22z)| lr 6.30e-06 | 4180.04 ms | 32.3% bf16 MFU | 125078 tok/s step 18512/19560 | loss 3.220735 (-1.42z)| norm 0.1987 (-1.21z)| lr 6.29e-06 | 4173.62 ms | 32.4% bf16 MFU | 125105 tok/s step 18513/19560 | loss 3.273742 (-0.13z)| norm 0.2070 (-0.15z)| lr 6.28e-06 | 4174.73
ms | 32.3% bf16 MFU | 125129 tok/s step 18514/19560 | loss 3.264433 (-0.37z)| norm 0.1971 (-1.39z)| lr 6.27e-06 | 4179.27 ms | 32.3% bf16 MFU | 125145 tok/s step 18515/19560 | loss 3.248240 (-0.78z)| norm 0.1990 (-1.13z)| lr 6.25e-06 | 4192.75 ms | 32.2% bf16 MFU | 125140 tok/s step 18516/19560 | loss 3.272482 (-0.18z)| norm 0.2156 (+1.01z)| lr 6.24e-06 | 4165.88 ms | 32.4% bf16 MFU | 125176 tok/s step 18517/19560 | loss 3.310236 (+0.75z)| norm 0.2105 (+0.35z)| lr 6.23e-06 | 4196.06 ms | 32.2% bf16 MFU | 125164 tok/s step 18518/19560 | loss 3.267313 (-0.31z)| norm 0.2193 (+1.47z)| lr 6.22e-06 | 4210.68 ms | 32.1% bf16 MFU | 125132 tok/s step 18519/19560 | loss 3.258006 (-0.55z)| norm 0.2315 (+2.92z)| lr 6.21e-06 | 4172.35 ms | 32.4% bf16 MFU | 125158 tok/s step 18520/19560 | loss 3.240954 (-0.97z)| norm 0.2121 (+0.48z)| lr 6.19e-06 | 4175.97 ms | 32.3% bf16 MFU | 125178 tok/s step 18521/19560 | loss 3.246674 (-0.82z)| norm 0.1984 (-1.22z)| lr 6.18e-06 | 4182.65 ms | 32.3% bf16 MFU | 125186 tok/s step 18522/19560 | loss 3.322954 (+1.12z)| norm 0.2076 (-0.08z)| lr 6.17e-06 | 4197.81 ms | 32.2% bf16 MFU | 125172 tok/s step 18523/19560 | loss 3.286627 (+0.19z)| norm 0.2194 (+1.38z)| lr 6.16e-06 | 4174.50 ms | 32.3% bf16 MFU | 125193 tok/s step 18524/19560 | loss 3.207579 (-1.81z)| norm 0.2067 (-0.19z)| lr 6.15e-06 | 4175.21 ms | 32.3% bf16 MFU | 125212 tok/s step 18525/19560 | loss 3.279421 (+0.01z)| norm 0.2097 (+0.18z)| lr 6.13e-06 | 4172.05 ms | 32.4% bf16 MFU | 125234 tok/s step 18526/19560 | loss 3.310681 (+0.80z)| norm 0.2077 (-0.08z)| lr 6.12e-06 | 4172.85 ms | 32.4% bf16 MFU | 125255 tok/s step 18527/19560 | loss 3.217357 (-1.56z)| norm 0.2086 (+0.03z)| lr 6.11e-06 | 4168.92 ms | 32.4% bf16 MFU | 125280 tok/s step 18528/19560 | loss 3.254519 (-0.62z)| norm 0.2018 (-0.81z)| lr 6.10e-06 | 4171.80 ms | 32.4% bf16 MFU | 125300 tok/s step 18529/19560 | loss 3.308132 (+0.74z)| norm 0.2120 (+0.48z)| lr 6.09e-06 | 4215.39 ms | 32.0% bf16 MFU | 125254 tok/s step 
18530/19560 | loss 3.381983 (+2.54z)| norm 0.2247 (+2.02z)| lr 6.08e-06 | 4162.35 ms | 32.4% bf16 MFU | 125289 tok/s step 18531/19560 | loss 3.346152 (+1.62z)| norm 0.2496 (+4.62z)| lr 6.06e-06 | 4166.10 ms | 32.4% bf16 MFU | 125317 tok/s step 18532/19560 | loss 3.379222 (+2.37z)| norm 0.2443 (+3.80z)| lr 6.05e-06 | 4300.54 ms | 31.4% bf16 MFU | 125147 tok/s step 18533/19560 | loss 3.190229 (-2.12z)| norm 0.2172 (+0.88z)| lr 6.04e-06 | 4174.34 ms | 32.3% bf16 MFU | 125169 tok/s step 18534/19560 | loss 3.248341 (-0.75z)| norm 0.2049 (-0.44z)| lr 6.03e-06 | 4169.49 ms | 32.4% bf16 MFU | 125198 tok/s step 18535/19560 | loss 3.296965 (+0.41z)| norm 0.2165 (+0.80z)| lr 6.02e-06 | 4187.45 ms | 32.2% bf16 MFU | 125198 tok/s step 18536/19560 | loss 3.203193 (-1.82z)| norm 0.2060 (-0.33z)| lr 6.01e-06 | 4454.97 ms | 30.3% bf16 MFU | 124823 tok/s step 18537/19560 | loss 3.333938 (+1.27z)| norm 0.2107 (+0.17z)| lr 5.99e-06 | 4188.64 ms | 32.2% bf16 MFU | 124840 tok/s step 18538/19560 | loss 3.316514 (+0.85z)| norm 0.2745 (+5.93z)| lr 5.98e-06 | 4236.27 ms | 31.9% bf16 MFU | 124786 tok/s step 18539/19560 | loss 3.275108 (-0.14z)| norm 0.2065 (-0.28z)| lr 5.97e-06 | 4159.55 ms | 32.5% bf16 MFU | 124849 tok/s step 18540/19560 | loss 3.281352 (+0.00z)| norm 0.2114 (+0.18z)| lr 5.96e-06 | 4167.08 ms | 32.4% bf16 MFU | 124897 tok/s step 18541/19560 | loss 3.215687 (-1.53z)| norm 0.2202 (+0.97z)| lr 5.95e-06 | 4172.84 ms | 32.4% bf16 MFU | 124935 tok/s step 18542/19560 | loss 3.197940 (-1.91z)| norm 0.2127 (+0.28z)| lr 5.94e-06 | 4167.60 ms | 32.4% bf16 MFU | 124978 tok/s step 18543/19560 | loss 3.303741 (+0.55z)| norm 0.2131 (+0.31z)| lr 5.92e-06 | 4169.77 ms | 32.4% bf16 MFU | 125016 tok/s step 18544/19560 | loss 3.247301 (-0.75z)| norm 0.2207 (+0.99z)| lr 5.91e-06 | 4181.05 ms | 32.3% bf16 MFU | 125035 tok/s step 18545/19560 | loss 3.246984 (-0.75z)| norm 0.2078 (-0.19z)| lr 5.90e-06 | 4165.98 ms | 32.4% bf16 MFU | 125076 tok/s step 18546/19560 | loss 3.345450 (+1.50z)| norm 
0.2063 (-0.32z)| lr 5.89e-06 | 4178.34 ms | 32.3% bf16 MFU | 125096 tok/s step 18547/19560 | loss 3.280284 (+0.01z)| norm 0.2095 (-0.03z)| lr 5.88e-06 | 4170.45 ms | 32.4% bf16 MFU | 125127 tok/s step 18548/19560 | loss 3.372740 (+2.08z)| norm 0.2364 (+2.37z)| lr 5.87e-06 | 4208.14 ms | 32.1% bf16 MFU | 125100 tok/s step 18549/19560 | loss 3.287951 (+0.16z)| norm 0.2082 (-0.17z)| lr 5.85e-06 | 4175.00 ms | 32.3% bf16 MFU | 125124 tok/s step 18550/19560 | loss 3.280618 (-0.02z)| norm 0.1995 (-0.94z)| lr 5.84e-06 | 4210.01 ms | 32.1% bf16 MFU | 125094 tok/s step 18551/19560 | loss 3.280620 (-0.02z)| norm 0.2038 (-0.56z)| lr 5.83e-06 | 4159.46 ms | 32.5% bf16 MFU | 125142 tok/s step 18552/19560 | loss 3.307957 (+0.61z)| norm 0.2072 (-0.24z)| lr 5.82e-06 | 4228.96 ms | 31.9% bf16 MFU | 125083 tok/s step 18553/19560 | loss 3.208383 (-1.63z)| norm 0.2033 (-0.59z)| lr 5.81e-06 | 4176.42 ms | 32.3% bf16 MFU | 125106 tok/s step 18554/19560 | loss 3.234672 (-1.02z)| norm 0.2164 (+0.58z)| lr 5.80e-06 | 4161.51 ms | 32.4% bf16 MFU | 125150 tok/s step 18555/19560 | loss 3.254167 (-0.58z)| norm 0.2006 (-0.84z)| lr 5.79e-06 | 4174.84 ms | 32.3% bf16 MFU | 125172 tok/s step 18556/19560 | loss 3.254065 (-0.58z)| norm 0.2022 (-0.68z)| lr 5.77e-06 | 4184.87 ms | 32.3% bf16 MFU | 125177 tok/s step 18557/19560 | loss 3.255379 (-0.55z)| norm 0.2123 (+0.21z)| lr 5.76e-06 | 4156.15 ms | 32.5% bf16 MFU | 125226 tok/s step 18558/19560 | loss 3.292939 (+0.28z)| norm 0.2180 (+0.72z)| lr 5.75e-06 | 4169.74 ms | 32.4% bf16 MFU | 125251 tok/s step 18559/19560 | loss 3.294341 (+0.32z)| norm 0.2174 (+0.66z)| lr 5.74e-06 | 4160.55 ms | 32.5% bf16 MFU | 125289 tok/s step 18560/19560 | loss 3.298433 (+0.41z)| norm 0.2146 (+0.40z)| lr 5.73e-06 | 4160.14 ms | 32.5% bf16 MFU | 125326 tok/s step 18561/19560 | loss 3.239669 (-0.91z)| norm 0.2038 (-0.56z)| lr 5.72e-06 | 4159.71 ms | 32.5% bf16 MFU | 125362 tok/s step 18562/19560 | loss 3.289148 (+0.21z)| norm 0.2039 (-0.56z)| lr 5.71e-06 | 4175.90 ms | 
32.3% bf16 MFU | 125371 tok/s step 18563/19560 | loss 3.325228 (+1.01z)| norm 0.2261 (+1.41z)| lr 5.69e-06 | 4177.81 ms | 32.3% bf16 MFU | 125377 tok/s step 18564/19560 | loss 3.268456 (-0.26z)| norm 0.1998 (-0.92z)| lr 5.68e-06 | 4160.05 ms | 32.5% bf16 MFU | 125410 tok/s step 18565/19560 | loss 3.268296 (-0.26z)| norm 0.2063 (-0.34z)| lr 5.67e-06 | 4230.00 ms | 31.9% bf16 MFU | 125337 tok/s step 18566/19560 | loss 3.352154 (+1.67z)| norm 0.2048 (-0.47z)| lr 5.66e-06 | 4166.00 ms | 32.4% bf16 MFU | 125362 tok/s step 18567/19560 | loss 3.311935 (+0.74z)| norm 0.2106 (+0.04z)| lr 5.65e-06 | 4221.20 ms | 32.0% bf16 MFU | 125304 tok/s step 18568/19560 | loss 3.232657 (-1.07z)| norm 0.1989 (-1.00z)| lr 5.64e-06 | 4167.28 ms | 32.4% bf16 MFU | 125330 tok/s step 18569/19560 | loss 3.319729 (+0.91z)| norm 0.2041 (-0.53z)| lr 5.63e-06 | 4168.75 ms | 32.4% bf16 MFU | 125352 tok/s step 18570/19560 | loss 3.244504 (-0.79z)| norm 0.1989 (-0.99z)| lr 5.61e-06 | 4162.03 ms | 32.4% bf16 MFU | 125382 tok/s step 18571/19560 | loss 3.229331 (-1.12z)| norm 0.2010 (-0.79z)| lr 5.60e-06 | 4166.86 ms | 32.4% bf16 MFU | 125405 tok/s step 18572/19560 | loss 3.331512 (+1.18z)| norm 0.2135 (+0.31z)| lr 5.59e-06 | 4189.45 ms | 32.2% bf16 MFU | 125392 tok/s step 18573/19560 | loss 3.321590 (+0.95z)| norm 0.2099 (-0.02z)| lr 5.58e-06 | 4179.61 ms | 32.3% bf16 MFU | 125394 tok/s step 18574/19560 | loss 3.266735 (-0.28z)| norm 0.2077 (-0.20z)| lr 5.57e-06 | 4166.59 ms | 32.4% bf16 MFU | 125416 tok/s step 18575/19560 | loss 3.313835 (+0.77z)| norm 0.2057 (-0.38z)| lr 5.56e-06 | 4192.04 ms | 32.2% bf16 MFU | 125398 tok/s step 18576/19560 | loss 3.267078 (-0.28z)| norm 0.2143 (+0.38z)| lr 5.55e-06 | 4172.41 ms | 32.4% bf16 MFU | 125411 tok/s step 18577/19560 | loss 3.259494 (-0.46z)| norm 0.1963 (-1.21z)| lr 5.54e-06 | 4182.83 ms | 32.3% bf16 MFU | 125408 tok/s step 18578/19560 | loss 3.317465 (+0.84z)| norm 0.2148 (+0.44z)| lr 5.52e-06 | 4212.81 ms | 32.0% bf16 MFU | 125360 tok/s step 18579/19560 
| loss 3.244560 (-0.79z)| norm 0.2006 (-0.82z)| lr 5.51e-06 | 4207.10 ms | 32.1% bf16 MFU | 125323 tok/s step 18580/19560 | loss 3.233667 (-1.03z)| norm 0.2062 (-0.32z)| lr 5.50e-06 | 4169.65 ms | 32.4% bf16 MFU | 125344 tok/s step 18581/19560 | loss 3.272719 (-0.14z)| norm 0.2015 (-0.73z)| lr 5.49e-06 | 4174.01 ms | 32.3% bf16 MFU | 125357 tok/s step 18582/19560 | loss 3.307804 (+0.67z)| norm 0.2030 (-0.59z)| lr 5.48e-06 | 4185.47 ms | 32.3% bf16 MFU | 125352 tok/s step 18583/19560 | loss 3.301299 (+0.52z)| norm 0.1989 (-0.95z)| lr 5.47e-06 | 4171.83 ms | 32.4% bf16 MFU | 125368 tok/s step 18584/19560 | loss 3.310178 (+0.71z)| norm 0.2054 (-0.38z)| lr 5.46e-06 | 4166.93 ms | 32.4% bf16 MFU | 125391 tok/s step 18585/19560 | loss 3.235631 (-0.99z)| norm 0.2042 (-0.48z)| lr 5.45e-06 | 4331.36 ms | 31.2% bf16 MFU | 125174 tok/s step 18586/19560 | loss 3.326788 (+1.08z)| norm 0.2079 (-0.16z)| lr 5.43e-06 | 4167.15 ms | 32.4% bf16 MFU | 125206 tok/s step 18587/19560 | loss 3.232894 (-1.04z)| norm 0.2089 (-0.05z)| lr 5.42e-06 | 4333.59 ms | 31.2% bf16 MFU | 124995 tok/s step 18588/19560 | loss 3.290847 (+0.29z)| norm 0.2017 (-0.70z)| lr 5.41e-06 | 4440.89 ms | 30.4% bf16 MFU | 124648 tok/s step 18589/19560 | loss 3.240030 (-0.87z)| norm 0.1941 (-1.37z)| lr 5.40e-06 | 4191.25 ms | 32.2% bf16 MFU | 124670 tok/s step 18590/19560 | loss 3.312346 (+0.77z)| norm 0.2350 (+2.24z)| lr 5.39e-06 | 4172.98 ms | 32.4% bf16 MFU | 124718 tok/s step 18591/19560 | loss 3.283864 (+0.13z)| norm 0.1973 (-1.07z)| lr 5.38e-06 | 4169.44 ms | 32.4% bf16 MFU | 124770 tok/s step 18592/19560 | loss 3.239249 (-0.88z)| norm 0.2032 (-0.56z)| lr 5.37e-06 | 4176.00 ms | 32.3% bf16 MFU | 124809 tok/s step 18593/19560 | loss 3.257220 (-0.46z)| norm 0.2077 (-0.16z)| lr 5.36e-06 | 4525.87 ms | 29.8% bf16 MFU | 124360 tok/s step 18594/19560 | loss 3.245160 (-0.74z)| norm 0.1991 (-0.90z)| lr 5.35e-06 | 4296.03 ms | 31.4% bf16 MFU | 124244 tok/s step 18595/19560 | loss 3.284008 (+0.16z)| norm 0.2024 (-0.63z)| 
lr 5.34e-06 | 4165.79 ms | 32.4% bf16 MFU | 124325 tok/s step 18596/19560 | loss 3.292389 (+0.36z)| norm 0.1994 (-0.88z)| lr 5.32e-06 | 4231.74 ms | 31.9% bf16 MFU | 124303 tok/s step 18597/19560 | loss 3.250617 (-0.61z)| norm 0.1985 (-0.96z)| lr 5.31e-06 | 4250.89 ms | 31.8% bf16 MFU | 124255 tok/s step 18598/19560 | loss 3.269080 (-0.18z)| norm 0.2100 (+0.05z)| lr 5.30e-06 | 4212.48 ms | 32.1% bf16 MFU | 124265 tok/s step 18599/19560 | loss 3.302011 (+0.57z)| norm 0.2026 (-0.60z)| lr 5.29e-06 | 4828.82 ms | 28.0% bf16 MFU | 123481 tok/s step 18600/19560 | loss 3.238439 (-0.91z)| norm 0.2065 (-0.25z)| lr 5.28e-06 | 5887.74 ms | 22.9% bf16 MFU | 121759 tok/s step 18601/19560 | loss 3.276516 (-0.01z)| norm 0.2212 (+1.03z)| lr 5.27e-06 | 4567.46 ms | 29.6% bf16 MFU | 121410 tok/s step 18602/19560 | loss 3.285778 (+0.20z)| norm 0.2018 (-0.66z)| lr 5.26e-06 | 4401.23 ms | 30.7% bf16 MFU | 121296 tok/s step 18603/19560 | loss 3.264067 (-0.31z)| norm 0.2078 (-0.14z)| lr 5.25e-06 | 5030.07 ms | 26.8% bf16 MFU | 120443 tok/s step 18604/19560 | loss 3.329040 (+1.19z)| norm 0.2089 (-0.04z)| lr 5.24e-06 | 7321.48 ms | 18.4% bf16 MFU | 118001 tok/s step 18605/19560 | loss 3.197144 (-1.85z)| norm 0.2100 (+0.04z)| lr 5.23e-06 | 4800.87 ms | 28.1% bf16 MFU | 117561 tok/s step 18606/19560 | loss 3.286191 (+0.20z)| norm 0.2020 (-0.65z)| lr 5.21e-06 | 4850.97 ms | 27.8% bf16 MFU | 117087 tok/s step 18607/19560 | loss 3.302699 (+0.57z)| norm 0.2029 (-0.58z)| lr 5.20e-06 | 5213.04 ms | 25.9% bf16 MFU | 116262 tok/s step 18608/19560 | loss 3.303084 (+0.57z)| norm 0.2062 (-0.29z)| lr 5.19e-06 | 5029.87 ms | 26.8% bf16 MFU | 115660 tok/s step 18609/19560 | loss 3.277316 (-0.01z)| norm 0.2105 (+0.09z)| lr 5.18e-06 | 5045.56 ms | 26.8% bf16 MFU | 115073 tok/s step 18610/19560 | loss 3.252530 (-0.58z)| norm 0.2191 (+0.84z)| lr 5.17e-06 | 5003.35 ms | 27.0% bf16 MFU | 114559 tok/s step 18611/19560 | loss 3.301025 (+0.54z)| norm 0.2212 (+1.01z)| lr 5.16e-06 | 4493.37 ms | 30.0% bf16 MFU | 
114665 tok/s step 18612/19560 | loss 3.299467 (+0.49z)| norm 0.2019 (-0.68z)| lr 5.15e-06 | 4806.77 ms | 28.1% bf16 MFU | 114385 tok/s step 18613/19560 | loss 3.242771 (-0.83z)| norm 0.2021 (-0.66z)| lr 5.14e-06 | 4154.29 ms | 32.5% bf16 MFU | 114976 tok/s step 18614/19560 | loss 3.224699 (-1.29z)| norm 0.1990 (-0.93z)| lr 5.13e-06 | 4624.36 ms | 29.2% bf16 MFU | 114896 tok/s step 18615/19560 | loss 3.256285 (-0.50z)| norm 0.1989 (-0.93z)| lr 5.12e-06 | 4394.58 ms | 30.7% bf16 MFU | 115116 tok/s step 18616/19560 | loss 3.281846 (+0.14z)| norm 0.2076 (-0.14z)| lr 5.11e-06 | 4311.23 ms | 31.3% bf16 MFU | 115441 tok/s step 18617/19560 | loss 3.271726 (-0.11z)| norm 0.2036 (-0.50z)| lr 5.10e-06 | 4691.73 ms | 28.8% bf16 MFU | 115256 tok/s step 18618/19560 | loss 3.209533 (-1.65z)| norm 0.2025 (-0.59z)| lr 5.08e-06 | 4167.84 ms | 32.4% bf16 MFU | 115783 tok/s step 18619/19560 | loss 3.281993 (+0.13z)| norm 0.1941 (-1.33z)| lr 5.07e-06 | 4164.24 ms | 32.4% bf16 MFU | 116289 tok/s step 18620/19560 | loss 3.294664 (+0.45z)| norm 0.2024 (-0.57z)| lr 5.06e-06 | 4461.23 ms | 30.3% bf16 MFU | 116351 tok/s step 18621/19560 | loss 3.269114 (-0.19z)| norm 0.2007 (-0.72z)| lr 5.05e-06 | 4483.13 ms | 30.1% bf16 MFU | 116381 tok/s step 18622/19560 | loss 3.285348 (+0.24z)| norm 0.2047 (-0.35z)| lr 5.04e-06 | 4368.61 ms | 30.9% bf16 MFU | 116562 tok/s step 18623/19560 | loss 3.270499 (-0.15z)| norm 0.2080 (-0.04z)| lr 5.03e-06 | 4179.64 ms | 32.3% bf16 MFU | 117006 tok/s step 18624/19560 | loss 3.258250 (-0.46z)| norm 0.2028 (-0.52z)| lr 5.02e-06 | 4345.39 ms | 31.1% bf16 MFU | 117188 tok/s step 18625/19560 | loss 3.242021 (-0.88z)| norm 0.2085 (+0.00z)| lr 5.01e-06 | 4262.74 ms | 31.7% bf16 MFU | 117479 tok/s step 18626/19560 | loss 3.300878 (+0.67z)| norm 0.2063 (-0.19z)| lr 5.00e-06 | 4282.31 ms | 31.5% bf16 MFU | 117726 tok/s step 18627/19560 | loss 3.360342 (+2.19z)| norm 0.2082 (-0.02z)| lr 4.99e-06 | 4194.96 ms | 32.2% bf16 MFU | 118089 tok/s step 18628/19560 | loss 3.285070 
(+0.22z)| norm 0.2054 (-0.29z)| lr 4.98e-06 | 4193.05 ms | 32.2% bf16 MFU | 118436 tok/s step 18629/19560 | loss 3.269144 (-0.19z)| norm 0.2153 (+0.62z)| lr 4.97e-06 | 4298.95 ms | 31.4% bf16 MFU | 118612 tok/s step 18630/19560 | loss 3.297019 (+0.53z)| norm 0.2064 (-0.20z)| lr 4.96e-06 | 4238.79 ms | 31.9% bf16 MFU | 118866 tok/s step 18631/19560 | loss 3.294101 (+0.45z)| norm 0.2179 (+0.85z)| lr 4.95e-06 | 4185.83 ms | 32.3% bf16 MFU | 119186 tok/s step 18632/19560 | loss 3.263274 (-0.36z)| norm 0.2093 (+0.05z)| lr 4.94e-06 | 4290.93 ms | 31.5% bf16 MFU | 119336 tok/s step 18633/19560 | loss 3.230293 (-1.20z)| norm 0.2044 (-0.40z)| lr 4.92e-06 | 4177.04 ms | 32.3% bf16 MFU | 119645 tok/s step 18634/19560 | loss 3.262305 (-0.36z)| norm 0.2211 (+1.12z)| lr 4.91e-06 | 4265.95 ms | 31.7% bf16 MFU | 119807 tok/s step 18635/19560 | loss 3.205832 (-1.80z)| norm 0.2076 (-0.12z)| lr 4.90e-06 | 4289.65 ms | 31.5% bf16 MFU | 119928 tok/s step 18636/19560 | loss 3.299695 (+0.62z)| norm 0.2071 (-0.17z)| lr 4.89e-06 | 4174.08 ms | 32.3% bf16 MFU | 120212 tok/s step 18637/19560 | loss 3.228872 (-1.20z)| norm 0.2041 (-0.45z)| lr 4.88e-06 | 4181.14 ms | 32.3% bf16 MFU | 120471 tok/s step 18638/19560 | loss 3.204639 (-1.79z)| norm 0.2045 (-0.42z)| lr 4.87e-06 | 4230.74 ms | 31.9% bf16 MFU | 120644 tok/s step 18639/19560 | loss 3.196028 (-1.96z)| norm 0.1980 (-1.00z)| lr 4.86e-06 | 4350.17 ms | 31.0% bf16 MFU | 120638 tok/s step 18640/19560 | loss 3.209077 (-1.63z)| norm 0.1979 (-1.01z)| lr 4.85e-06 | 4250.72 ms | 31.8% bf16 MFU | 120773 tok/s step 18641/19560 | loss 3.254074 (-0.50z)| norm 0.2061 (-0.25z)| lr 4.84e-06 | 4192.94 ms | 32.2% bf16 MFU | 120986 tok/s step 18642/19560 | loss 3.270738 (-0.08z)| norm 0.2027 (-0.57z)| lr 4.83e-06 | 4203.49 ms | 32.1% bf16 MFU | 121173 tok/s step 18643/19560 | loss 3.293638 (+0.48z)| norm 0.2136 (+0.42z)| lr 4.82e-06 | 4197.89 ms | 32.2% bf16 MFU | 121359 tok/s step 18644/19560 | loss 3.306004 (+0.78z)| norm 0.2067 (-0.21z)| lr 4.81e-06 | 
4301.73 ms | 31.4% bf16 MFU | 121385 tok/s step 18645/19560 | loss 3.272800 (-0.04z)| norm 0.2119 (+0.27z)| lr 4.80e-06 | 4171.98 ms | 32.4% bf16 MFU | 121599 tok/s step 18646/19560 | loss 3.216785 (-1.42z)| norm 0.1973 (-1.07z)| lr 4.79e-06 | 4199.89 ms | 32.1% bf16 MFU | 121761 tok/s step 18647/19560 | loss 3.214142 (-1.46z)| norm 0.1992 (-0.88z)| lr 4.78e-06 | 4280.24 ms | 31.5% bf16 MFU | 121797 tok/s step 18648/19560 | loss 3.260482 (-0.33z)| norm 0.2025 (-0.56z)| lr 4.77e-06 | 4221.96 ms | 32.0% bf16 MFU | 121917 tok/s step 18649/19560 | loss 3.240473 (-0.82z)| norm 0.2030 (-0.52z)| lr 4.76e-06 | 4171.84 ms | 32.4% bf16 MFU | 122104 tok/s step 18650/19560 | loss 3.304900 (+0.78z)| norm 0.2052 (-0.31z)| lr 4.75e-06 | 4266.55 ms | 31.6% bf16 MFU | 122143 tok/s step 18651/19560 | loss 3.244813 (-0.70z)| norm 0.2125 (+0.38z)| lr 4.74e-06 | 4173.21 ms | 32.4% bf16 MFU | 122318 tok/s step 18652/19560 | loss 3.244171 (-0.73z)| norm 0.1990 (-0.89z)| lr 4.73e-06 | 4208.74 ms | 32.1% bf16 MFU | 122431 tok/s step 18653/19560 | loss 3.239815 (-0.83z)| norm 0.2072 (-0.11z)| lr 4.72e-06 | 4181.71 ms | 32.3% bf16 MFU | 122578 tok/s step 18654/19560 | loss 3.226886 (-1.13z)| norm 0.2055 (-0.26z)| lr 4.70e-06 | 4265.56 ms | 31.7% bf16 MFU | 122595 tok/s step 18655/19560 | loss 3.297771 (+0.62z)| norm 0.2016 (-0.63z)| lr 4.69e-06 | 4188.93 ms | 32.2% bf16 MFU | 122723 tok/s step 18656/19560 | loss 3.280306 (+0.17z)| norm 0.2032 (-0.48z)| lr 4.68e-06 | 4226.01 ms | 31.9% bf16 MFU | 122790 tok/s step 18657/19560 | loss 3.292909 (+0.49z)| norm 0.2059 (-0.22z)| lr 4.67e-06 | 4194.90 ms | 32.2% bf16 MFU | 122899 tok/s step 18658/19560 | loss 3.269857 (-0.06z)| norm 0.1997 (-0.79z)| lr 4.66e-06 | 4258.67 ms | 31.7% bf16 MFU | 122910 tok/s step 18659/19560 | loss 3.274870 (+0.08z)| norm 0.1958 (-1.18z)| lr 4.65e-06 | 4227.64 ms | 31.9% bf16 MFU | 122965 tok/s step 18660/19560 | loss 3.256711 (-0.38z)| norm 0.2098 (+0.26z)| lr 4.64e-06 | 4178.87 ms | 32.3% bf16 MFU | 123090 tok/s step 
18661/19560 | loss 3.302432 (+0.84z)| norm 0.2046 (-0.29z)| lr 4.63e-06 | 4178.38 ms | 32.3% bf16 MFU | 123209 tok/s step 18662/19560 | loss 3.293698 (+0.59z)| norm 0.2074 (+0.01z)| lr 4.62e-06 | 4176.09 ms | 32.3% bf16 MFU | 123326 tok/s step 18663/19560 | loss 3.308976 (+1.01z)| norm 0.2056 (-0.17z)| lr 4.61e-06 | 4189.55 ms | 32.2% bf16 MFU | 123417 tok/s step 18664/19560 | loss 3.300900 (+0.78z)| norm 0.2050 (-0.23z)| lr 4.60e-06 | 4261.45 ms | 31.7% bf16 MFU | 123398 tok/s step 18665/19560 | loss 3.186850 (-2.33z)| norm 0.2089 (+0.19z)| lr 4.59e-06 | 4226.46 ms | 31.9% bf16 MFU | 123430 tok/s step 18666/19560 | loss 3.273416 (+0.05z)| norm 0.2120 (+0.73z)| lr 4.58e-06 | 4189.34 ms | 32.2% bf16 MFU | 123516 tok/s step 18667/19560 | loss 3.224782 (-1.27z)| norm 0.2076 (+0.12z)| lr 4.57e-06 | 4195.53 ms | 32.2% bf16 MFU | 123588 tok/s step 18668/19560 | loss 3.245208 (-0.70z)| norm 0.1968 (-1.36z)| lr 4.56e-06 | 4260.00 ms | 31.7% bf16 MFU | 123563 tok/s step 18669/19560 | loss 3.245104 (-0.71z)| norm 0.1991 (-1.03z)| lr 4.55e-06 | 4186.27 ms | 32.3% bf16 MFU | 123646 tok/s step 18670/19560 | loss 3.276588 (+0.14z)| norm 0.2392 (+4.23z)| lr 4.54e-06 | 4179.04 ms | 32.3% bf16 MFU | 123737 tok/s step 18671/19560 | loss 3.262609 (-0.24z)| norm 0.2002 (-0.82z)| lr 4.53e-06 | 4256.10 ms | 31.7% bf16 MFU | 123709 tok/s step 18672/19560 | loss 3.266485 (-0.14z)| norm 0.2072 (+0.10z)| lr 4.52e-06 | 4490.36 ms | 30.1% bf16 MFU | 123362 tok/s step 18673/19560 | loss 3.282728 (+0.31z)| norm 0.2105 (+0.53z)| lr 4.51e-06 | 4187.35 ms | 32.2% bf16 MFU | 123454 tok/s step 18674/19560 | loss 3.225578 (-1.29z)| norm 0.2217 (+1.96z)| lr 4.50e-06 | 4173.98 ms | 32.3% bf16 MFU | 123562 tok/s step 18675/19560 | loss 3.217340 (-1.50z)| norm 0.1993 (-0.93z)| lr 4.49e-06 | 4187.05 ms | 32.2% bf16 MFU | 123645 tok/s step 18676/19560 | loss 3.281890 (+0.36z)| norm 0.2156 (+1.26z)| lr 4.48e-06 | 4180.91 ms | 32.3% bf16 MFU | 123732 tok/s step 18677/19560 | loss 3.210570 (-1.69z)| norm 
0.1996 (-0.91z)| lr 4.47e-06 | 4184.46 ms | 32.3% bf16 MFU | 123810 tok/s step 18678/19560 | loss 3.235447 (-0.96z)| norm 0.2035 (-0.38z)| lr 4.46e-06 | 4185.73 ms | 32.3% bf16 MFU | 123883 tok/s step 18679/19560 | loss 3.250546 (-0.51z)| norm 0.2029 (-0.46z)| lr 4.45e-06 | 4423.74 ms | 30.5% bf16 MFU | 123614 tok/s step 18680/19560 | loss 3.321431 (+1.52z)| norm 0.2027 (-0.48z)| lr 4.44e-06 | 4207.99 ms | 32.1% bf16 MFU | 123663 tok/s step 18681/19560 | loss 3.266216 (-0.08z)| norm 0.2074 (+0.16z)| lr 4.43e-06 | 4247.59 ms | 31.8% bf16 MFU | 123652 tok/s step 18682/19560 | loss 3.244774 (-0.71z)| norm 0.1995 (-0.92z)| lr 4.42e-06 | 4358.02 ms | 31.0% bf16 MFU | 123484 tok/s step 18683/19560 | loss 3.225434 (-1.26z)| norm 0.1927 (-1.81z)| lr 4.41e-06 | 4240.40 ms | 31.8% bf16 MFU | 123492 tok/s step 18684/19560 | loss 3.277058 (+0.23z)| norm 0.2151 (+1.20z)| lr 4.40e-06 | 4200.75 ms | 32.1% bf16 MFU | 123558 tok/s step 18685/19560 | loss 3.207674 (-1.75z)| norm 0.2007 (-0.73z)| lr 4.39e-06 | 4282.85 ms | 31.5% bf16 MFU | 123501 tok/s step 18686/19560 | loss 3.205710 (-1.76z)| norm 0.2012 (-0.65z)| lr 4.38e-06 | 4188.50 ms | 32.2% bf16 MFU | 123585 tok/s step 18687/19560 | loss 3.290129 (+0.63z)| norm 0.2043 (-0.21z)| lr 4.37e-06 | 4175.28 ms | 32.3% bf16 MFU | 123684 tok/s step 18688/19560 | loss 3.281558 (+0.39z)| norm 0.2042 (-0.21z)| lr 4.36e-06 | 4186.54 ms | 32.3% bf16 MFU | 123761 tok/s step 18689/19560 | loss 3.237376 (-0.86z)| norm 0.1999 (-0.81z)| lr 4.35e-06 | 12588.02 ms | 10.7% bf16 MFU | 119656 tok/s step 18690/19560 | loss 3.332315 (+1.80z)| norm 0.2064 (+0.09z)| lr 4.34e-06 | 4294.09 ms | 31.4% bf16 MFU | 119778 tok/s step 18691/19560 | loss 3.286479 (+0.53z)| norm 0.2095 (+0.55z)| lr 4.33e-06 | 4150.12 ms | 32.5% bf16 MFU | 120105 tok/s step 18692/19560 | loss 3.232130 (-1.00z)| norm 0.1997 (-0.85z)| lr 4.32e-06 | 4161.08 ms | 32.4% bf16 MFU | 120400 tok/s step 18693/19560 | loss 3.235791 (-0.89z)| norm 0.2033 (-0.33z)| lr 4.31e-06 | 4220.80 ms | 
32.0% bf16 MFU | 120591 tok/s step 18694/19560 | loss 3.257328 (-0.26z)| norm 0.2058 (+0.03z)| lr 4.30e-06 | 4160.52 ms | 32.5% bf16 MFU | 120862 tok/s step 18695/19560 | loss 3.292332 (+0.75z)| norm 0.2075 (+0.27z)| lr 4.29e-06 | 4195.72 ms | 32.2% bf16 MFU | 121067 tok/s step 18696/19560 | loss 3.337588 (+2.01z)| norm 0.2260 (+2.80z)| lr 4.28e-06 | 4166.44 ms | 32.4% bf16 MFU | 121305 tok/s step 18697/19560 | loss 3.182744 (-2.35z)| norm 0.2023 (-0.48z)| lr 4.27e-06 | 4166.45 ms | 32.4% bf16 MFU | 121532 tok/s step 18698/19560 | loss 3.299026 (+0.91z)| norm 0.2180 (+1.66z)| lr 4.26e-06 | 4171.89 ms | 32.4% bf16 MFU | 121739 tok/s step 18699/19560 | loss 3.271045 (+0.12z)| norm 0.2146 (+1.18z)| lr 4.25e-06 | 4176.57 ms | 32.3% bf16 MFU | 121928 tok/s step 18700/19560 | loss 3.314026 (+1.35z)| norm 0.2130 (+0.95z)| lr 4.24e-06 | 4181.18 ms | 32.3% bf16 MFU | 122101 tok/s step 18701/19560 | loss 3.225365 (-1.16z)| norm 0.2024 (-0.49z)| lr 4.23e-06 | 4189.78 ms | 32.2% bf16 MFU | 122253 tok/s step 18702/19560 | loss 3.269142 (+0.09z)| norm 0.2031 (-0.39z)| lr 4.22e-06 | 4235.44 ms | 31.9% bf16 MFU | 122330 tok/s step 18703/19560 | loss 3.259166 (-0.18z)| norm 0.2013 (-0.63z)| lr 4.21e-06 | 4193.16 ms | 32.2% bf16 MFU | 122465 tok/s step 18704/19560 | loss 3.262562 (-0.08z)| norm 0.2022 (-0.49z)| lr 4.20e-06 | 4226.63 ms | 31.9% bf16 MFU | 122544 tok/s step 18705/19560 | loss 3.283463 (+0.51z)| norm 0.2003 (-0.77z)| lr 4.19e-06 | 4266.41 ms | 31.6% bf16 MFU | 122561 tok/s step 18706/19560 | loss 3.237804 (-0.79z)| norm 0.2015 (-0.59z)| lr 4.18e-06 | 4195.20 ms | 32.2% bf16 MFU | 122682 tok/s step 18707/19560 | loss 3.273085 (+0.23z)| norm 0.2040 (-0.24z)| lr 4.17e-06 | 4179.35 ms | 32.3% bf16 MFU | 122820 tok/s step 18708/19560 | loss 3.215372 (-1.44z)| norm 0.2010 (-0.66z)| lr 4.16e-06 | 4256.64 ms | 31.7% bf16 MFU | 122837 tok/s step 18709/19560 | loss 3.231456 (-0.96z)| norm 0.2064 (+0.08z)| lr 4.15e-06 | 4174.41 ms | 32.3% bf16 MFU | 122975 tok/s step 18710/19560 
| loss 3.273008 (+0.24z)| norm 0.2028 (-0.41z)| lr 4.14e-06 | 4302.99 ms | 31.4% bf16 MFU | 122919 tok/s step 18711/19560 | loss 3.219906 (-1.27z)| norm 0.2011 (-0.66z)| lr 4.13e-06 | 4238.18 ms | 31.9% bf16 MFU | 122958 tok/s step 18712/19560 | loss 3.240677 (-0.66z)| norm 0.2033 (-0.35z)| lr 4.12e-06 | 4175.77 ms | 32.3% bf16 MFU | 123088 tok/s step 18713/19560 | loss 3.248186 (-0.44z)| norm 0.1998 (-0.82z)| lr 4.11e-06 | 4324.19 ms | 31.2% bf16 MFU | 122996 tok/s step 18714/19560 | loss 3.240672 (-0.65z)| norm 0.2000 (-0.79z)| lr 4.10e-06 | 4297.02 ms | 31.4% bf16 MFU | 122947 tok/s step 18715/19560 | loss 3.218351 (-1.30z)| norm 0.2019 (-0.51z)| lr 4.09e-06 | 4202.91 ms | 32.1% bf16 MFU | 123036 tok/s step 18716/19560 | loss 3.238457 (-0.70z)| norm 0.2035 (-0.30z)| lr 4.08e-06 | 4214.28 ms | 32.0% bf16 MFU | 123105 tok/s step 18717/19560 | loss 3.295777 (+0.97z)| norm 0.2068 (+0.15z)| lr 4.07e-06 | 4193.91 ms | 32.2% bf16 MFU | 123200 tok/s step 18718/19560 | loss 3.281796 (+0.57z)| norm 0.2052 (-0.05z)| lr 4.07e-06 | 4221.23 ms | 32.0% bf16 MFU | 123250 tok/s step 18719/19560 | loss 3.281780 (+0.57z)| norm 0.1992 (-0.95z)| lr 4.06e-06 | 4193.13 ms | 32.2% bf16 MFU | 123340 tok/s step 18720/19560 | loss 3.315514 (+1.53z)| norm 0.2131 (+1.13z)| lr 4.05e-06 | 4188.32 ms | 32.2% bf16 MFU | 123432 tok/s step 18721/19560 | loss 3.293664 (+0.88z)| norm 0.2019 (-0.55z)| lr 4.04e-06 | 4194.34 ms | 32.2% bf16 MFU | 123510 tok/s step 18722/19560 | loss 3.297742 (+0.99z)| norm 0.2034 (-0.33z)| lr 4.03e-06 | 4314.14 ms | 31.3% bf16 MFU | 123411 tok/s step 18723/19560 | loss 3.302612 (+1.12z)| norm 0.2015 (-0.62z)| lr 4.02e-06 | 4200.92 ms | 32.1% bf16 MFU | 123481 tok/s step 18724/19560 | loss 3.220671 (-1.23z)| norm 0.2054 (-0.04z)| lr 4.01e-06 | 4192.41 ms | 32.2% bf16 MFU | 123559 tok/s step 18725/19560 | loss 3.271358 (+0.23z)| norm 0.1938 (-1.76z)| lr 4.00e-06 | 4201.64 ms | 32.1% bf16 MFU | 123620 tok/s step 18726/19560 | loss 3.245905 (-0.50z)| norm 0.2031 (-0.36z)| 
lr 3.99e-06 | 4210.96 ms | 32.1% bf16 MFU | 123665 tok/s step 18727/19560 | loss 3.261132 (-0.05z)| norm 0.1978 (-1.14z)| lr 3.98e-06 | 4216.09 ms | 32.0% bf16 MFU | 123699 tok/s step 18728/19560 | loss 3.354180 (+2.55z)| norm 0.2140 (+1.26z)| lr 3.97e-06 | 4175.84 ms | 32.3% bf16 MFU | 123792 tok/s step 18729/19560 | loss 3.255248 (-0.24z)| norm 0.2039 (-0.23z)| lr 3.96e-06 | 4170.81 ms | 32.4% bf16 MFU | 123887 tok/s step 18730/19560 | loss 3.277794 (+0.40z)| norm 0.2007 (-0.71z)| lr 3.95e-06 | 4175.64 ms | 32.3% bf16 MFU | 123971 tok/s step 18731/19560 | loss 3.326632 (+1.75z)| norm 0.2101 (+0.70z)| lr 3.94e-06 | 4186.92 ms | 32.2% bf16 MFU | 124033 tok/s step 18732/19560 | loss 3.258015 (-0.16z)| norm 0.2005 (-0.74z)| lr 3.93e-06 | 4204.00 ms | 32.1% bf16 MFU | 124067 tok/s step 18733/19560 | loss 3.256793 (-0.21z)| norm 0.1998 (-0.83z)| lr 3.92e-06 | 4261.40 ms | 31.7% bf16 MFU | 124016 tok/s step 18734/19560 | loss 3.291330 (+0.78z)| norm 0.2106 (+0.79z)| lr 3.91e-06 | 4269.07 ms | 31.6% bf16 MFU | 123955 tok/s step 18735/19560 | loss 3.217262 (-1.32z)| norm 0.1970 (-1.24z)| lr 3.90e-06 | 4320.43 ms | 31.3% bf16 MFU | 123825 tok/s step 18736/19560 | loss 3.278265 (+0.43z)| norm 0.2037 (-0.23z)| lr 3.89e-06 | 4244.35 ms | 31.8% bf16 MFU | 123810 tok/s step 18737/19560 | loss 3.259429 (-0.11z)| norm 0.2033 (-0.29z)| lr 3.88e-06 | 4192.40 ms | 32.2% bf16 MFU | 123873 tok/s step 18738/19560 | loss 3.180026 (-2.33z)| norm 0.2034 (-0.26z)| lr 3.87e-06 | 4185.36 ms | 32.3% bf16 MFU | 123942 tok/s step 18739/19560 | loss 3.300648 (+1.07z)| norm 0.2041 (-0.14z)| lr 3.87e-06 | 4193.92 ms | 32.2% bf16 MFU | 123996 tok/s step 18740/19560 | loss 3.244374 (-0.50z)| norm 0.1923 (-1.94z)| lr 3.86e-06 | 4178.95 ms | 32.3% bf16 MFU | 124069 tok/s step 18741/19560 | loss 3.198843 (-1.76z)| norm 0.2104 (+0.83z)| lr 3.85e-06 | 4213.69 ms | 32.0% bf16 MFU | 124087 tok/s step 18742/19560 | loss 3.251828 (-0.29z)| norm 0.2313 (+3.79z)| lr 3.84e-06 | 4229.59 ms | 31.9% bf16 MFU | 
124080 tok/s step 18743/19560 | loss 3.295720 (+0.94z)| norm 0.2034 (-0.27z)| lr 3.83e-06 | 4177.40 ms | 32.3% bf16 MFU | 124151 tok/s step 18744/19560 | loss 3.279689 (+0.49z)| norm 0.2175 (+1.75z)| lr 3.82e-06 | 4184.94 ms | 32.3% bf16 MFU | 124208 tok/s step 18745/19560 | loss 3.254333 (-0.22z)| norm 0.2060 (+0.09z)| lr 3.81e-06 | 4242.47 ms | 31.8% bf16 MFU | 124177 tok/s step 18746/19560 | loss 3.258988 (-0.10z)| norm 0.2030 (-0.34z)| lr 3.80e-06 | 4189.55 ms | 32.2% bf16 MFU | 124225 tok/s step 18747/19560 | loss 3.258199 (-0.12z)| norm 0.2045 (-0.13z)| lr 3.79e-06 | 4197.02 ms | 32.2% bf16 MFU | 124260 tok/s step 18748/19560 | loss 3.182001 (-2.21z)| norm 0.1960 (-1.36z)| lr 3.78e-06 | 4176.02 ms | 32.3% bf16 MFU | 124324 tok/s step 18749/19560 | loss 3.284997 (+0.65z)| norm 0.2046 (-0.12z)| lr 3.77e-06 | 4201.86 ms | 32.1% bf16 MFU | 124346 tok/s step 18750/19560 | loss 3.257557 (-0.11z)| norm 0.2066 (+0.17z)| lr 3.76e-06 | 4176.88 ms | 32.3% bf16 MFU | 124405 tok/s val loss 3.251805
HellaSwag: 3052/10042 = 0.303924 step 18751/19560 | loss 3.278942 (+0.49z)| norm 0.2044 (-0.14z)| lr 3.75e-06 | 4476.41 ms | 30.2% bf16 MFU | 124041 tok/s step 18752/19560 | loss 3.285737 (+0.67z)| norm 0.2106 (+0.74z)| lr 3.74e-06 | 4516.24 ms | 29.9% bf16 MFU | 123643 tok/s step 18753/19560 | loss 3.239479 (-0.61z)| norm 0.2047 (-0.11z)| lr 3.74e-06 | 4462.35 ms | 30.3% bf16 MFU | 123336 tok/s step 18754/19560 | loss 3.284317 (+0.64z)| norm 0.2036 (-0.26z)| lr 3.73e-06 | 4326.54 ms | 31.2% bf16 MFU | 123228 tok/s step 18755/19560 | loss 3.303091 (+1.20z)| norm 0.1992 (-0.89z)| lr 3.72e-06 | 4252.85 ms | 31.7% bf16 MFU | 123231 tok/s step 18756/19560 | loss 3.272767 (+0.34z)| norm 0.2023 (-0.44z)| lr 3.71e-06 | 4217.50 ms | 32.0% bf16 MFU | 123285 tok/s step 18757/19560 | loss 3.217000 (-1.24z)| norm 0.2164 (+1.59z)| lr 3.70e-06 | 4243.01 ms | 31.8% bf16 MFU | 123299 tok/s step 18758/19560 | loss 3.315350 (+1.55z)| norm 0.2207 (+2.16z)| lr 3.69e-06 | 4226.33 ms | 31.9% bf16 MFU | 123336 tok/s step 18759/19560 | loss 3.325551 (+1.81z)| norm 0.2310 
(+3.47z)| lr 3.68e-06 | 4182.31 ms | 32.3% bf16 MFU | 123438 tok/s step 18760/19560 | loss 3.222220 (-1.07z)| norm 0.2105 (+0.67z)| lr 3.67e-06 | 4200.63 ms | 32.1% bf16 MFU | 123506 tok/s step 18761/19560 | loss 3.201423 (-1.63z)| norm 0.2075 (+0.26z)| lr 3.66e-06 | 4304.84 ms | 31.4% bf16 MFU | 123420 tok/s step 18762/19560 | loss 3.233640 (-0.73z)| norm 0.1956 (-1.35z)| lr 3.65e-06 | 4309.90 ms | 31.3% bf16 MFU | 123332 tok/s step 18763/19560 | loss 3.272775 (+0.34z)| norm 0.1997 (-0.77z)| lr 3.64e-06 | 4182.81 ms | 32.3% bf16 MFU | 123432 tok/s step 18764/19560 | loss 3.212533 (-1.32z)| norm 0.1910 (-1.92z)| lr 3.63e-06 | 4403.57 ms | 30.7% bf16 MFU | 123214 tok/s step 18765/19560 | loss 3.221054 (-1.08z)| norm 0.2023 (-0.39z)| lr 3.63e-06 | 4341.50 ms | 31.1% bf16 MFU | 123091 tok/s step 18766/19560 | loss 3.280697 (+0.56z)| norm 0.2034 (-0.24z)| lr 3.62e-06 | 4264.01 ms | 31.7% bf16 MFU | 123084 tok/s step 18767/19560 | loss 3.266139 (+0.14z)| norm 0.1995 (-0.77z)| lr 3.61e-06 | 4190.00 ms | 32.2% bf16 MFU | 123187 tok/s step 18768/19560 | loss 3.246797 (-0.42z)| norm 0.2150 (+1.32z)| lr 3.60e-06 | 4191.05 ms | 32.2% bf16 MFU | 123282 tok/s step 18769/19560 | loss 3.325830 (+1.80z)| norm 0.2588 (+6.04z)| lr 3.59e-06 | 4176.62 ms | 32.3% bf16 MFU | 123394 tok/s step 18770/19560 | loss 3.278437 (+0.46z)| norm 0.2078 (+0.23z)| lr 3.58e-06 | 4195.96 ms | 32.2% bf16 MFU | 123472 tok/s step 18771/19560 | loss 3.254801 (-0.20z)| norm 0.1980 (-0.86z)| lr 3.57e-06 | 4216.96 ms | 32.0% bf16 MFU | 123515 tok/s step 18772/19560 | loss 3.294976 (+0.94z)| norm 0.1971 (-0.96z)| lr 3.56e-06 | 4209.98 ms | 32.1% bf16 MFU | 123566 tok/s step 18773/19560 | loss 3.268827 (+0.21z)| norm 0.2038 (-0.20z)| lr 3.55e-06 | 4195.43 ms | 32.2% bf16 MFU | 123636 tok/s step 18774/19560 | loss 3.350250 (+2.44z)| norm 0.2167 (+1.25z)| lr 3.54e-06 | 4224.84 ms | 32.0% bf16 MFU | 123659 tok/s step 18775/19560 | loss 3.301324 (+1.06z)| norm 0.2069 (+0.14z)| lr 3.53e-06 | 4227.93 ms | 31.9% bf16 
MFU | 123676 tok/s step 18776/19560 | loss 3.170829 (-2.50z)| norm 0.2109 (+0.59z)| lr 3.53e-06 | 4174.66 ms | 32.3% bf16 MFU | 123772 tok/s step 18777/19560 | loss 3.322249 (+1.59z)| norm 0.2097 (+0.44z)| lr 3.52e-06 | 4208.38 ms | 32.1% bf16 MFU | 123813 tok/s step 18778/19560 | loss 3.250653 (-0.33z)| norm 0.2089 (+0.34z)| lr 3.51e-06 | 4183.90 ms | 32.3% bf16 MFU | 123887 tok/s step 18779/19560 | loss 3.243001 (-0.54z)| norm 0.2012 (-0.52z)| lr 3.50e-06 | 4184.13 ms | 32.3% bf16 MFU | 123958 tok/s step 18780/19560 | loss 3.228682 (-0.92z)| norm 0.2074 (+0.18z)| lr 3.49e-06 | 4208.88 ms | 32.1% bf16 MFU | 123989 tok/s step 18781/19560 | loss 3.303151 (+1.08z)| norm 0.2016 (-0.47z)| lr 3.48e-06 | 4181.31 ms | 32.3% bf16 MFU | 124059 tok/s step 18782/19560 | loss 3.219097 (-1.19z)| norm 0.2014 (-0.50z)| lr 3.47e-06 | 4191.65 ms | 32.2% bf16 MFU | 124110 tok/s step 18783/19560 | loss 3.279836 (+0.45z)| norm 0.1984 (-0.83z)| lr 3.46e-06 | 4200.15 ms | 32.1% bf16 MFU | 124146 tok/s step 18784/19560 | loss 3.289529 (+0.71z)| norm 0.2007 (-0.57z)| lr 3.45e-06 | 4220.57 ms | 32.0% bf16 MFU | 124149 tok/s step 18785/19560 | loss 3.292749 (+0.80z)| norm 0.1942 (-1.28z)| lr 3.45e-06 | 4182.78 ms | 32.3% bf16 MFU | 124209 tok/s step 18786/19560 | loss 3.227535 (-0.95z)| norm 0.2006 (-0.56z)| lr 3.44e-06 | 4188.68 ms | 32.2% bf16 MFU | 124257 tok/s step 18787/19560 | loss 3.278298 (+0.42z)| norm 0.1971 (-0.96z)| lr 3.43e-06 | 4189.40 ms | 32.2% bf16 MFU | 124302 tok/s step 18788/19560 | loss 3.242151 (-0.55z)| norm 0.1992 (-0.71z)| lr 3.42e-06 | 4206.98 ms | 32.1% bf16 MFU | 124318 tok/s step 18789/19560 | loss 3.272676 (+0.28z)| norm 0.2046 (-0.11z)| lr 3.41e-06 | 4191.91 ms | 32.2% bf16 MFU | 124355 tok/s step 18790/19560 | loss 3.265684 (+0.09z)| norm 0.1959 (-1.07z)| lr 3.40e-06 | 4246.44 ms | 31.8% bf16 MFU | 124311 tok/s step 18791/19560 | loss 3.311601 (+1.33z)| norm 0.1979 (-0.84z)| lr 3.39e-06 | 4180.05 ms | 32.3% bf16 MFU | 124367 tok/s step 18792/19560 | loss 
3.341786 (+2.11z)| norm 0.2123 (+0.76z)| lr 3.38e-06 | 4188.29 ms | 32.2% bf16 MFU | 124407 tok/s step 18793/19560 | loss 3.232339 (-0.82z)| norm 0.2041 (-0.15z)| lr 3.38e-06 | 4581.08 ms | 29.5% bf16 MFU | 123909 tok/s step 18794/19560 | loss 3.295987 (+0.89z)| norm 0.2069 (+0.17z)| lr 3.37e-06 | 5134.75 ms | 26.3% bf16 MFU | 122819 tok/s step 18795/19560 | loss 3.354331 (+2.38z)| norm 0.2197 (+1.58z)| lr 3.36e-06 | 4513.89 ms | 29.9% bf16 MFU | 122486 tok/s step 18796/19560 | loss 3.322522 (+1.52z)| norm 0.1917 (-1.51z)| lr 3.35e-06 | 4551.26 ms | 29.7% bf16 MFU | 122121 tok/s step 18797/19560 | loss 3.291189 (+0.68z)| norm 0.1944 (-1.20z)| lr 3.34e-06 | 4413.20 ms | 30.6% bf16 MFU | 121955 tok/s step 18798/19560 | loss 3.276847 (+0.31z)| norm 0.2059 (+0.09z)| lr 3.33e-06 | 4636.28 ms | 29.1% bf16 MFU | 121511 tok/s step 18799/19560 | loss 3.288023 (+0.60z)| norm 0.2038 (-0.16z)| lr 3.32e-06 | 4495.78 ms | 30.0% bf16 MFU | 121267 tok/s step 18800/19560 | loss 3.264496 (-0.02z)| norm 0.1976 (-0.87z)| lr 3.31e-06 | 4454.65 ms | 30.3% bf16 MFU | 121088 tok/s step 18801/19560 | loss 3.267337 (+0.06z)| norm 0.1966 (-0.97z)| lr 3.31e-06 | 4516.48 ms | 29.9% bf16 MFU | 120838 tok/s step 18802/19560 | loss 3.229523 (-0.93z)| norm 0.2190 (+1.64z)| lr 3.30e-06 | 4491.21 ms | 30.1% bf16 MFU | 120633 tok/s step 18803/19560 | loss 3.264600 (-0.02z)| norm 0.2126 (+0.88z)| lr 3.29e-06 | 4239.06 ms | 31.9% bf16 MFU | 120785 tok/s step 18804/19560 | loss 3.299033 (+0.88z)| norm 0.2060 (+0.12z)| lr 3.28e-06 | 4215.89 ms | 32.0% bf16 MFU | 120964 tok/s step 18805/19560 | loss 3.286964 (+0.55z)| norm 0.2060 (+0.12z)| lr 3.27e-06 | 4420.97 ms | 30.5% bf16 MFU | 120845 tok/s step 18806/19560 | loss 3.218582 (-1.25z)| norm 0.1949 (-1.17z)| lr 3.26e-06 | 4227.68 ms | 31.9% bf16 MFU | 121004 tok/s step 18807/19560 | loss 3.338583 (+1.87z)| norm 0.2022 (-0.33z)| lr 3.25e-06 | 4478.02 ms | 30.2% bf16 MFU | 120808 tok/s step 18808/19560 | loss 3.310000 (+1.13z)| norm 0.2109 (+0.68z)| lr 
3.24e-06 | 4181.10 ms | 32.3% bf16 MFU | 121037 tok/s step 18809/19560 | loss 3.224722 (-1.08z)| norm 0.2018 (-0.37z)| lr 3.24e-06 | 4293.46 ms | 31.4% bf16 MFU | 121091 tok/s step 18810/19560 | loss 3.303690 (+0.96z)| norm 0.2042 (-0.09z)| lr 3.23e-06 | 4209.14 ms | 32.1% bf16 MFU | 121264 tok/s step 18811/19560 | loss 3.318996 (+1.33z)| norm 0.1932 (-1.38z)| lr 3.22e-06 | 4186.13 ms | 32.3% bf16 MFU | 121463 tok/s step 18812/19560 | loss 3.232159 (-0.90z)| norm 0.2026 (-0.27z)| lr 3.21e-06 | 4263.08 ms | 31.7% bf16 MFU | 121539 tok/s step 18813/19560 | loss 3.328612 (+1.56z)| norm 0.2150 (+1.16z)| lr 3.20e-06 | 4168.84 ms | 32.4% bf16 MFU | 121750 tok/s step 18814/19560 | loss 3.261796 (-0.18z)| norm 0.2084 (+0.38z)| lr 3.19e-06 | 4179.97 ms | 32.3% bf16 MFU | 121934 tok/s step 18815/19560 | loss 3.290022 (+0.56z)| norm 0.2117 (+0.77z)| lr 3.18e-06 | 4171.46 ms | 32.4% bf16 MFU | 122122 tok/s step 18816/19560 | loss 3.246884 (-0.56z)| norm 0.2009 (-0.49z)| lr 3.18e-06 | 4242.45 ms | 31.8% bf16 MFU | 122195 tok/s step 18817/19560 | loss 3.287760 (+0.50z)| norm 0.2142 (+1.04z)| lr 3.17e-06 | 4228.02 ms | 31.9% bf16 MFU | 122285 tok/s step 18818/19560 | loss 3.245348 (-0.60z)| norm 0.2232 (+2.04z)| lr 3.16e-06 | 4180.51 ms | 32.3% bf16 MFU | 122442 tok/s step 18819/19560 | loss 3.261958 (-0.16z)| norm 0.2397 (+3.68z)| lr 3.15e-06 | 4176.63 ms | 32.3% bf16 MFU | 122596 tok/s step 18820/19560 | loss 3.305268 (+0.97z)| norm 0.1973 (-0.89z)| lr 3.14e-06 | 4177.51 ms | 32.3% bf16 MFU | 122741 tok/s step 18821/19560 | loss 3.243317 (-0.66z)| norm 0.1990 (-0.70z)| lr 3.13e-06 | 4223.07 ms | 32.0% bf16 MFU | 122812 tok/s step 18822/19560 | loss 3.300664 (+0.84z)| norm 0.2085 (+0.32z)| lr 3.13e-06 | 4174.57 ms | 32.3% bf16 MFU | 122951 tok/s step 18823/19560 | loss 3.318002 (+1.28z)| norm 0.2025 (-0.33z)| lr 3.12e-06 | 4176.44 ms | 32.3% bf16 MFU | 123080 tok/s step 18824/19560 | loss 3.287304 (+0.49z)| norm 0.2018 (-0.39z)| lr 3.11e-06 | 4247.72 ms | 31.8% bf16 MFU | 123097 
tok/s step 18825/19560 | loss 3.275506 (+0.17z)| norm 0.2090 (+0.39z)| lr 3.10e-06 | 4436.79 ms | 30.4% bf16 MFU | 122851 tok/s step 18826/19560 | loss 3.300776 (+0.85z)| norm 0.2038 (-0.16z)| lr 3.09e-06 | 4211.32 ms | 32.1% bf16 MFU | 122933 tok/s step 18827/19560 | loss 3.237304 (-0.86z)| norm 0.2045 (-0.08z)| lr 3.08e-06 | 4191.87 ms | 32.2% bf16 MFU | 123040 tok/s step 18828/19560 | loss 3.250711 (-0.48z)| norm 0.2059 (+0.08z)| lr 3.07e-06 | 4179.86 ms | 32.3% bf16 MFU | 123160 tok/s step 18829/19560 | loss 3.348689 (+2.12z)| norm 0.2163 (+1.21z)| lr 3.07e-06 | 4261.87 ms | 31.7% bf16 MFU | 123152 tok/s step 18830/19560 | loss 3.245177 (-0.65z)| norm 0.2025 (-0.31z)| lr 3.06e-06 | 4184.30 ms | 32.3% bf16 MFU | 123260 tok/s step 18831/19560 | loss 3.283216 (+0.36z)| norm 0.2041 (-0.13z)| lr 3.05e-06 | 4213.83 ms | 32.0% bf16 MFU | 123318 tok/s step 18832/19560 | loss 3.236658 (-0.87z)| norm 0.2047 (-0.06z)| lr 3.04e-06 | 4200.03 ms | 32.1% bf16 MFU | 123393 tok/s step 18833/19560 | loss 3.307141 (+1.00z)| norm 0.2044 (-0.10z)| lr 3.03e-06 | 4177.87 ms | 32.3% bf16 MFU | 123498 tok/s step 18834/19560 | loss 3.286672 (+0.45z)| norm 0.2002 (-0.57z)| lr 3.02e-06 | 4171.24 ms | 32.4% bf16 MFU | 123608 tok/s step 18835/19560 | loss 3.274407 (+0.12z)| norm 0.2067 (+0.15z)| lr 3.02e-06 | 4198.72 ms | 32.2% bf16 MFU | 123671 tok/s step 18836/19560 | loss 3.279768 (+0.25z)| norm 0.2029 (-0.27z)| lr 3.01e-06 | 4185.58 ms | 32.3% bf16 MFU | 123750 tok/s step 18837/19560 | loss 3.290369 (+0.52z)| norm 0.2021 (-0.35z)| lr 3.00e-06 | 4222.14 ms | 32.0% bf16 MFU | 123772 tok/s step 18838/19560 | loss 3.260833 (-0.27z)| norm 0.1946 (-1.17z)| lr 2.99e-06 | 4177.55 ms | 32.3% bf16 MFU | 123858 tok/s step 18839/19560 | loss 3.342245 (+1.88z)| norm 0.2048 (-0.06z)| lr 2.98e-06 | 4171.92 ms | 32.4% bf16 MFU | 123949 tok/s step 18840/19560 | loss 3.313067 (+1.09z)| norm 0.2028 (-0.28z)| lr 2.98e-06 | 4314.14 ms | 31.3% bf16 MFU | 123828 tok/s step 18841/19560 | loss 3.309666 
(+0.98z)| norm 0.2054 (+0.01z)| lr 2.97e-06 | 4198.93 ms | 32.2% bf16 MFU | 123880 tok/s step 18842/19560 | loss 3.204376 (-1.79z)| norm 0.2060 (+0.07z)| lr 2.96e-06 | 4183.08 ms | 32.3% bf16 MFU | 123952 tok/s step 18843/19560 | loss 3.224978 (-1.26z)| norm 0.1981 (-0.80z)| lr 2.95e-06 | 4297.79 ms | 31.4% bf16 MFU | 123854 tok/s step 18844/19560 | loss 3.289834 (+0.45z)| norm 0.2206 (+1.65z)| lr 2.94e-06 | 4186.31 ms | 32.3% bf16 MFU | 123923 tok/s step 18845/19560 | loss 3.404771 (+3.31z)| norm 0.2087 (+0.35z)| lr 2.93e-06 | 4177.61 ms | 32.3% bf16 MFU | 124002 tok/s step 18846/19560 | loss 3.314506 (+1.02z)| norm 0.2042 (-0.14z)| lr 2.93e-06 | 4349.73 ms | 31.0% bf16 MFU | 123829 tok/s step 18847/19560 | loss 3.231848 (-1.05z)| norm 0.2147 (+0.99z)| lr 2.92e-06 | 4175.07 ms | 32.3% bf16 MFU | 123916 tok/s step 18848/19560 | loss 3.298391 (+0.62z)| norm 0.2055 (-0.00z)| lr 2.91e-06 | 4954.00 ms | 27.3% bf16 MFU | 123012 tok/s step 18849/19560 | loss 3.283175 (+0.24z)| norm 0.1921 (-1.44z)| lr 2.90e-06 | 4443.54 ms | 30.4% bf16 MFU | 122761 tok/s step 18850/19560 | loss 3.265298 (-0.20z)| norm 0.2179 (+1.32z)| lr 2.89e-06 | 4356.28 ms | 31.0% bf16 MFU | 122640 tok/s step 18851/19560 | loss 3.261577 (-0.29z)| norm 0.1979 (-0.82z)| lr 2.89e-06 | 4362.26 ms | 31.0% bf16 MFU | 122518 tok/s step 18852/19560 | loss 3.294606 (+0.54z)| norm 0.2035 (-0.22z)| lr 2.88e-06 | 5435.42 ms | 24.8% bf16 MFU | 121215 tok/s step 18853/19560 | loss 3.439653 (+3.93z)| norm 0.2047 (-0.10z)| lr 2.87e-06 | 4582.73 ms | 29.5% bf16 MFU | 120874 tok/s step 18854/19560 | loss 3.308115 (+0.78z)| norm 0.2062 (+0.05z)| lr 2.86e-06 | 4846.59 ms | 27.9% bf16 MFU | 120239 tok/s step 18855/19560 | loss 3.254738 (-0.49z)| norm 0.2076 (+0.20z)| lr 2.85e-06 | 4617.90 ms | 29.2% bf16 MFU | 119904 tok/s step 18856/19560 | loss 3.312006 (+0.89z)| norm 0.2009 (-0.51z)| lr 2.84e-06 | 4201.48 ms | 32.1% bf16 MFU | 120148 tok/s step 18857/19560 | loss 3.297751 (+0.54z)| norm 0.2020 (-0.40z)| lr 2.84e-06 | 
4746.58 ms | 28.4% bf16 MFU | 119664 tok/s step 18858/19560 | loss 3.276576 (+0.03z)| norm 0.2012 (-0.48z)| lr 2.83e-06 | 4381.05 ms | 30.8% bf16 MFU | 119664 tok/s step 18859/19560 | loss 3.294162 (+0.46z)| norm 0.2154 (+1.05z)| lr 2.82e-06 | 4171.32 ms | 32.4% bf16 MFU | 119965 tok/s step 18860/19560 | loss 3.332764 (+1.37z)| norm 0.2089 (+0.35z)| lr 2.81e-06 | 4292.86 ms | 31.5% bf16 MFU | 120073 tok/s step 18861/19560 | loss 3.289469 (+0.33z)| norm 0.1953 (-1.12z)| lr 2.80e-06 | 4712.45 ms | 28.7% bf16 MFU | 119633 tok/s step 18862/19560 | loss 3.311316 (+0.85z)| norm 0.1991 (-0.70z)| lr 2.80e-06 | 5908.52 ms | 22.9% bf16 MFU | 118088 tok/s step 18863/19560 | loss 3.261784 (-0.35z)| norm 0.1979 (-0.84z)| lr 2.79e-06 | 4831.52 ms | 27.9% bf16 MFU | 117609 tok/s step 18864/19560 | loss 3.269110 (-0.17z)| norm 0.1992 (-0.69z)| lr 2.78e-06 | 5178.68 ms | 26.1% bf16 MFU | 116790 tok/s step 18865/19560 | loss 3.264226 (-0.29z)| norm 0.1981 (-0.80z)| lr 2.77e-06 | 4748.14 ms | 28.4% bf16 MFU | 116472 tok/s step 18866/19560 | loss 3.215200 (-1.50z)| norm 0.2110 (+0.58z)| lr 2.76e-06 | 4166.03 ms | 32.4% bf16 MFU | 116941 tok/s step 18867/19560 | loss 3.268455 (-0.19z)| norm 0.2005 (-0.54z)| lr 2.76e-06 | 4157.21 ms | 32.5% bf16 MFU | 117400 tok/s step 18868/19560 | loss 3.287845 (+0.27z)| norm 0.2022 (-0.37z)| lr 2.75e-06 | 4158.72 ms | 32.5% bf16 MFU | 117833 tok/s step 18869/19560 | loss 3.337430 (+1.47z)| norm 0.2033 (-0.24z)| lr 2.74e-06 | 4320.51 ms | 31.3% bf16 MFU | 118009 tok/s step 18870/19560 | loss 3.279365 (+0.04z)| norm 0.2190 (+1.49z)| lr 2.73e-06 | 4239.63 ms | 31.8% bf16 MFU | 118292 tok/s step 18871/19560 | loss 3.242796 (-0.86z)| norm 0.1938 (-1.28z)| lr 2.73e-06 | 4171.22 ms | 32.4% bf16 MFU | 118662 tok/s step 18872/19560 | loss 3.302704 (+0.62z)| norm 0.2015 (-0.42z)| lr 2.72e-06 | 4161.78 ms | 32.4% bf16 MFU | 119027 tok/s step 18873/19560 | loss 3.303747 (+0.63z)| norm 0.2109 (+0.62z)| lr 2.71e-06 | 4239.80 ms | 31.8% bf16 MFU | 119259 tok/s step 
18874/19560 | loss 3.276959 (-0.03z)| norm 0.2005 (-0.53z)| lr 2.70e-06 | 4262.76 ms | 31.7% bf16 MFU | 119446 tok/s step 18875/19560 | loss 3.259317 (-0.47z)| norm 0.2030 (-0.25z)| lr 2.69e-06 | 4164.33 ms | 32.4% bf16 MFU | 119768 tok/s step 18876/19560 | loss 3.272648 (-0.16z)| norm 0.1995 (-0.64z)| lr 2.69e-06 | 4450.91 ms | 30.3% bf16 MFU | 119670 tok/s step 18877/19560 | loss 3.264393 (-0.36z)| norm 0.1994 (-0.65z)| lr 2.68e-06 | 4172.72 ms | 32.4% bf16 MFU | 119968 tok/s step 18878/19560 | loss 3.282877 (+0.10z)| norm 0.1902 (-1.63z)| lr 2.67e-06 | 4496.39 ms | 30.0% bf16 MFU | 119800 tok/s step 18879/19560 | loss 3.257331 (-0.54z)| norm 0.1981 (-0.76z)| lr 2.66e-06 | 4171.69 ms | 32.4% bf16 MFU | 120094 tok/s step 18880/19560 | loss 3.405924 (+3.06z)| norm 0.2193 (+1.53z)| lr 2.65e-06 | 4197.32 ms | 32.2% bf16 MFU | 120335 tok/s step 18881/19560 | loss 3.235841 (-1.06z)| norm 0.1971 (-0.86z)| lr 2.65e-06 | 4179.08 ms | 32.3% bf16 MFU | 120591 tok/s step 18882/19560 | loss 3.225016 (-1.31z)| norm 0.2022 (-0.31z)| lr 2.64e-06 | 4383.33 ms | 30.8% bf16 MFU | 120542 tok/s step 18883/19560 | loss 3.247657 (-0.75z)| norm 0.1998 (-0.57z)| lr 2.63e-06 | 4591.12 ms | 29.4% bf16 MFU | 120224 tok/s step 18884/19560 | loss 3.338971 (+1.42z)| norm 0.2007 (-0.48z)| lr 2.62e-06 | 4235.55 ms | 31.9% bf16 MFU | 120402 tok/s step 18885/19560 | loss 3.303839 (+0.57z)| norm 0.2050 (-0.01z)| lr 2.62e-06 | 4230.51 ms | 31.9% bf16 MFU | 120579 tok/s step 18886/19560 | loss 3.322947 (+1.03z)| norm 0.1998 (-0.55z)| lr 2.61e-06 | 4168.00 ms | 32.4% bf16 MFU | 120839 tok/s step 18887/19560 | loss 3.323860 (+1.05z)| norm 0.1988 (-0.65z)| lr 2.60e-06 | 4235.31 ms | 31.9% bf16 MFU | 120987 tok/s step 18888/19560 | loss 3.289880 (+0.22z)| norm 0.2048 (+0.03z)| lr 2.59e-06 | 4166.07 ms | 32.4% bf16 MFU | 121230 tok/s step 18889/19560 | loss 3.234514 (-1.13z)| norm 0.2233 (+2.08z)| lr 2.58e-06 | 4232.93 ms | 31.9% bf16 MFU | 121361 tok/s step 18890/19560 | loss 3.286458 (+0.13z)| norm 
0.1954 (-1.03z)| lr 2.58e-06 | 4305.13 ms | 31.4% bf16 MFU | 121382 tok/s step 18891/19560 | loss 3.308988 (+0.67z)| norm 0.1933 (-1.25z)| lr 2.57e-06 | 4183.77 ms | 32.3% bf16 MFU | 121579 tok/s step 18892/19560 | loss 3.291396 (+0.23z)| norm 0.1988 (-0.66z)| lr 2.56e-06 | 4307.30 ms | 31.3% bf16 MFU | 121586 tok/s step 18893/19560 | loss 3.282919 (+0.01z)| norm 0.2126 (+0.87z)| lr 2.55e-06 | 4214.07 ms | 32.0% bf16 MFU | 121727 tok/s step 18894/19560 | loss 3.340788 (+1.43z)| norm 0.2027 (-0.23z)| lr 2.55e-06 | 4176.11 ms | 32.3% bf16 MFU | 121918 tok/s step 18895/19560 | loss 3.312452 (+0.72z)| norm 0.2027 (-0.24z)| lr 2.54e-06 | 4338.46 ms | 31.1% bf16 MFU | 121865 tok/s step 18896/19560 | loss 3.296726 (+0.32z)| norm 0.1982 (-0.72z)| lr 2.53e-06 | 4192.31 ms | 32.2% bf16 MFU | 122024 tok/s step 18897/19560 | loss 3.263325 (-0.50z)| norm 0.1966 (-0.99z)| lr 2.52e-06 | 4177.98 ms | 32.3% bf16 MFU | 122198 tok/s step 18898/19560 | loss 3.253692 (-0.73z)| norm 0.1951 (-1.18z)| lr 2.52e-06 | 4174.54 ms | 32.3% bf16 MFU | 122367 tok/s step 18899/19560 | loss 3.323208 (+0.98z)| norm 0.2115 (+0.95z)| lr 2.51e-06 | 4395.92 ms | 30.7% bf16 MFU | 122212 tok/s step 18900/19560 | loss 3.272973 (-0.26z)| norm 0.1995 (-0.61z)| lr 2.50e-06 | 4209.27 ms | 32.1% bf16 MFU | 122329 tok/s step 18901/19560 | loss 3.297920 (+0.35z)| norm 0.1947 (-1.23z)| lr 2.49e-06 | 4181.18 ms | 32.3% bf16 MFU | 122483 tok/s step 18902/19560 | loss 3.292771 (+0.24z)| norm 0.2062 (+0.28z)| lr 2.49e-06 | 4176.07 ms | 32.3% bf16 MFU | 122636 tok/s step 18903/19560 | loss 3.213121 (-1.73z)| norm 0.1957 (-1.09z)| lr 2.48e-06 | 4156.79 ms | 32.5% bf16 MFU | 122810 tok/s step 18904/19560 | loss 3.280831 (-0.07z)| norm 0.2105 (+0.87z)| lr 2.47e-06 | 4187.11 ms | 32.2% bf16 MFU | 122931 tok/s step 18905/19560 | loss 3.340874 (+1.46z)| norm 0.2050 (+0.14z)| lr 2.46e-06 | 4245.32 ms | 31.8% bf16 MFU | 122959 tok/s step 18906/19560 | loss 3.307433 (+0.60z)| norm 0.2068 (+0.39z)| lr 2.46e-06 | 4178.55 ms | 
32.3% bf16 MFU | 123085 tok/s step 18907/19560 | loss 3.277378 (-0.18z)| norm 0.1965 (-0.97z)| lr 2.45e-06 | 4191.97 ms | 32.2% bf16 MFU | 123184 tok/s step 18908/19560 | loss 3.229381 (-1.41z)| norm 0.1991 (-0.62z)| lr 2.44e-06 | 4182.41 ms | 32.3% bf16 MFU | 123292 tok/s step 18909/19560 | loss 3.274967 (-0.24z)| norm 0.2017 (-0.28z)| lr 2.43e-06 | 4190.58 ms | 32.2% bf16 MFU | 123383 tok/s step 18910/19560 | loss 3.314289 (+0.76z)| norm 0.2026 (-0.16z)| lr 2.43e-06 | 4183.19 ms | 32.3% bf16 MFU | 123481 tok/s step 18911/19560 | loss 3.289713 (+0.12z)| norm 0.2009 (-0.39z)| lr 2.42e-06 | 4189.86 ms | 32.2% bf16 MFU | 123563 tok/s step 18912/19560 | loss 3.276280 (-0.22z)| norm 0.1992 (-0.60z)| lr 2.41e-06 | 4179.01 ms | 32.3% bf16 MFU | 123658 tok/s step 18913/19560 | loss 3.232442 (-1.34z)| norm 0.2007 (-0.42z)| lr 2.40e-06 | 4192.85 ms | 32.2% bf16 MFU | 123727 tok/s step 18914/19560 | loss 3.287168 (+0.06z)| norm 0.2005 (-0.44z)| lr 2.40e-06 | 4183.74 ms | 32.3% bf16 MFU | 123807 tok/s step 18915/19560 | loss 3.329856 (+1.15z)| norm 0.1958 (-1.07z)| lr 2.39e-06 | 4280.41 ms | 31.5% bf16 MFU | 123741 tok/s step 18916/19560 | loss 3.264818 (-0.53z)| norm 0.2024 (-0.19z)| lr 2.38e-06 | 4240.40 ms | 31.8% bf16 MFU | 123736 tok/s step 18917/19560 | loss 3.259365 (-0.67z)| norm 0.2113 (+0.97z)| lr 2.37e-06 | 4197.51 ms | 32.2% bf16 MFU | 123794 tok/s step 18918/19560 | loss 3.270847 (-0.38z)| norm 0.1958 (-1.07z)| lr 2.37e-06 | 4190.08 ms | 32.2% bf16 MFU | 123861 tok/s step 18919/19560 | loss 3.259819 (-0.65z)| norm 0.2028 (-0.15z)| lr 2.36e-06 | 4181.63 ms | 32.3% bf16 MFU | 123937 tok/s step 18920/19560 | loss 3.291768 (+0.19z)| norm 0.2001 (-0.49z)| lr 2.35e-06 | 4203.97 ms | 32.1% bf16 MFU | 123975 tok/s step 18921/19560 | loss 3.251587 (-0.87z)| norm 0.1993 (-0.60z)| lr 2.34e-06 | 4173.31 ms | 32.4% bf16 MFU | 124058 tok/s step 18922/19560 | loss 3.323864 (+1.02z)| norm 0.1970 (-0.90z)| lr 2.34e-06 | 4213.68 ms | 32.0% bf16 MFU | 124076 tok/s step 18923/19560 
| loss 3.257200 (-0.71z)| norm 0.2018 (-0.25z)| lr 2.33e-06 | 4183.57 ms | 32.3% bf16 MFU | 124139 tok/s step 18924/19560 | loss 3.314108 (+0.79z)| norm 0.2007 (-0.41z)| lr 2.32e-06 | 4287.09 ms | 31.5% bf16 MFU | 124046 tok/s step 18925/19560 | loss 3.251542 (-0.85z)| norm 0.2102 (+0.87z)| lr 2.32e-06 | 4204.38 ms | 32.1% bf16 MFU | 124079 tok/s step 18926/19560 | loss 3.255067 (-0.75z)| norm 0.2627 (+6.50z)| lr 2.31e-06 | 4396.14 ms | 30.7% bf16 MFU | 123838 tok/s step 18927/19560 | loss 3.346167 (+1.62z)| norm 0.1965 (-0.85z)| lr 2.30e-06 | 4175.89 ms | 32.3% bf16 MFU | 123924 tok/s step 18928/19560 | loss 3.260950 (-0.60z)| norm 0.1991 (-0.57z)| lr 2.29e-06 | 4170.67 ms | 32.4% bf16 MFU | 124013 tok/s step 18929/19560 | loss 3.278871 (-0.14z)| norm 0.2093 (+0.55z)| lr 2.29e-06 | 4219.16 ms | 32.0% bf16 MFU | 124026 tok/s step 18930/19560 | loss 3.261454 (-0.60z)| norm 0.2035 (-0.07z)| lr 2.28e-06 | 4261.88 ms | 31.7% bf16 MFU | 123975 tok/s step 18931/19560 | loss 3.280017 (-0.12z)| norm 0.1929 (-1.25z)| lr 2.27e-06 | 4247.17 ms | 31.8% bf16 MFU | 123949 tok/s step 18932/19560 | loss 3.296377 (+0.31z)| norm 0.1994 (-0.51z)| lr 2.26e-06 | 4204.84 ms | 32.1% bf16 MFU | 123986 tok/s step 18933/19560 | loss 3.311792 (+0.71z)| norm 0.1963 (-0.84z)| lr 2.26e-06 | 4185.11 ms | 32.3% bf16 MFU | 124050 tok/s step 18934/19560 | loss 3.289069 (+0.10z)| norm 0.1960 (-0.89z)| lr 2.25e-06 | 4172.99 ms | 32.4% bf16 MFU | 124129 tok/s step 18935/19560 | loss 3.246218 (-1.02z)| norm 0.2065 (+0.28z)| lr 2.24e-06 | 4197.08 ms | 32.2% bf16 MFU | 124169 tok/s step 18936/19560 | loss 3.267521 (-0.45z)| norm 0.2024 (-0.17z)| lr 2.24e-06 | 4465.29 ms | 30.2% bf16 MFU | 123831 tok/s step 18937/19560 | loss 3.277924 (-0.18z)| norm 0.2013 (-0.29z)| lr 2.23e-06 | 4172.63 ms | 32.4% bf16 MFU | 123922 tok/s step 18938/19560 | loss 3.337045 (+1.40z)| norm 0.1999 (-0.44z)| lr 2.22e-06 | 4182.63 ms | 32.3% bf16 MFU | 123993 tok/s step 18939/19560 | loss 3.275252 (-0.25z)| norm 0.2004 (-0.39z)| 
lr 2.21e-06 | 4181.11 ms | 32.3% bf16 MFU | 124063 tok/s step 18940/19560 | loss 3.248404 (-0.98z)| norm 0.1978 (-0.68z)| lr 2.21e-06 | 4207.68 ms | 32.1% bf16 MFU | 124090 tok/s step 18941/19560 | loss 3.286683 (+0.06z)| norm 0.2017 (-0.23z)| lr 2.20e-06 | 4175.75 ms | 32.3% bf16 MFU | 124164 tok/s step 18942/19560 | loss 3.374210 (+2.36z)| norm 0.2049 (+0.14z)| lr 2.19e-06 | 4204.97 ms | 32.1% bf16 MFU | 124190 tok/s step 18943/19560 | loss 3.232880 (-1.37z)| norm 0.2002 (-0.39z)| lr 2.19e-06 | 4196.85 ms | 32.2% bf16 MFU | 124226 tok/s step 18944/19560 | loss 3.247937 (-0.97z)| norm 0.1933 (-1.16z)| lr 2.18e-06 | 4181.18 ms | 32.3% bf16 MFU | 124285 tok/s step 18945/19560 | loss 3.361926 (+1.99z)| norm 0.2115 (+0.90z)| lr 2.17e-06 | 4228.81 ms | 31.9% bf16 MFU | 124269 tok/s step 18946/19560 | loss 3.239946 (-1.18z)| norm 0.2023 (-0.12z)| lr 2.17e-06 | 4223.46 ms | 32.0% bf16 MFU | 124263 tok/s step 18947/19560 | loss 3.241817 (-1.12z)| norm 0.2013 (-0.23z)| lr 2.16e-06 | 4198.78 ms | 32.2% bf16 MFU | 124293 tok/s step 18948/19560 | loss 3.273099 (-0.31z)| norm 0.1982 (-0.61z)| lr 2.15e-06 | 4176.88 ms | 32.3% bf16 MFU | 124354 tok/s step 18949/19560 | loss 3.293056 (+0.20z)| norm 0.2002 (-0.36z)| lr 2.14e-06 | 5992.07 ms | 22.5% bf16 MFU | 122512 tok/s step 18950/19560 | loss 3.343272 (+1.48z)| norm 0.2039 (+0.10z)| lr 2.14e-06 | 4248.15 ms | 31.8% bf16 MFU | 122557 tok/s step 18951/19560 | loss 3.316343 (+0.79z)| norm 0.2042 (+0.14z)| lr 2.13e-06 | 4769.55 ms | 28.3% bf16 MFU | 121925 tok/s step 18952/19560 | loss 3.282069 (-0.09z)| norm 0.1965 (-0.81z)| lr 2.12e-06 | 4197.72 ms | 32.2% bf16 MFU | 122074 tok/s step 18953/19560 | loss 3.251580 (-0.87z)| norm 0.2182 (+1.85z)| lr 2.12e-06 | 4255.48 ms | 31.7% bf16 MFU | 122130 tok/s step 18954/19560 | loss 3.238140 (-1.19z)| norm 0.1962 (-0.84z)| lr 2.11e-06 | 4178.65 ms | 32.3% bf16 MFU | 122297 tok/s step 18955/19560 | loss 3.256892 (-0.72z)| norm 0.2094 (+0.77z)| lr 2.10e-06 | 4313.41 ms | 31.3% bf16 MFU | 
122260 tok/s step 18956/19560 | loss 3.267892 (-0.44z)| norm 0.2019 (-0.14z)| lr 2.10e-06 | 9865.33 ms | 13.7% bf16 MFU | 118804 tok/s step 18957/19560 | loss 3.293936 (+0.24z)| norm 0.2105 (+0.92z)| lr 2.09e-06 | 5645.34 ms | 23.9% bf16 MFU | 117507 tok/s step 18958/19560 | loss 3.262832 (-0.57z)| norm 0.2006 (-0.30z)| lr 2.08e-06 | 4188.34 ms | 32.2% bf16 MFU | 117891 tok/s step 18959/19560 | loss 3.259000 (-0.67z)| norm 0.2056 (+0.32z)| lr 2.07e-06 | 4971.69 ms | 27.2% bf16 MFU | 117269 tok/s step 18960/19560 | loss 3.297320 (+0.32z)| norm 0.2248 (+2.59z)| lr 2.07e-06 | 4375.52 ms | 30.9% bf16 MFU | 117397 tok/s step 18961/19560 | loss 3.319355 (+0.89z)| norm 0.1969 (-0.74z)| lr 2.06e-06 | 4167.41 ms | 32.4% bf16 MFU | 117817 tok/s step 18962/19560 | loss 3.294829 (+0.25z)| norm 0.1937 (-1.11z)| lr 2.05e-06 | 4151.53 ms | 32.5% bf16 MFU | 118241 tok/s step 18963/19560 | loss 3.271921 (-0.35z)| norm 0.2015 (-0.18z)| lr 2.05e-06 | 4198.24 ms | 32.2% bf16 MFU | 118573 tok/s step 18964/19560 | loss 3.235372 (-1.29z)| norm 0.1925 (-1.23z)| lr 2.04e-06 | 4170.92 ms | 32.4% bf16 MFU | 118929 tok/s step 18965/19560 | loss 3.226715 (-1.48z)| norm 0.2119 (+1.04z)| lr 2.03e-06 | 4166.07 ms | 32.4% bf16 MFU | 119275 tok/s step 18966/19560 | loss 3.289191 (+0.12z)| norm 0.2003 (-0.33z)| lr 2.03e-06 | 4164.35 ms | 32.4% bf16 MFU | 119606 tok/s step 18967/19560 | loss 3.321991 (+0.97z)| norm 0.2177 (+1.69z)| lr 2.02e-06 | 4243.87 ms | 31.8% bf16 MFU | 119803 tok/s step 18968/19560 | loss 3.320774 (+0.93z)| norm 0.1951 (-0.93z)| lr 2.01e-06 | 4160.40 ms | 32.5% bf16 MFU | 120114 tok/s step 18969/19560 | loss 3.293032 (+0.22z)| norm 0.2063 (+0.37z)| lr 2.01e-06 | 4188.52 ms | 32.2% bf16 MFU | 120367 tok/s step 18970/19560 | loss 3.326945 (+1.09z)| norm 0.1966 (-0.75z)| lr 2.00e-06 | 4184.44 ms | 32.3% bf16 MFU | 120613 tok/s step 18971/19560 | loss 3.337146 (+1.34z)| norm 0.1999 (-0.37z)| lr 1.99e-06 | 4182.57 ms | 32.3% bf16 MFU | 120850 tok/s step 18972/19560 | loss 3.272741 
(-0.35z)| norm 0.1977 (-0.62z)| lr 1.99e-06 | 4173.41 ms | 32.4% bf16 MFU | 121089 tok/s step 18973/19560 | loss 3.268214 (-0.46z)| norm 0.2155 (+1.47z)| lr 1.98e-06 | 4213.73 ms | 32.0% bf16 MFU | 121256 tok/s step 18974/19560 | loss 3.261662 (-0.63z)| norm 0.2054 (+0.28z)| lr 1.97e-06 | 4272.88 ms | 31.6% bf16 MFU | 121328 tok/s step 18975/19560 | loss 3.269867 (-0.42z)| norm 0.1916 (-1.31z)| lr 1.97e-06 | 4174.52 ms | 32.3% bf16 MFU | 121541 tok/s step 18976/19560 | loss 3.251484 (-0.91z)| norm 0.1977 (-0.59z)| lr 1.96e-06 | 4180.05 ms | 32.3% bf16 MFU | 121735 tok/s step 18977/19560 | loss 3.281703 (-0.08z)| norm 0.1925 (-1.19z)| lr 1.95e-06 | 4162.33 ms | 32.4% bf16 MFU | 121947 tok/s step 18978/19560 | loss 3.233100 (-1.40z)| norm 0.2010 (-0.18z)| lr 1.95e-06 | 4179.14 ms | 32.3% bf16 MFU | 122122 tok/s step 18979/19560 | loss 3.302937 (+0.50z)| norm 0.2086 (+0.71z)| lr 1.94e-06 | 4181.94 ms | 32.3% bf16 MFU | 122284 tok/s step 18980/19560 | loss 3.269287 (-0.41z)| norm 0.2022 (-0.05z)| lr 1.93e-06 | 4180.25 ms | 32.3% bf16 MFU | 122441 tok/s step 18981/19560 | loss 3.363183 (+2.29z)| norm 0.2035 (+0.11z)| lr 1.93e-06 | 4178.18 ms | 32.3% bf16 MFU | 122593 tok/s step 18982/19560 | loss 3.242593 (-1.17z)| norm 0.1972 (-0.64z)| lr 1.92e-06 | 4177.62 ms | 32.3% bf16 MFU | 122738 tok/s step 18983/19560 | loss 3.283174 (-0.01z)| norm 0.1975 (-0.59z)| lr 1.91e-06 | 4183.23 ms | 32.3% bf16 MFU | 122868 tok/s step 18984/19560 | loss 3.277730 (-0.16z)| norm 0.1977 (-0.56z)| lr 1.91e-06 | 5573.16 ms | 24.2% bf16 MFU | 121428 tok/s step 18985/19560 | loss 3.295014 (+0.34z)| norm 0.1969 (-0.65z)| lr 1.90e-06 | 4615.93 ms | 29.3% bf16 MFU | 121036 tok/s step 18986/19560 | loss 3.311194 (+0.80z)| norm 0.2132 (+1.26z)| lr 1.89e-06 | 4870.09 ms | 27.7% bf16 MFU | 120367 tok/s step 18987/19560 | loss 3.286609 (+0.09z)| norm 0.2072 (+0.56z)| lr 1.89e-06 | 5216.31 ms | 25.9% bf16 MFU | 119374 tok/s step 18988/19560 | loss 3.239871 (-1.24z)| norm 0.2069 (+0.53z)| lr 1.88e-06 | 
4491.03 ms | 30.1% bf16 MFU | 119242 tok/s step 18989/19560 | loss 3.252751 (-0.86z)| norm 0.2040 (+0.17z)| lr 1.87e-06 | 4283.99 ms | 31.5% bf16 MFU | 119399 tok/s step 18990/19560 | loss 3.292721 (+0.30z)| norm 0.1950 (-0.88z)| lr 1.87e-06 | 4231.60 ms | 31.9% bf16 MFU | 119624 tok/s step 18991/19560 | loss 3.283152 (+0.02z)| norm 0.2006 (-0.22z)| lr 1.86e-06 | 8591.61 ms | 15.7% bf16 MFU | 116694 tok/s step 18992/19560 | loss 3.268039 (-0.42z)| norm 0.2017 (-0.09z)| lr 1.85e-06 | 4820.18 ms | 28.0% bf16 MFU | 116298 tok/s step 18993/19560 | loss 3.231002 (-1.47z)| norm 0.1974 (-0.61z)| lr 1.85e-06 | 4412.61 ms | 30.6% bf16 MFU | 116424 tok/s step 18994/19560 | loss 3.282475 (-0.01z)| norm 0.1976 (-0.57z)| lr 1.84e-06 | 7398.89 ms | 18.2% bf16 MFU | 114146 tok/s step 18995/19560 | loss 3.337831 (+1.57z)| norm 0.2065 (+0.48z)| lr 1.83e-06 | 4198.49 ms | 32.2% bf16 MFU | 114682 tok/s step 18996/19560 | loss 3.342508 (+1.68z)| norm 0.2299 (+3.12z)| lr 1.83e-06 | 4203.96 ms | 32.1% bf16 MFU | 115184 tok/s step 18997/19560 | loss 3.268464 (-0.42z)| norm 0.2158 (+1.48z)| lr 1.82e-06 | 4308.68 ms | 31.3% bf16 MFU | 115509 tok/s step 18998/19560 | loss 3.295796 (+0.36z)| norm 0.1995 (-0.36z)| lr 1.81e-06 | 4180.10 ms | 32.3% bf16 MFU | 116004 tok/s step 18999/19560 | loss 3.219914 (-1.81z)| norm 0.2044 (+0.19z)| lr 1.81e-06 | 4387.38 ms | 30.8% bf16 MFU | 116179 tok/s step 19000/19560 | loss 3.305523 (+0.64z)| norm 0.2015 (-0.14z)| lr 1.80e-06 | 4366.40 ms | 30.9% bf16 MFU | 116374 tok/s val loss 3.251616 HellaSwag: 3050/10042 = 0.303724
step 19001/19560 | loss 3.287857 (+0.14z)| norm 0.1959 (-0.77z)| lr 1.80e-06 | 4340.64 ms | 31.1% bf16 MFU | 116595 tok/s step 19002/19560 | loss 3.289145 (+0.17z)| norm 0.2008 (-0.21z)| lr 1.79e-06 | 4474.11 ms | 30.2% bf16 MFU | 116624 tok/s step 19003/19560 | loss 3.327713 (+1.25z)| norm 0.2031 (+0.06z)| lr 1.78e-06 | 4240.57 ms | 31.8% bf16 MFU | 116975 tok/s step 19004/19560 | loss 3.367145 (+2.30z)| norm 0.2032 (+0.07z)| lr 1.78e-06 | 4269.51 ms | 31.6% bf16 MFU | 117266 tok/s step 19005/19560
| loss 3.208055 (-2.08z)| norm 0.2094 (+0.77z)| lr 1.77e-06 | 5300.58 ms | 25.5% bf16 MFU | 116348 tok/s step 19006/19560 | loss 3.300234 (+0.44z)| norm 0.1960 (-0.78z)| lr 1.76e-06 | 4473.90 ms | 30.2% bf16 MFU | 116390 tok/s step 19007/19560 | loss 3.249601 (-0.94z)| norm 0.1995 (-0.38z)| lr 1.76e-06 | 4413.01 ms | 30.6% bf16 MFU | 116511 tok/s step 19008/19560 | loss 3.282521 (-0.02z)| norm 0.2034 (+0.09z)| lr 1.75e-06 | 6262.42 ms | 21.6% bf16 MFU | 114871 tok/s step 19009/19560 | loss 3.295804 (+0.35z)| norm 0.2027 (+0.01z)| lr 1.74e-06 | 5396.13 ms | 25.0% bf16 MFU | 113986 tok/s step 19010/19560 | loss 3.284417 (+0.01z)| norm 0.2032 (+0.06z)| lr 1.74e-06 | 4276.74 ms | 31.6% bf16 MFU | 114416 tok/s step 19011/19560 | loss 3.401875 (+3.27z)| norm 0.2078 (+0.60z)| lr 1.73e-06 | 4469.22 ms | 30.2% bf16 MFU | 114561 tok/s step 19012/19560 | loss 3.230453 (-1.51z)| norm 0.1952 (-0.88z)| lr 1.73e-06 | 4162.09 ms | 32.4% bf16 MFU | 115131 tok/s step 19013/19560 | loss 3.257765 (-0.74z)| norm 0.1958 (-0.80z)| lr 1.72e-06 | 5360.32 ms | 25.2% bf16 MFU | 114265 tok/s step 19014/19560 | loss 3.292291 (+0.24z)| norm 0.1996 (-0.36z)| lr 1.71e-06 | 4746.51 ms | 28.4% bf16 MFU | 114074 tok/s step 19015/19560 | loss 3.339529 (+1.56z)| norm 0.2251 (+2.55z)| lr 1.71e-06 | 4155.73 ms | 32.5% bf16 MFU | 114679 tok/s step 19016/19560 | loss 3.269513 (-0.40z)| norm 0.1997 (-0.36z)| lr 1.70e-06 | 4471.46 ms | 30.2% bf16 MFU | 114807 tok/s step 19017/19560 | loss 3.216829 (-1.86z)| norm 0.2181 (+1.77z)| lr 1.69e-06 | 4255.72 ms | 31.7% bf16 MFU | 115227 tok/s step 19018/19560 | loss 3.243927 (-1.09z)| norm 0.1936 (-1.05z)| lr 1.69e-06 | 6085.63 ms | 22.2% bf16 MFU | 113773 tok/s step 19019/19560 | loss 3.260468 (-0.62z)| norm 0.1918 (-1.26z)| lr 1.68e-06 | 5801.76 ms | 23.3% bf16 MFU | 112603 tok/s step 19020/19560 | loss 3.216294 (-1.81z)| norm 0.2132 (+1.19z)| lr 1.68e-06 | 5495.93 ms | 24.6% bf16 MFU | 111742 tok/s step 19021/19560 | loss 3.279006 (-0.09z)| norm 0.2027 (-0.00z)| 
lr 1.67e-06 | 4540.09 ms | 29.7% bf16 MFU | 111929 tok/s step 19022/19560 | loss 3.403871 (+3.21z)| norm 0.2164 (+1.55z)| lr 1.66e-06 | 4808.97 ms | 28.1% bf16 MFU | 111784 tok/s step 19023/19560 | loss 3.318288 (+0.94z)| norm 0.1966 (-0.70z)| lr 1.66e-06 | 5867.45 ms | 23.0% bf16 MFU | 110663 tok/s step 19024/19560 | loss 3.280687 (-0.05z)| norm 0.2095 (+0.75z)| lr 1.65e-06 | 5538.31 ms | 24.4% bf16 MFU | 109863 tok/s step 19025/19560 | loss 3.293859 (+0.29z)| norm 0.2059 (+0.33z)| lr 1.65e-06 | 4177.90 ms | 32.3% bf16 MFU | 110644 tok/s step 19026/19560 | loss 3.264288 (-0.50z)| norm 0.2063 (+0.37z)| lr 1.64e-06 | 4428.15 ms | 30.5% bf16 MFU | 111032 tok/s step 19027/19560 | loss 3.274192 (-0.23z)| norm 0.2389 (+3.83z)| lr 1.63e-06 | 5374.27 ms | 25.1% bf16 MFU | 110358 tok/s step 19028/19560 | loss 3.258965 (-0.63z)| norm 0.2102 (+0.73z)| lr 1.63e-06 | 5725.18 ms | 23.6% bf16 MFU | 109419 tok/s step 19029/19560 | loss 3.254384 (-0.74z)| norm 0.2003 (-0.33z)| lr 1.62e-06 | 5091.68 ms | 26.5% bf16 MFU | 109096 tok/s step 19030/19560 | loss 3.322500 (+1.06z)| norm 0.2033 (-0.01z)| lr 1.61e-06 | 4363.66 ms | 30.9% bf16 MFU | 109649 tok/s step 19031/19560 | loss 3.307182 (+0.64z)| norm 0.2004 (-0.32z)| lr 1.61e-06 | 6330.35 ms | 21.3% bf16 MFU | 108308 tok/s step 19032/19560 | loss 3.265741 (-0.46z)| norm 0.2051 (+0.19z)| lr 1.60e-06 | 4314.61 ms | 31.3% bf16 MFU | 108968 tok/s step 19033/19560 | loss 3.220700 (-1.64z)| norm 0.1987 (-0.50z)| lr 1.60e-06 | 4456.66 ms | 30.3% bf16 MFU | 109402 tok/s step 19034/19560 | loss 3.265199 (-0.44z)| norm 0.1950 (-0.89z)| lr 1.59e-06 | 4913.91 ms | 27.5% bf16 MFU | 109266 tok/s step 19035/19560 | loss 3.314553 (+0.87z)| norm 0.1927 (-1.13z)| lr 1.58e-06 | 4202.94 ms | 32.1% bf16 MFU | 110040 tok/s step 19036/19560 | loss 3.284561 (+0.06z)| norm 0.1915 (-1.24z)| lr 1.58e-06 | 4162.75 ms | 32.4% bf16 MFU | 110836 tok/s step 19037/19560 | loss 3.239677 (-1.14z)| norm 0.2075 (+0.46z)| lr 1.57e-06 | 4815.04 ms | 28.0% bf16 MFU | 
110738 tok/s step 19038/19560 | loss 3.230461 (-1.36z)| norm 0.2009 (-0.24z)| lr 1.57e-06 | 4256.62 ms | 31.7% bf16 MFU | 111360 tok/s step 19039/19560 | loss 3.250185 (-0.83z)| norm 0.2095 (+0.67z)| lr 1.56e-06 | 4574.70 ms | 29.5% bf16 MFU | 111522 tok/s step 19040/19560 | loss 3.365945 (+2.19z)| norm 0.2155 (+1.29z)| lr 1.55e-06 | 4334.99 ms | 31.1% bf16 MFU | 111993 tok/s step 19041/19560 | loss 3.200355 (-2.10z)| norm 0.2217 (+1.90z)| lr 1.55e-06 | 4734.54 ms | 28.5% bf16 MFU | 111930 tok/s step 19042/19560 | loss 3.290634 (+0.23z)| norm 0.1965 (-0.73z)| lr 1.54e-06 | 4672.72 ms | 28.9% bf16 MFU | 111944 tok/s step 19043/19560 | loss 3.278677 (-0.07z)| norm 0.1943 (-0.96z)| lr 1.54e-06 | 4359.27 ms | 31.0% bf16 MFU | 112360 tok/s step 19044/19560 | loss 3.274416 (-0.18z)| norm 0.1999 (-0.37z)| lr 1.53e-06 | 4336.14 ms | 31.1% bf16 MFU | 112788 tok/s step 19045/19560 | loss 3.285381 (+0.10z)| norm 0.1983 (-0.53z)| lr 1.52e-06 | 4178.28 ms | 32.3% bf16 MFU | 113422 tok/s step 19046/19560 | loss 3.243583 (-0.98z)| norm 0.1971 (-0.65z)| lr 1.52e-06 | 4568.82 ms | 29.6% bf16 MFU | 113489 tok/s step 19047/19560 | loss 3.233833 (-1.22z)| norm 0.1975 (-0.61z)| lr 1.51e-06 | 4191.70 ms | 32.2% bf16 MFU | 114068 tok/s step 19048/19560 | loss 3.317775 (+0.93z)| norm 0.2025 (-0.09z)| lr 1.51e-06 | 4189.57 ms | 32.2% bf16 MFU | 114622 tok/s step 19049/19560 | loss 3.279073 (-0.07z)| norm 0.2040 (+0.06z)| lr 1.50e-06 | 4181.39 ms | 32.3% bf16 MFU | 115160 tok/s step 19050/19560 | loss 3.294147 (+0.33z)| norm 0.1977 (-0.60z)| lr 1.50e-06 | 4292.51 ms | 31.5% bf16 MFU | 115509 tok/s step 19051/19560 | loss 3.222697 (-1.50z)| norm 0.1998 (-0.37z)| lr 1.49e-06 | 4327.85 ms | 31.2% bf16 MFU | 115791 tok/s step 19052/19560 | loss 3.266876 (-0.36z)| norm 0.2083 (+0.51z)| lr 1.48e-06 | 4218.36 ms | 32.0% bf16 MFU | 116216 tok/s step 19053/19560 | loss 3.266390 (-0.37z)| norm 0.2081 (+0.49z)| lr 1.48e-06 | 4553.82 ms | 29.6% bf16 MFU | 116161 tok/s step 19054/19560 | loss 3.249535 
(-0.81z)| norm 0.1951 (-0.98z)| lr 1.47e-06 | 4181.03 ms | 32.3% bf16 MFU | 116623 tok/s step 19055/19560 | loss 3.267775 (-0.32z)| norm 0.2068 (+0.48z)| lr 1.47e-06 | 4307.01 ms | 31.3% bf16 MFU | 116878 tok/s step 19056/19560 | loss 3.232142 (-1.24z)| norm 0.2043 (+0.16z)| lr 1.46e-06 | 4194.59 ms | 32.2% bf16 MFU | 117284 tok/s step 19057/19560 | loss 3.307015 (+0.69z)| norm 0.1976 (-0.67z)| lr 1.45e-06 | 4329.84 ms | 31.2% bf16 MFU | 117474 tok/s step 19058/19560 | loss 3.239271 (-1.05z)| norm 0.2005 (-0.30z)| lr 1.45e-06 | 4305.49 ms | 31.4% bf16 MFU | 117689 tok/s step 19059/19560 | loss 3.265733 (-0.36z)| norm 0.2019 (-0.13z)| lr 1.44e-06 | 4451.59 ms | 30.3% bf16 MFU | 117693 tok/s step 19060/19560 | loss 3.318044 (+0.97z)| norm 0.2073 (+0.54z)| lr 1.44e-06 | 4501.29 ms | 30.0% bf16 MFU | 117633 tok/s step 19061/19560 | loss 3.330088 (+1.27z)| norm 0.2032 (+0.02z)| lr 1.43e-06 | 4177.04 ms | 32.3% bf16 MFU | 118027 tok/s step 19062/19560 | loss 3.264175 (-0.41z)| norm 0.1981 (-0.64z)| lr 1.43e-06 | 4185.43 ms | 32.3% bf16 MFU | 118389 tok/s step 19063/19560 | loss 3.321182 (+1.03z)| norm 0.2011 (-0.25z)| lr 1.42e-06 | 4228.14 ms | 31.9% bf16 MFU | 118669 tok/s step 19064/19560 | loss 3.297675 (+0.43z)| norm 0.1970 (-0.76z)| lr 1.41e-06 | 4421.39 ms | 30.5% bf16 MFU | 118665 tok/s step 19065/19560 | loss 3.254387 (-0.67z)| norm 0.2017 (-0.17z)| lr 1.41e-06 | 4414.34 ms | 30.6% bf16 MFU | 118670 tok/s step 19066/19560 | loss 3.339763 (+1.50z)| norm 0.2514 (+5.34z)| lr 1.40e-06 | 4275.88 ms | 31.6% bf16 MFU | 118867 tok/s step 19067/19560 | loss 3.224370 (-1.41z)| norm 0.2038 (+0.04z)| lr 1.40e-06 | 4270.29 ms | 31.6% bf16 MFU | 119063 tok/s step 19068/19560 | loss 3.277913 (-0.07z)| norm 0.1985 (-0.55z)| lr 1.39e-06 | 4253.23 ms | 31.7% bf16 MFU | 119273 tok/s step 19069/19560 | loss 3.239099 (-1.03z)| norm 0.2012 (-0.25z)| lr 1.39e-06 | 4512.83 ms | 29.9% bf16 MFU | 119118 tok/s step 19070/19560 | loss 3.295957 (+0.42z)| norm 0.1942 (-1.02z)| lr 1.38e-06 | 
4588.40 ms | 29.4% bf16 MFU | 118875 tok/s step 19071/19560 | loss 3.253554 (-0.68z)| norm 0.2064 (+0.33z)| lr 1.38e-06 | 4289.19 ms | 31.5% bf16 MFU | 119043 tok/s step 19072/19560 | loss 3.286561 (+0.17z)| norm 0.2083 (+0.53z)| lr 1.37e-06 | 4661.85 ms | 29.0% bf16 MFU | 118714 tok/s step 19073/19560 | loss 3.219861 (-1.54z)| norm 0.1989 (-0.50z)| lr 1.36e-06 | 4287.58 ms | 31.5% bf16 MFU | 118893 tok/s step 19074/19560 | loss 3.297081 (+0.47z)| norm 0.2066 (+0.35z)| lr 1.36e-06 | 4238.28 ms | 31.9% bf16 MFU | 119133 tok/s step 19075/19560 | loss 3.284854 (+0.14z)| norm 0.1981 (-0.59z)| lr 1.35e-06 | 4306.30 ms | 31.4% bf16 MFU | 119264 tok/s step 19076/19560 | loss 3.304683 (+0.65z)| norm 0.1987 (-0.53z)| lr 1.35e-06 | 4179.30 ms | 32.3% bf16 MFU | 119573 tok/s step 19077/19560 | loss 3.473331 (+4.61z)| norm 0.2416 (+3.95z)| lr 1.34e-06 | 4172.50 ms | 32.4% bf16 MFU | 119877 tok/s step 19078/19560 | loss 3.223049 (-1.38z)| norm 0.1980 (-0.59z)| lr 1.34e-06 | 4175.18 ms | 32.3% bf16 MFU | 120162 tok/s step 19079/19560 | loss 3.289763 (+0.23z)| norm 0.1945 (-0.95z)| lr 1.33e-06 | 4460.05 ms | 30.3% bf16 MFU | 120032 tok/s step 19080/19560 | loss 3.286262 (+0.15z)| norm 0.2019 (-0.19z)| lr 1.33e-06 | 4286.34 ms | 31.5% bf16 MFU | 120146 tok/s step 19081/19560 | loss 3.289507 (+0.22z)| norm 0.2204 (+1.73z)| lr 1.32e-06 | 4197.28 ms | 32.2% bf16 MFU | 120384 tok/s step 19082/19560 | loss 3.197101 (-1.98z)| norm 0.2029 (-0.09z)| lr 1.31e-06 | 4267.22 ms | 31.6% bf16 MFU | 120508 tok/s step 19083/19560 | loss 3.281118 (+0.02z)| norm 0.2054 (+0.18z)| lr 1.31e-06 | 4661.75 ms | 29.0% bf16 MFU | 120106 tok/s step 19084/19560 | loss 3.260386 (-0.48z)| norm 0.2113 (+0.78z)| lr 1.30e-06 | 4178.70 ms | 32.3% bf16 MFU | 120374 tok/s step 19085/19560 | loss 3.267488 (-0.30z)| norm 0.2162 (+1.27z)| lr 1.30e-06 | 4179.91 ms | 32.3% bf16 MFU | 120627 tok/s step 19086/19560 | loss 3.303402 (+0.55z)| norm 0.2014 (-0.26z)| lr 1.29e-06 | 4408.30 ms | 30.6% bf16 MFU | 120542 tok/s step 
19087/19560 | loss 3.297607 (+0.40z)| norm 0.2025 (-0.14z)| lr 1.29e-06 | 4632.23 ms | 29.1% bf16 MFU | 120174 tok/s step 19088/19560 | loss 3.386770 (+2.46z)| norm 0.2167 (+1.35z)| lr 1.28e-06 | 4436.59 ms | 30.4% bf16 MFU | 120074 tok/s step 19089/19560 | loss 3.406556 (+2.82z)| norm 0.2889 (+6.96z)| lr 1.28e-06 | 4191.10 ms | 32.2% bf16 MFU | 120325 tok/s step 19090/19560 | loss 3.287288 (+0.12z)| norm 0.1999 (-0.38z)| lr 1.27e-06 | 4248.91 ms | 31.8% bf16 MFU | 120479 tok/s step 19091/19560 | loss 3.251746 (-0.68z)| norm 0.1916 (-1.05z)| lr 1.27e-06 | 4256.01 ms | 31.7% bf16 MFU | 120614 tok/s step 19092/19560 | loss 3.341773 (+1.33z)| norm 0.2075 (+0.24z)| lr 1.26e-06 | 4180.85 ms | 32.3% bf16 MFU | 120853 tok/s step 19093/19560 | loss 3.261563 (-0.49z)| norm 0.1949 (-0.79z)| lr 1.25e-06 | 4300.56 ms | 31.4% bf16 MFU | 120906 tok/s step 19094/19560 | loss 3.274562 (-0.19z)| norm 0.2034 (-0.09z)| lr 1.25e-06 | 4232.65 ms | 31.9% bf16 MFU | 121054 tok/s step 19095/19560 | loss 3.272035 (-0.24z)| norm 0.1983 (-0.50z)| lr 1.24e-06 | 4199.13 ms | 32.2% bf16 MFU | 121244 tok/s step 19096/19560 | loss 3.224682 (-1.29z)| norm 0.1997 (-0.39z)| lr 1.24e-06 | 4289.78 ms | 31.5% bf16 MFU | 121293 tok/s step 19097/19560 | loss 3.260151 (-0.48z)| norm 0.1983 (-0.49z)| lr 1.23e-06 | 4274.54 ms | 31.6% bf16 MFU | 121361 tok/s step 19098/19560 | loss 3.256207 (-0.56z)| norm 0.2019 (-0.20z)| lr 1.23e-06 | 4292.67 ms | 31.5% bf16 MFU | 121400 tok/s step 19099/19560 | loss 3.235816 (-1.01z)| norm 0.1973 (-0.58z)| lr 1.22e-06 | 4324.34 ms | 31.2% bf16 MFU | 121392 tok/s step 19100/19560 | loss 3.227990 (-1.17z)| norm 0.2004 (-0.33z)| lr 1.22e-06 | 4184.50 ms | 32.3% bf16 MFU | 121587 tok/s step 19101/19560 | loss 3.324953 (+1.01z)| norm 0.2025 (-0.14z)| lr 1.21e-06 | 4174.25 ms | 32.3% bf16 MFU | 121788 tok/s step 19102/19560 | loss 3.342223 (+1.37z)| norm 0.2148 (+0.88z)| lr 1.21e-06 | 4321.11 ms | 31.2% bf16 MFU | 121765 tok/s step 19103/19560 | loss 3.315446 (+0.77z)| norm 
0.2023 (-0.18z)| lr 1.20e-06 | 4423.70 ms | 30.5% bf16 MFU | 121603 tok/s step 19104/19560 | loss 3.214358 (-1.48z)| norm 0.1979 (-0.54z)| lr 1.20e-06 | 4245.24 ms | 31.8% bf16 MFU | 121697 tok/s step 19105/19560 | loss 3.255528 (-0.56z)| norm 0.1954 (-0.76z)| lr 1.19e-06 | 4271.35 ms | 31.6% bf16 MFU | 121750 tok/s step 19106/19560 | loss 3.274443 (-0.15z)| norm 0.1993 (-0.42z)| lr 1.19e-06 | 4388.76 ms | 30.8% bf16 MFU | 121635 tok/s step 19107/19560 | loss 3.307233 (+0.58z)| norm 0.2119 (+0.63z)| lr 1.18e-06 | 4211.28 ms | 32.1% bf16 MFU | 121778 tok/s step 19108/19560 | loss 3.305075 (+0.53z)| norm 0.1956 (-0.73z)| lr 1.18e-06 | 4178.85 ms | 32.3% bf16 MFU | 121963 tok/s step 19109/19560 | loss 3.213300 (-1.49z)| norm 0.2007 (-0.30z)| lr 1.17e-06 | 4222.37 ms | 32.0% bf16 MFU | 122073 tok/s step 19110/19560 | loss 3.294572 (+0.31z)| norm 0.1931 (-0.93z)| lr 1.16e-06 | 4181.56 ms | 32.3% bf16 MFU | 122238 tok/s step 19111/19560 | loss 3.294478 (+0.31z)| norm 0.1967 (-0.63z)| lr 1.16e-06 | 4187.39 ms | 32.2% bf16 MFU | 122387 tok/s step 19112/19560 | loss 3.356173 (+1.66z)| norm 0.2138 (+0.78z)| lr 1.15e-06 | 4194.48 ms | 32.2% bf16 MFU | 122517 tok/s step 19113/19560 | loss 3.296576 (+0.34z)| norm 0.1968 (-0.63z)| lr 1.15e-06 | 4201.07 ms | 32.1% bf16 MFU | 122631 tok/s step 19114/19560 | loss 3.269625 (-0.25z)| norm 0.1950 (-0.77z)| lr 1.14e-06 | 4193.63 ms | 32.2% bf16 MFU | 122751 tok/s step 19115/19560 | loss 3.270612 (-0.23z)| norm 0.2058 (+0.13z)| lr 1.14e-06 | 4283.37 ms | 31.5% bf16 MFU | 122733 tok/s step 19116/19560 | loss 3.418892 (+2.93z)| norm 0.2082 (+0.32z)| lr 1.13e-06 | 4173.76 ms | 32.3% bf16 MFU | 122877 tok/s step 19117/19560 | loss 3.269035 (-0.29z)| norm 0.2076 (+0.28z)| lr 1.13e-06 | 4179.52 ms | 32.3% bf16 MFU | 123005 tok/s step 19118/19560 | loss 3.210008 (-1.53z)| norm 0.1993 (-0.42z)| lr 1.12e-06 | 4205.35 ms | 32.1% bf16 MFU | 123089 tok/s step 19119/19560 | loss 3.289903 (+0.17z)| norm 0.2012 (-0.26z)| lr 1.12e-06 | 6363.77 ms | 
21.2% bf16 MFU | 121054 tok/s step 19120/19560 | loss 3.292993 (+0.23z)| norm 0.1979 (-0.54z)| lr 1.11e-06 | 4173.36 ms | 32.4% bf16 MFU | 121282 tok/s step 19121/19560 | loss 3.313313 (+0.66z)| norm 0.1988 (-0.46z)| lr 1.11e-06 | 4159.08 ms | 32.5% bf16 MFU | 121521 tok/s step 19122/19560 | loss 3.212458 (-1.48z)| norm 0.1993 (-0.42z)| lr 1.10e-06 | 4165.77 ms | 32.4% bf16 MFU | 121738 tok/s step 19123/19560 | loss 3.283953 (+0.05z)| norm 0.1977 (-0.55z)| lr 1.10e-06 | 4218.60 ms | 32.0% bf16 MFU | 121865 tok/s step 19124/19560 | loss 3.204930 (-1.61z)| norm 0.2028 (-0.11z)| lr 1.09e-06 | 4304.99 ms | 31.4% bf16 MFU | 121861 tok/s step 19125/19560 | loss 3.250827 (-0.63z)| norm 0.1965 (-0.63z)| lr 1.09e-06 | 4165.47 ms | 32.4% bf16 MFU | 122061 tok/s step 19126/19560 | loss 3.221955 (-1.22z)| norm 0.2012 (-0.23z)| lr 1.08e-06 | 4176.28 ms | 32.3% bf16 MFU | 122235 tok/s step 19127/19560 | loss 3.232643 (-1.00z)| norm 0.2087 (+0.40z)| lr 1.08e-06 | 4208.70 ms | 32.1% bf16 MFU | 122352 tok/s step 19128/19560 | loss 3.276406 (-0.07z)| norm 0.1966 (-0.62z)| lr 1.07e-06 | 4197.67 ms | 32.2% bf16 MFU | 122479 tok/s step 19129/19560 | loss 3.317774 (+0.80z)| norm 0.1948 (-0.77z)| lr 1.07e-06 | 4233.09 ms | 31.9% bf16 MFU | 122548 tok/s step 19130/19560 | loss 3.283643 (+0.08z)| norm 0.1948 (-0.76z)| lr 1.06e-06 | 4160.35 ms | 32.5% bf16 MFU | 122722 tok/s step 19131/19560 | loss 3.396396 (+2.40z)| norm 0.2878 (+5.97z)| lr 1.06e-06 | 4193.47 ms | 32.2% bf16 MFU | 122837 tok/s step 19132/19560 | loss 3.242585 (-0.77z)| norm 0.1966 (-0.56z)| lr 1.05e-06 | 4186.72 ms | 32.2% bf16 MFU | 122956 tok/s step 19133/19560 | loss 3.271174 (-0.19z)| norm 0.2049 (+0.03z)| lr 1.05e-06 | 4187.18 ms | 32.2% bf16 MFU | 123069 tok/s step 19134/19560 | loss 3.234887 (-0.94z)| norm 0.1982 (-0.45z)| lr 1.04e-06 | 4265.65 ms | 31.7% bf16 MFU | 123061 tok/s step 19135/19560 | loss 3.264723 (-0.31z)| norm 0.2075 (+0.21z)| lr 1.04e-06 | 4168.96 ms | 32.4% bf16 MFU | 123196 tok/s step 19136/19560 
| loss 3.302979 (+0.49z)| norm 0.1957 (-0.63z)| lr 1.03e-06 | 4738.66 ms | 28.5% bf16 MFU | 122568 tok/s step 19137/19560 | loss 3.285560 (+0.12z)| norm 0.2004 (-0.29z)| lr 1.03e-06 | 5415.77 ms | 24.9% bf16 MFU | 121280 tok/s step 19138/19560 | loss 3.298441 (+0.39z)| norm 0.2154 (+0.77z)| lr 1.02e-06 | 4226.23 ms | 31.9% bf16 MFU | 121419 tok/s step 19139/19560 | loss 3.254353 (-0.52z)| norm 0.1966 (-0.56z)| lr 1.02e-06 | 4370.63 ms | 30.9% bf16 MFU | 121346 tok/s step 19140/19560 | loss 3.250344 (-0.62z)| norm 0.1985 (-0.43z)| lr 1.02e-06 | 4426.75 ms | 30.5% bf16 MFU | 121201 tok/s step 19141/19560 | loss 3.322880 (+0.94z)| norm 0.2072 (+0.19z)| lr 1.01e-06 | 4424.19 ms | 30.5% bf16 MFU | 121066 tok/s step 19142/19560 | loss 3.276070 (-0.07z)| norm 0.2148 (+0.72z)| lr 1.01e-06 | 4180.39 ms | 32.3% bf16 MFU | 121283 tok/s step 19143/19560 | loss 3.275940 (-0.06z)| norm 0.2136 (+0.65z)| lr 1.00e-06 | 4332.61 ms | 31.2% bf16 MFU | 121270 tok/s step 19144/19560 | loss 3.247978 (-0.66z)| norm 0.1994 (-0.37z)| lr 9.96e-07 | 4271.06 ms | 31.6% bf16 MFU | 121344 tok/s step 19145/19560 | loss 3.278290 (-0.02z)| norm 0.2036 (-0.06z)| lr 9.91e-07 | 4162.33 ms | 32.4% bf16 MFU | 121575 tok/s step 19146/19560 | loss 3.291873 (+0.27z)| norm 0.1968 (-0.56z)| lr 9.86e-07 | 4177.45 ms | 32.3% bf16 MFU | 121771 tok/s step 19147/19560 | loss 3.200762 (-1.69z)| norm 0.2070 (+0.17z)| lr 9.82e-07 | 4167.68 ms | 32.4% bf16 MFU | 121972 tok/s step 19148/19560 | loss 3.317230 (+0.82z)| norm 0.2064 (+0.13z)| lr 9.77e-07 | 4179.25 ms | 32.3% bf16 MFU | 122146 tok/s step 19149/19560 | loss 3.278603 (-0.02z)| norm 0.1960 (-0.62z)| lr 9.72e-07 | 4171.24 ms | 32.4% bf16 MFU | 122324 tok/s step 19150/19560 | loss 3.240700 (-0.84z)| norm 0.1993 (-0.37z)| lr 9.68e-07 | 4206.78 ms | 32.1% bf16 MFU | 122439 tok/s step 19151/19560 | loss 3.238242 (-0.89z)| norm 0.2038 (-0.05z)| lr 9.63e-07 | 4233.60 ms | 31.9% bf16 MFU | 122509 tok/s step 19152/19560 | loss 3.268451 (-0.21z)| norm 0.2006 (-0.27z)| 
lr 9.58e-07 | 4590.60 ms | 29.4% bf16 MFU | 122094 tok/s step 19153/19560 | loss 3.263532 (-0.31z)| norm 0.1989 (-0.39z)| lr 9.54e-07 | 4294.86 ms | 31.4% bf16 MFU | 122093 tok/s step 19154/19560 | loss 3.275086 (-0.05z)| norm 0.1939 (-0.75z)| lr 9.49e-07 | 4337.45 ms | 31.1% bf16 MFU | 122032 tok/s step 19155/19560 | loss 3.195881 (-1.79z)| norm 0.2022 (-0.13z)| lr 9.44e-07 | 4151.98 ms | 32.5% bf16 MFU | 122244 tok/s step 19156/19560 | loss 3.230413 (-1.02z)| norm 0.1973 (-0.49z)| lr 9.40e-07 | 4171.01 ms | 32.4% bf16 MFU | 122417 tok/s step 19157/19560 | loss 3.233437 (-0.95z)| norm 0.1964 (-0.55z)| lr 9.35e-07 | 4330.81 ms | 31.2% bf16 MFU | 122349 tok/s step 19158/19560 | loss 3.294336 (+0.40z)| norm 0.2032 (-0.05z)| lr 9.30e-07 | 4154.68 ms | 32.5% bf16 MFU | 122541 tok/s step 19159/19560 | loss 3.270811 (-0.12z)| norm 0.1987 (-0.37z)| lr 9.26e-07 | 4163.20 ms | 32.4% bf16 MFU | 122711 tok/s step 19160/19560 | loss 3.240921 (-0.77z)| norm 0.2061 (+0.17z)| lr 9.21e-07 | 4245.19 ms | 31.8% bf16 MFU | 122750 tok/s step 19161/19560 | loss 3.301958 (+0.56z)| norm 0.1942 (-0.71z)| lr 9.16e-07 | 4153.58 ms | 32.5% bf16 MFU | 122924 tok/s step 19162/19560 | loss 3.263433 (-0.29z)| norm 0.1896 (-1.04z)| lr 9.12e-07 | 4190.75 ms | 32.2% bf16 MFU | 123033 tok/s step 19163/19560 | loss 3.265435 (-0.24z)| norm 0.1991 (-0.34z)| lr 9.07e-07 | 4167.90 ms | 32.4% bf16 MFU | 123171 tok/s step 19164/19560 | loss 3.319026 (+0.94z)| norm 0.1993 (-0.34z)| lr 9.03e-07 | 4168.66 ms | 32.4% bf16 MFU | 123301 tok/s step 19165/19560 | loss 3.301068 (+0.54z)| norm 0.2040 (+0.02z)| lr 8.98e-07 | 4153.52 ms | 32.5% bf16 MFU | 123447 tok/s step 19166/19560 | loss 3.343525 (+1.45z)| norm 0.2072 (+0.25z)| lr 8.94e-07 | 4151.07 ms | 32.5% bf16 MFU | 123590 tok/s step 19167/19560 | loss 3.229666 (-1.05z)| norm 0.2006 (-0.24z)| lr 8.89e-07 | 4193.41 ms | 32.2% bf16 MFU | 123662 tok/s step 19168/19560 | loss 3.258399 (-0.41z)| norm 0.2142 (+0.78z)| lr 8.85e-07 | 4166.56 ms | 32.4% bf16 MFU | 
123770 tok/s step 19169/19560 | loss 3.242981 (-0.77z)| norm 0.2019 (-0.13z)| lr 8.80e-07 | 4300.98 ms | 31.4% bf16 MFU | 123677 tok/s step 19170/19560 | loss 3.295000 (+0.40z)| norm 0.2010 (-0.20z)| lr 8.76e-07 | 4163.40 ms | 32.4% bf16 MFU | 123789 tok/s step 19171/19560 | loss 3.290896 (+0.31z)| norm 0.2091 (+0.40z)| lr 8.71e-07 | 4164.61 ms | 32.4% bf16 MFU | 123894 tok/s step 19172/19560 | loss 3.315880 (+0.86z)| norm 0.1940 (-0.73z)| lr 8.67e-07 | 4207.54 ms | 32.1% bf16 MFU | 123930 tok/s step 19173/19560 | loss 3.298965 (+0.48z)| norm 0.1979 (-0.44z)| lr 8.62e-07 | 4298.68 ms | 31.4% bf16 MFU | 123832 tok/s step 19174/19560 | loss 3.229877 (-1.07z)| norm 0.2022 (-0.11z)| lr 8.58e-07 | 4165.28 ms | 32.4% bf16 MFU | 123934 tok/s step 19175/19560 | loss 3.283754 (+0.13z)| norm 0.1980 (-0.43z)| lr 8.53e-07 | 5191.55 ms | 26.0% bf16 MFU | 122787 tok/s step 19176/19560 | loss 3.281445 (+0.09z)| norm 0.2193 (+1.15z)| lr 8.49e-07 | 4604.31 ms | 29.3% bf16 MFU | 122341 tok/s step 19177/19560 | loss 3.233761 (-0.98z)| norm 0.2010 (-0.21z)| lr 8.45e-07 | 4428.42 ms | 30.5% bf16 MFU | 122143 tok/s step 19178/19560 | loss 3.314943 (+0.84z)| norm 0.1979 (-0.45z)| lr 8.40e-07 | 4485.49 ms | 30.1% bf16 MFU | 121880 tok/s step 19179/19560 | loss 3.340591 (+1.39z)| norm 0.1990 (-0.36z)| lr 8.36e-07 | 4194.96 ms | 32.2% bf16 MFU | 122035 tok/s step 19180/19560 | loss 3.273526 (-0.11z)| norm 0.2012 (-0.19z)| lr 8.32e-07 | 4492.26 ms | 30.1% bf16 MFU | 121769 tok/s step 19181/19560 | loss 3.355115 (+1.68z)| norm 0.2394 (+2.57z)| lr 8.27e-07 | 4183.95 ms | 32.3% bf16 MFU | 121946 tok/s step 19182/19560 | loss 3.254859 (-0.54z)| norm 0.2036 (-0.04z)| lr 8.23e-07 | 4747.18 ms | 28.4% bf16 MFU | 121371 tok/s step 19183/19560 | loss 3.233024 (-1.01z)| norm 0.2237 (+1.40z)| lr 8.18e-07 | 4472.90 ms | 30.2% bf16 MFU | 121163 tok/s step 19184/19560 | loss 3.216996 (-1.36z)| norm 0.2029 (-0.10z)| lr 8.14e-07 | 4179.51 ms | 32.3% bf16 MFU | 121377 tok/s step 19185/19560 | loss 3.232779 
(-0.99z)| norm 0.2009 (-0.24z)| lr 8.10e-07 | 4439.30 ms | 30.4% bf16 MFU | 121213 tok/s step 19186/19560 | loss 3.269833 (-0.19z)| norm 0.1945 (-0.70z)| lr 8.06e-07 | 4347.54 ms | 31.1% bf16 MFU | 121182 tok/s step 19187/19560 | loss 3.312672 (+0.74z)| norm 0.2195 (+1.08z)| lr 8.01e-07 | 4200.30 ms | 32.1% bf16 MFU | 121364 tok/s step 19188/19560 | loss 3.284906 (+0.14z)| norm 0.2052 (+0.06z)| lr 7.97e-07 | 4949.70 ms | 27.3% bf16 MFU | 120592 tok/s step 19189/19560 | loss 3.200187 (-1.69z)| norm 0.1978 (-0.47z)| lr 7.93e-07 | 4209.19 ms | 32.1% bf16 MFU | 120790 tok/s step 19190/19560 | loss 3.248313 (-0.63z)| norm 0.2065 (+0.15z)| lr 7.88e-07 | 4388.06 ms | 30.8% bf16 MFU | 120725 tok/s step 19191/19560 | loss 3.316794 (+0.86z)| norm 0.2048 (+0.03z)| lr 7.84e-07 | 4139.46 ms | 32.6% bf16 MFU | 121022 tok/s step 19192/19560 | loss 3.216273 (-1.31z)| norm 0.1968 (-0.55z)| lr 7.80e-07 | 5069.64 ms | 26.6% bf16 MFU | 120141 tok/s step 19193/19560 | loss 3.290413 (+0.29z)| norm 0.2105 (+0.43z)| lr 7.76e-07 | 4148.21 ms | 32.5% bf16 MFU | 120454 tok/s step 19194/19560 | loss 3.311889 (+0.77z)| norm 0.1990 (-0.38z)| lr 7.72e-07 | 4149.90 ms | 32.5% bf16 MFU | 120748 tok/s step 19195/19560 | loss 3.287145 (+0.22z)| norm 0.2092 (+0.38z)| lr 7.67e-07 | 4187.73 ms | 32.2% bf16 MFU | 120970 tok/s step 19196/19560 | loss 3.227533 (-1.08z)| norm 0.2013 (-0.21z)| lr 7.63e-07 | 4228.04 ms | 31.9% bf16 MFU | 121122 tok/s step 19197/19560 | loss 3.312331 (+0.76z)| norm 0.2019 (-0.17z)| lr 7.59e-07 | 4137.73 ms | 32.6% bf16 MFU | 121401 tok/s step 19198/19560 | loss 3.425869 (+3.10z)| norm 0.2112 (+0.52z)| lr 7.55e-07 | 4148.57 ms | 32.5% bf16 MFU | 121650 tok/s step 19199/19560 | loss 3.283019 (+0.09z)| norm 0.1946 (-0.72z)| lr 7.51e-07 | 4154.26 ms | 32.5% bf16 MFU | 121878 tok/s step 19200/19560 | loss 3.289552 (+0.23z)| norm 0.1975 (-0.50z)| lr 7.47e-07 | 4256.89 ms | 31.7% bf16 MFU | 121942 tok/s step 19201/19560 | loss 3.312292 (+0.70z)| norm 0.2027 (-0.11z)| lr 7.42e-07 | 
4148.90 ms | 32.5% bf16 MFU | 122163 tok/s step 19202/19560 | loss 3.245451 (-0.71z)| norm 0.1955 (-0.64z)| lr 7.38e-07 | 4155.10 ms | 32.5% bf16 MFU | 122364 tok/s step 19203/19560 | loss 3.294892 (+0.33z)| norm 0.1998 (-0.32z)| lr 7.34e-07 | 4156.93 ms | 32.5% bf16 MFU | 122552 tok/s step 19204/19560 | loss 3.299811 (+0.44z)| norm 0.2030 (-0.08z)| lr 7.30e-07 | 4161.55 ms | 32.4% bf16 MFU | 122724 tok/s step 19205/19560 | loss 3.281177 (+0.08z)| norm 0.1991 (-0.36z)| lr 7.26e-07 | 4410.92 ms | 30.6% bf16 MFU | 122531 tok/s step 19206/19560 | loss 3.237979 (-0.90z)| norm 0.2037 (-0.01z)| lr 7.22e-07 | 4161.80 ms | 32.4% bf16 MFU | 122703 tok/s step 19207/19560 | loss 3.241998 (-0.80z)| norm 0.2057 (+0.14z)| lr 7.18e-07 | 4166.29 ms | 32.4% bf16 MFU | 122860 tok/s step 19208/19560 | loss 3.285978 (+0.20z)| norm 0.1962 (-0.59z)| lr 7.14e-07 | 4172.68 ms | 32.4% bf16 MFU | 122999 tok/s step 19209/19560 | loss 3.271757 (-0.12z)| norm 0.1929 (-0.83z)| lr 7.10e-07 | 4170.47 ms | 32.4% bf16 MFU | 123135 tok/s step 19210/19560 | loss 3.303712 (+0.59z)| norm 0.2137 (+0.78z)| lr 7.06e-07 | 4186.10 ms | 32.3% bf16 MFU | 123240 tok/s step 19211/19560 | loss 3.252322 (-0.58z)| norm 0.2068 (+0.24z)| lr 7.02e-07 | 4203.76 ms | 32.1% bf16 MFU | 123314 tok/s step 19212/19560 | loss 3.293972 (+0.37z)| norm 0.2051 (+0.12z)| lr 6.98e-07 | 4178.34 ms | 32.3% bf16 MFU | 123422 tok/s step 19213/19560 | loss 3.248086 (-0.68z)| norm 0.2054 (+0.14z)| lr 6.94e-07 | 4180.58 ms | 32.3% bf16 MFU | 123522 tok/s step 19214/19560 | loss 3.404810 (+2.80z)| norm 0.2629 (+4.26z)| lr 6.90e-07 | 4175.01 ms | 32.3% bf16 MFU | 123625 tok/s step 19215/19560 | loss 3.246330 (-0.71z)| norm 0.2081 (+0.29z)| lr 6.86e-07 | 4180.38 ms | 32.3% bf16 MFU | 123714 tok/s step 19216/19560 | loss 3.255609 (-0.49z)| norm 0.2013 (-0.20z)| lr 6.82e-07 | 4180.58 ms | 32.3% bf16 MFU | 123799 tok/s step 19217/19560 | loss 3.311913 (+0.83z)| norm 0.2078 (+0.39z)| lr 6.78e-07 | 4180.56 ms | 32.3% bf16 MFU | 123880 tok/s step 
19218/19560 | loss 3.372725 (+2.20z)| norm 0.2099 (+0.56z)| lr 6.74e-07 | 4180.50 ms | 32.3% bf16 MFU | 123956 tok/s step 19219/19560 | loss 3.310680 (+0.76z)| norm 0.1950 (-0.74z)| lr 6.70e-07 | 4178.69 ms | 32.3% bf16 MFU | 124032 tok/s step 19220/19560 | loss 3.268610 (-0.20z)| norm 0.1994 (-0.34z)| lr 6.66e-07 | 4178.75 ms | 32.3% bf16 MFU | 124104 tok/s step 19221/19560 | loss 3.158471 (-2.65z)| norm 0.1993 (-0.36z)| lr 6.62e-07 | 4177.97 ms | 32.3% bf16 MFU | 124173 tok/s step 19222/19560 | loss 3.237751 (-0.86z)| norm 0.2001 (-0.29z)| lr 6.58e-07 | 4182.02 ms | 32.3% bf16 MFU | 124232 tok/s step 19223/19560 | loss 3.235179 (-0.91z)| norm 0.1984 (-0.43z)| lr 6.54e-07 | 4174.29 ms | 32.3% bf16 MFU | 124301 tok/s step 19224/19560 | loss 3.235290 (-0.91z)| norm 0.2012 (-0.19z)| lr 6.51e-07 | 4170.25 ms | 32.4% bf16 MFU | 124372 tok/s step 19225/19560 | loss 3.235795 (-0.89z)| norm 0.1976 (-0.50z)| lr 6.47e-07 | 4170.22 ms | 32.4% bf16 MFU | 124439 tok/s step 19226/19560 | loss 3.267175 (-0.19z)| norm 0.2008 (-0.23z)| lr 6.43e-07 | 4174.73 ms | 32.3% bf16 MFU | 124497 tok/s step 19227/19560 | loss 3.232656 (-0.96z)| norm 0.1984 (-0.43z)| lr 6.39e-07 | 4177.44 ms | 32.3% bf16 MFU | 124547 tok/s step 19228/19560 | loss 3.196501 (-1.75z)| norm 0.2025 (-0.08z)| lr 6.35e-07 | 4173.58 ms | 32.4% bf16 MFU | 124601 tok/s step 19229/19560 | loss 3.214790 (-1.32z)| norm 0.1949 (-0.73z)| lr 6.31e-07 | 4175.53 ms | 32.3% bf16 MFU | 124649 tok/s step 19230/19560 | loss 3.265172 (-0.20z)| norm 0.1999 (-0.29z)| lr 6.28e-07 | 4174.09 ms | 32.3% bf16 MFU | 124697 tok/s step 19231/19560 | loss 3.235280 (-0.85z)| norm 0.1965 (-0.58z)| lr 6.24e-07 | 4160.18 ms | 32.5% bf16 MFU | 124763 tok/s step 19232/19560 | loss 3.284287 (+0.24z)| norm 0.1985 (-0.41z)| lr 6.20e-07 | 4166.20 ms | 32.4% bf16 MFU | 124817 tok/s step 19233/19560 | loss 3.202317 (-1.59z)| norm 0.1984 (-0.42z)| lr 6.16e-07 | 4169.79 ms | 32.4% bf16 MFU | 124863 tok/s step 19234/19560 | loss 3.259207 (-0.31z)| norm 
0.1976 (-0.49z)| lr 6.13e-07 | 4166.10 ms | 32.4% bf16 MFU | 124912 tok/s step 19235/19560 | loss 3.292571 (+0.43z)| norm 0.2000 (-0.27z)| lr 6.09e-07 | 4168.20 ms | 32.4% bf16 MFU | 124956 tok/s step 19236/19560 | loss 3.245655 (-0.60z)| norm 0.1999 (-0.28z)| lr 6.05e-07 | 4159.69 ms | 32.5% bf16 MFU | 125010 tok/s step 19237/19560 | loss 3.291456 (+0.41z)| norm 0.1981 (-0.44z)| lr 6.01e-07 | 4167.45 ms | 32.4% bf16 MFU | 125050 tok/s step 19238/19560 | loss 3.274221 (+0.02z)| norm 0.2002 (-0.26z)| lr 5.98e-07 | 4166.90 ms | 32.4% bf16 MFU | 125088 tok/s step 19239/19560 | loss 3.205435 (-1.50z)| norm 0.1948 (-0.73z)| lr 5.94e-07 | 4165.28 ms | 32.4% bf16 MFU | 125127 tok/s step 19240/19560 | loss 3.193667 (-1.73z)| norm 0.1998 (-0.29z)| lr 5.90e-07 | 4166.35 ms | 32.4% bf16 MFU | 125163 tok/s step 19241/19560 | loss 3.224582 (-1.03z)| norm 0.1965 (-0.58z)| lr 5.87e-07 | 4165.25 ms | 32.4% bf16 MFU | 125198 tok/s step 19242/19560 | loss 3.241131 (-0.65z)| norm 0.1888 (-1.24z)| lr 5.83e-07 | 4169.56 ms | 32.4% bf16 MFU | 125226 tok/s step 19243/19560 | loss 3.279104 (+0.19z)| norm 0.2044 (+0.12z)| lr 5.79e-07 | 4172.29 ms | 32.4% bf16 MFU | 125247 tok/s step 19244/19560 | loss 3.223654 (-1.05z)| norm 0.1940 (-0.77z)| lr 5.76e-07 | 4163.26 ms | 32.4% bf16 MFU | 125282 tok/s step 19245/19560 | loss 3.297172 (+0.65z)| norm 0.2025 (-0.03z)| lr 5.72e-07 | 4164.60 ms | 32.4% bf16 MFU | 125312 tok/s step 19246/19560 | loss 3.308957 (+0.91z)| norm 0.2040 (+0.09z)| lr 5.68e-07 | 4163.27 ms | 32.4% bf16 MFU | 125343 tok/s step 19247/19560 | loss 3.304177 (+0.79z)| norm 0.2025 (-0.04z)| lr 5.65e-07 | 4168.24 ms | 32.4% bf16 MFU | 125365 tok/s step 19248/19560 | loss 3.299746 (+0.69z)| norm 0.2015 (-0.13z)| lr 5.61e-07 | 4166.16 ms | 32.4% bf16 MFU | 125389 tok/s step 19249/19560 | loss 3.290655 (+0.48z)| norm 0.2066 (+0.32z)| lr 5.58e-07 | 4165.07 ms | 32.4% bf16 MFU | 125413 tok/s step 19250/19560 | loss 3.326513 (+1.29z)| norm 0.2078 (+0.42z)| lr 5.54e-07 | 4168.03 ms | 
32.4% bf16 MFU | 125432 tok/s val loss 3.251479 HellaSwag:
3057/10042 = 0.304421 step 19251/19560 | loss 3.250780 (-0.46z)| norm 0.1966 (-0.56z)| lr 5.51e-07 | 4163.17 ms | 32.4% bf16 MFU | 125457 tok/s step 19252/19560 | loss 3.303142 (+0.74z)| norm 0.2116 (+0.74z)| lr 5.47e-07 | 4158.56 ms | 32.5% bf16 MFU | 125488 tok/s step 19253/19560 | loss 3.257089 (-0.34z)| norm 0.1994 (-0.33z)| lr 5.43e-07 | 4164.68 ms | 32.4% bf16 MFU | 125508 tok/s step 19254/19560 | loss 3.251876 (-0.46z)| norm 0.2046 (+0.13z)| lr 5.40e-07 | 4160.32 ms | 32.5% bf16 MFU | 125534 tok/s step 19255/19560 | loss 3.233095 (-0.91z)| norm 0.1973 (-0.50z)| lr 5.36e-07 | 4163.22 ms | 32.4% bf16 MFU | 125554 tok/s step 19256/19560 | loss 3.282494 (+0.25z)| norm 0.2138 (+0.92z)| lr 5.33e-07 | 4163.51 ms | 32.4% bf16 MFU | 125572 tok/s step 19257/19560 | loss 3.254254 (-0.40z)| norm 0.1994 (-0.34z)| lr 5.29e-07 | 4164.82 ms | 32.4% bf16 MFU | 125588 tok/s step 19258/19560 | loss 3.257134 (-0.33z)| norm 0.2003 (-0.26z)| lr 5.26e-07 | 4161.20 ms | 32.4% bf16 MFU | 125608 tok/s step 19259/19560 | loss 3.302429 (+0.79z)| norm 0.1963 (-0.72z)| lr 5.23e-07 | 4162.66 ms | 32.4% bf16 MFU | 125625 tok/s step 19260/19560 | loss 3.284332 (+0.34z)| norm 0.2025 (-0.01z)| lr 5.19e-07 | 4160.29 ms | 32.5% bf16 MFU | 125645 tok/s step 19261/19560 | loss 3.274853 (+0.10z)| norm 0.1929 (-1.11z)| lr 5.16e-07 | 4163.25 ms | 32.4% bf16 MFU | 125660 tok/s step 19262/19560 | loss 3.241945 (-0.70z)| norm 0.2001 (-0.28z)| lr 5.12e-07 | 4162.62 ms | 32.4% bf16 MFU | 125674 tok/s step 19263/19560 | loss 3.219044 (-1.25z)| norm 0.1979 (-0.53z)| lr 5.09e-07 | 4160.42 ms | 32.5% bf16 MFU | 125691 tok/s step 19264/19560 | loss 3.218909 (-1.23z)| norm 0.1962 (-0.72z)| lr 5.05e-07 | 4160.43 ms | 32.5% bf16 MFU | 125708 tok/s step 19265/19560 | loss 3.283570 (+0.34z)| norm 0.2052 (+0.31z)| lr 5.02e-07 | 4162.90 ms | 32.4% bf16 MFU | 125719 tok/s step 19266/19560 | loss 3.254318 (-0.36z)| norm 0.1945 (-0.91z)| lr 4.99e-07 | 4161.34 ms | 32.4% bf16 MFU | 125733 tok/s step 19267/19560 | loss 
3.254831 (-0.35z)| norm 0.1936 (-1.00z)| lr 4.95e-07 | 4159.31 ms | 32.5% bf16 MFU | 125749 tok/s step 19268/19560 | loss 3.279830 (+0.25z)| norm 0.1988 (-0.40z)| lr 4.92e-07 | 4163.00 ms | 32.4% bf16 MFU | 125758 tok/s step 19269/19560 | loss 3.271590 (+0.06z)| norm 0.1989 (-0.38z)| lr 4.88e-07 | 4164.19 ms | 32.4% bf16 MFU | 125766 tok/s step 19270/19560 | loss 3.346580 (+1.86z)| norm 0.2060 (+0.44z)| lr 4.85e-07 | 4162.19 ms | 32.4% bf16 MFU | 125776 tok/s step 19271/19560 | loss 3.321224 (+1.23z)| norm 0.2043 (+0.26z)| lr 4.82e-07 | 4159.35 ms | 32.5% bf16 MFU | 125789 tok/s step 19272/19560 | loss 3.321033 (+1.20z)| norm 0.2094 (+0.84z)| lr 4.79e-07 | 4159.68 ms | 32.5% bf16 MFU | 125802 tok/s step 19273/19560 | loss 3.261559 (-0.21z)| norm 0.2132 (+1.27z)| lr 4.75e-07 | 4164.42 ms | 32.4% bf16 MFU | 125807 tok/s step 19274/19560 | loss 3.228643 (-0.98z)| norm 0.2016 (-0.09z)| lr 4.72e-07 | 4160.65 ms | 32.5% bf16 MFU | 125817 tok/s step 19275/19560 | loss 3.222129 (-1.15z)| norm 0.1984 (-0.44z)| lr 4.69e-07 | 4160.12 ms | 32.5% bf16 MFU | 125827 tok/s step 19276/19560 | loss 3.226990 (-1.02z)| norm 0.1923 (-1.14z)| lr 4.65e-07 | 4162.08 ms | 32.4% bf16 MFU | 125834 tok/s step 19277/19560 | loss 3.275713 (+0.15z)| norm 0.1914 (-1.23z)| lr 4.62e-07 | 4163.49 ms | 32.4% bf16 MFU | 125839 tok/s step 19278/19560 | loss 3.205560 (-1.51z)| norm 0.1969 (-0.60z)| lr 4.59e-07 | 4159.73 ms | 32.5% bf16 MFU | 125849 tok/s step 19279/19560 | loss 3.239766 (-0.70z)| norm 0.2034 (+0.15z)| lr 4.56e-07 | 4162.08 ms | 32.4% bf16 MFU | 125855 tok/s step 19280/19560 | loss 3.305344 (+0.85z)| norm 0.2016 (-0.05z)| lr 4.52e-07 | 4161.67 ms | 32.4% bf16 MFU | 125861 tok/s step 19281/19560 | loss 3.259389 (-0.24z)| norm 0.1990 (-0.35z)| lr 4.49e-07 | 4161.97 ms | 32.4% bf16 MFU | 125867 tok/s step 19282/19560 | loss 3.206452 (-1.47z)| norm 0.1962 (-0.69z)| lr 4.46e-07 | 4158.58 ms | 32.5% bf16 MFU | 125877 tok/s step 19283/19560 | loss 3.238480 (-0.73z)| norm 0.2034 (+0.15z)| lr 
4.43e-07 | 4153.69 ms | 32.5% bf16 MFU | 125894 tok/s step 19284/19560 | loss 3.277901 (+0.20z)| norm 0.2027 (+0.07z)| lr 4.40e-07 | 4155.81 ms | 32.5% bf16 MFU | 125907 tok/s step 19285/19560 | loss 3.261964 (-0.19z)| norm 0.1997 (-0.29z)| lr 4.36e-07 | 4156.81 ms | 32.5% bf16 MFU | 125918 tok/s step 19286/19560 | loss 3.271530 (+0.05z)| norm 0.1970 (-0.59z)| lr 4.33e-07 | 4164.58 ms | 32.4% bf16 MFU | 125917 tok/s step 19287/19560 | loss 3.293213 (+0.56z)| norm 0.1987 (-0.40z)| lr 4.30e-07 | 4155.65 ms | 32.5% bf16 MFU | 125929 tok/s step 19288/19560 | loss 3.316152 (+1.09z)| norm 0.2089 (+0.78z)| lr 4.27e-07 | 4160.18 ms | 32.5% bf16 MFU | 125934 tok/s step 19289/19560 | loss 3.299454 (+0.69z)| norm 0.1991 (-0.36z)| lr 4.24e-07 | 4162.45 ms | 32.4% bf16 MFU | 125935 tok/s step 19290/19560 | loss 3.280769 (+0.25z)| norm 0.2016 (-0.08z)| lr 4.21e-07 | 4160.73 ms | 32.5% bf16 MFU | 125939 tok/s step 19291/19560 | loss 3.276488 (+0.14z)| norm 0.1994 (-0.34z)| lr 4.18e-07 | 4159.37 ms | 32.5% bf16 MFU | 125945 tok/s step 19292/19560 | loss 3.307601 (+0.89z)| norm 0.1967 (-0.66z)| lr 4.15e-07 | 4155.74 ms | 32.5% bf16 MFU | 125955 tok/s step 19293/19560 | loss 3.272356 (+0.05z)| norm 0.1940 (-0.96z)| lr 4.12e-07 | 4157.48 ms | 32.5% bf16 MFU | 125963 tok/s step 19294/19560 | loss 3.280082 (+0.25z)| norm 0.1944 (-0.90z)| lr 4.08e-07 | 4159.65 ms | 32.5% bf16 MFU | 125967 tok/s step 19295/19560 | loss 3.287643 (+0.42z)| norm 0.2284 (+2.92z)| lr 4.05e-07 | 4157.80 ms | 32.5% bf16 MFU | 125973 tok/s step 19296/19560 | loss 3.322048 (+1.24z)| norm 0.1975 (-0.53z)| lr 4.02e-07 | 4158.05 ms | 32.5% bf16 MFU | 125979 tok/s step 19297/19560 | loss 3.288419 (+0.42z)| norm 0.2013 (-0.10z)| lr 3.99e-07 | 4160.40 ms | 32.5% bf16 MFU | 125981 tok/s step 19298/19560 | loss 3.388980 (+2.75z)| norm 0.2457 (+4.47z)| lr 3.96e-07 | 4156.99 ms | 32.5% bf16 MFU | 125988 tok/s step 19299/19560 | loss 3.287392 (+0.37z)| norm 0.1950 (-0.77z)| lr 3.93e-07 | 4163.35 ms | 32.4% bf16 MFU | 125985 
tok/s step 19300/19560 | loss 3.255881 (-0.36z)| norm 0.1995 (-0.30z)| lr 3.90e-07 | 4157.84 ms | 32.5% bf16 MFU | 125991 tok/s step 19301/19560 | loss 3.274372 (+0.08z)| norm 0.2038 (+0.13z)| lr 3.87e-07 | 4159.01 ms | 32.5% bf16 MFU | 125994 tok/s step 19302/19560 | loss 3.288639 (+0.40z)| norm 0.2015 (-0.11z)| lr 3.84e-07 | 4157.22 ms | 32.5% bf16 MFU | 126000 tok/s step 19303/19560 | loss 3.287508 (+0.38z)| norm 0.1934 (-0.94z)| lr 3.81e-07 | 4160.26 ms | 32.5% bf16 MFU | 126001 tok/s step 19304/19560 | loss 3.276888 (+0.13z)| norm 0.1940 (-0.86z)| lr 3.78e-07 | 4159.07 ms | 32.5% bf16 MFU | 126004 tok/s step 19305/19560 | loss 3.267574 (-0.10z)| norm 0.1964 (-0.61z)| lr 3.75e-07 | 4161.26 ms | 32.4% bf16 MFU | 126004 tok/s step 19306/19560 | loss 3.259715 (-0.28z)| norm 0.2066 (+0.45z)| lr 3.73e-07 | 4159.23 ms | 32.5% bf16 MFU | 126006 tok/s step 19307/19560 | loss 3.255671 (-0.36z)| norm 0.1976 (-0.49z)| lr 3.70e-07 | 4157.65 ms | 32.5% bf16 MFU | 126011 tok/s step 19308/19560 | loss 3.265706 (-0.12z)| norm 0.1981 (-0.44z)| lr 3.67e-07 | 4154.12 ms | 32.5% bf16 MFU | 126021 tok/s step 19309/19560 | loss 3.264759 (-0.13z)| norm 0.2021 (+0.01z)| lr 3.64e-07 | 4158.14 ms | 32.5% bf16 MFU | 126024 tok/s step 19310/19560 | loss 3.226094 (-1.07z)| norm 0.2012 (-0.08z)| lr 3.61e-07 | 4159.50 ms | 32.5% bf16 MFU | 126025 tok/s step 19311/19560 | loss 3.348235 (+1.87z)| norm 0.2180 (+1.81z)| lr 3.58e-07 | 4156.19 ms | 32.5% bf16 MFU | 126031 tok/s step 19312/19560 | loss 3.304177 (+0.80z)| norm 0.2020 (+0.01z)| lr 3.55e-07 | 4158.74 ms | 32.5% bf16 MFU | 126033 tok/s step 19313/19560 | loss 3.283285 (+0.28z)| norm 0.1991 (-0.31z)| lr 3.52e-07 | 4158.37 ms | 32.5% bf16 MFU | 126036 tok/s step 19314/19560 | loss 3.220728 (-1.22z)| norm 0.1945 (-0.82z)| lr 3.50e-07 | 4158.50 ms | 32.5% bf16 MFU | 126038 tok/s step 19315/19560 | loss 3.221653 (-1.18z)| norm 0.2016 (-0.02z)| lr 3.47e-07 | 4153.29 ms | 32.5% bf16 MFU | 126047 tok/s step 19316/19560 | loss 3.279960 
(+0.23z)| norm 0.2056 (+0.44z)| lr 3.44e-07 | 4156.68 ms | 32.5% bf16 MFU | 126052 tok/s step 19317/19560 | loss 3.290890 (+0.48z)| norm 0.1976 (-0.47z)| lr 3.41e-07 | 4157.68 ms | 32.5% bf16 MFU | 126054 tok/s step 19318/19560 | loss 3.302529 (+0.75z)| norm 0.2101 (+0.95z)| lr 3.38e-07 | 4161.85 ms | 32.4% bf16 MFU | 126050 tok/s step 19319/19560 | loss 3.251103 (-0.49z)| norm 0.2013 (-0.05z)| lr 3.36e-07 | 4159.77 ms | 32.5% bf16 MFU | 126050 tok/s step 19320/19560 | loss 3.189145 (-1.99z)| norm 0.1971 (-0.52z)| lr 3.33e-07 | 4156.04 ms | 32.5% bf16 MFU | 126055 tok/s step 19321/19560 | loss 3.247101 (-0.57z)| norm 0.1996 (-0.23z)| lr 3.30e-07 | 4161.13 ms | 32.4% bf16 MFU | 126052 tok/s step 19322/19560 | loss 3.300270 (+0.72z)| norm 0.2068 (+0.58z)| lr 3.27e-07 | 4153.24 ms | 32.5% bf16 MFU | 126061 tok/s step 19323/19560 | loss 3.237790 (-0.79z)| norm 0.2010 (-0.08z)| lr 3.25e-07 | 4155.21 ms | 32.5% bf16 MFU | 126067 tok/s step 19324/19560 | loss 3.349030 (+1.87z)| norm 0.2059 (+0.48z)| lr 3.22e-07 | 4157.03 ms | 32.5% bf16 MFU | 126069 tok/s step 19325/19560 | loss 3.264556 (-0.15z)| norm 0.2022 (+0.06z)| lr 3.19e-07 | 4158.68 ms | 32.5% bf16 MFU | 126069 tok/s step 19326/19560 | loss 3.278662 (+0.23z)| norm 0.1936 (-0.91z)| lr 3.16e-07 | 4155.74 ms | 32.5% bf16 MFU | 126074 tok/s step 19327/19560 | loss 3.259376 (-0.26z)| norm 0.1891 (-1.41z)| lr 3.14e-07 | 4153.89 ms | 32.5% bf16 MFU | 126081 tok/s step 19328/19560 | loss 3.304391 (+0.89z)| norm 0.2208 (+2.14z)| lr 3.11e-07 | 4154.52 ms | 32.5% bf16 MFU | 126087 tok/s step 19329/19560 | loss 3.272426 (+0.08z)| norm 0.1991 (-0.29z)| lr 3.08e-07 | 4155.40 ms | 32.5% bf16 MFU | 126091 tok/s step 19330/19560 | loss 3.248380 (-0.53z)| norm 0.2041 (+0.26z)| lr 3.06e-07 | 4156.11 ms | 32.5% bf16 MFU | 126094 tok/s step 19331/19560 | loss 3.258868 (-0.26z)| norm 0.2035 (+0.19z)| lr 3.03e-07 | 4156.70 ms | 32.5% bf16 MFU | 126096 tok/s step 19332/19560 | loss 3.227034 (-1.06z)| norm 0.1969 (-0.54z)| lr 3.00e-07 | 
4158.19 ms | 32.5% bf16 MFU | 126095 tok/s step 19333/19560 | loss 3.271991 (+0.09z)| norm 0.1991 (-0.29z)| lr 2.98e-07 | 4156.87 ms | 32.5% bf16 MFU | 126097 tok/s step 19334/19560 | loss 3.244023 (-0.62z)| norm 0.1965 (-0.58z)| lr 2.95e-07 | 4154.92 ms | 32.5% bf16 MFU | 126101 tok/s step 19335/19560 | loss 3.244142 (-0.62z)| norm 0.2004 (-0.13z)| lr 2.93e-07 | 4157.77 ms | 32.5% bf16 MFU | 126101 tok/s step 19336/19560 | loss 3.295288 (+0.69z)| norm 0.2058 (+0.46z)| lr 2.90e-07 | 4154.56 ms | 32.5% bf16 MFU | 126106 tok/s step 19337/19560 | loss 3.236009 (-0.82z)| norm 0.1991 (-0.30z)| lr 2.87e-07 | 4152.04 ms | 32.5% bf16 MFU | 126114 tok/s step 19338/19560 | loss 3.280220 (+0.31z)| norm 0.1942 (-0.83z)| lr 2.85e-07 | 4157.08 ms | 32.5% bf16 MFU | 126114 tok/s step 19339/19560 | loss 3.449169 (+4.26z)| norm 0.2160 (+1.60z)| lr 2.82e-07 | 4158.79 ms | 32.5% bf16 MFU | 126112 tok/s step 19340/19560 | loss 3.248514 (-0.49z)| norm 0.1966 (-0.55z)| lr 2.80e-07 | 4159.11 ms | 32.5% bf16 MFU | 126109 tok/s step 19341/19560 | loss 3.309637 (+0.95z)| norm 0.1982 (-0.38z)| lr 2.77e-07 | 4158.06 ms | 32.5% bf16 MFU | 126108 tok/s step 19342/19560 | loss 3.235404 (-0.81z)| norm 0.1990 (-0.28z)| lr 2.75e-07 | 4153.08 ms | 32.5% bf16 MFU | 126115 tok/s step 19343/19560 | loss 3.288011 (+0.48z)| norm 0.1960 (-0.70z)| lr 2.72e-07 | 4156.60 ms | 32.5% bf16 MFU | 126116 tok/s step 19344/19560 | loss 3.224267 (-1.08z)| norm 0.1981 (-0.40z)| lr 2.70e-07 | 4156.01 ms | 32.5% bf16 MFU | 126118 tok/s step 19345/19560 | loss 3.214643 (-1.30z)| norm 0.1983 (-0.36z)| lr 2.67e-07 | 4153.02 ms | 32.5% bf16 MFU | 126124 tok/s step 19346/19560 | loss 3.304057 (+0.93z)| norm 0.1937 (-0.99z)| lr 2.65e-07 | 4153.85 ms | 32.5% bf16 MFU | 126129 tok/s step 19347/19560 | loss 3.221477 (-1.13z)| norm 0.2038 (+0.44z)| lr 2.62e-07 | 4156.55 ms | 32.5% bf16 MFU | 126129 tok/s step 19348/19560 | loss 3.286289 (+0.49z)| norm 0.2031 (+0.32z)| lr 2.60e-07 | 4156.16 ms | 32.5% bf16 MFU | 126130 tok/s step 
19349/19560 | loss 3.291148 (+0.61z)| norm 0.1954 (-0.76z)| lr 2.58e-07 | 4155.16 ms | 32.5% bf16 MFU | 126132 tok/s step 19350/19560 | loss 3.292585 (+0.63z)| norm 0.2085 (+1.08z)| lr 2.55e-07 | 4151.35 ms | 32.5% bf16 MFU | 126140 tok/s step 19351/19560 | loss 3.242695 (-0.66z)| norm 0.1994 (-0.21z)| lr 2.53e-07 | 4154.97 ms | 32.5% bf16 MFU | 126142 tok/s step 19352/19560 | loss 3.242522 (-0.66z)| norm 0.1940 (-0.96z)| lr 2.50e-07 | 4157.55 ms | 32.5% bf16 MFU | 126141 tok/s step 19353/19560 | loss 3.249461 (-0.49z)| norm 0.1967 (-0.58z)| lr 2.48e-07 | 4151.80 ms | 32.5% bf16 MFU | 126148 tok/s step 19354/19560 | loss 3.233356 (-0.89z)| norm 0.1962 (-0.64z)| lr 2.46e-07 | 4155.39 ms | 32.5% bf16 MFU | 126149 tok/s step 19355/19560 | loss 3.315926 (+1.22z)| norm 0.2068 (+0.83z)| lr 2.43e-07 | 4153.48 ms | 32.5% bf16 MFU | 126153 tok/s step 19356/19560 | loss 3.248557 (-0.53z)| norm 0.1999 (-0.12z)| lr 2.41e-07 | 4150.26 ms | 32.5% bf16 MFU | 126161 tok/s step 19357/19560 | loss 3.374257 (+2.65z)| norm 0.2240 (+3.09z)| lr 2.38e-07 | 4151.02 ms | 32.5% bf16 MFU | 126168 tok/s step 19358/19560 | loss 3.289013 (+0.47z)| norm 0.2026 (+0.20z)| lr 2.36e-07 | 4151.07 ms | 32.5% bf16 MFU | 126175 tok/s step 19359/19560 | loss 3.242010 (-0.73z)| norm 0.1963 (-0.65z)| lr 2.34e-07 | 4153.87 ms | 32.5% bf16 MFU | 126177 tok/s step 19360/19560 | loss 3.287897 (+0.44z)| norm 0.1943 (-0.91z)| lr 2.31e-07 | 4152.56 ms | 32.5% bf16 MFU | 126181 tok/s step 19361/19560 | loss 3.256489 (-0.37z)| norm 0.2115 (+1.38z)| lr 2.29e-07 | 4154.09 ms | 32.5% bf16 MFU | 126183 tok/s step 19362/19560 | loss 3.332851 (+1.57z)| norm 0.2136 (+1.64z)| lr 2.27e-07 | 4151.53 ms | 32.5% bf16 MFU | 126188 tok/s step 19363/19560 | loss 3.232998 (-0.97z)| norm 0.2023 (+0.14z)| lr 2.25e-07 | 4147.51 ms | 32.6% bf16 MFU | 126199 tok/s step 19364/19560 | loss 3.208107 (-1.59z)| norm 0.1924 (-1.16z)| lr 2.22e-07 | 4155.83 ms | 32.5% bf16 MFU | 126197 tok/s step 19365/19560 | loss 3.309747 (+0.98z)| norm 
0.2076 (+0.83z)| lr 2.20e-07 | 4386.78 ms | 30.8% bf16 MFU | 125863 tok/s step 19366/19560 | loss 3.277833 (+0.17z)| norm 0.2063 (+0.65z)| lr 2.18e-07 | 5505.78 ms | 24.5% bf16 MFU | 124331 tok/s step 19367/19560 | loss 3.263083 (-0.21z)| norm 0.1985 (-0.37z)| lr 2.16e-07 | 5726.35 ms | 23.6% bf16 MFU | 122692 tok/s step 19368/19560 | loss 3.275970 (+0.10z)| norm 0.2011 (-0.04z)| lr 2.13e-07 | 4628.54 ms | 29.2% bf16 MFU | 122221 tok/s step 19369/19560 | loss 3.340914 (+1.75z)| norm 0.2127 (+1.46z)| lr 2.11e-07 | 4622.61 ms | 29.2% bf16 MFU | 121781 tok/s step 19370/19560 | loss 3.339468 (+1.68z)| norm 0.2391 (+4.51z)| lr 2.09e-07 | 4459.56 ms | 30.3% bf16 MFU | 121570 tok/s step 19371/19560 | loss 3.325557 (+1.31z)| norm 0.2027 (+0.10z)| lr 2.07e-07 | 4575.96 ms | 29.5% bf16 MFU | 121221 tok/s step 19372/19560 | loss 3.276397 (+0.05z)| norm 0.2042 (+0.28z)| lr 2.05e-07 | 4377.60 ms | 30.8% bf16 MFU | 121148 tok/s step 19373/19560 | loss 3.274839 (+0.01z)| norm 0.1927 (-1.11z)| lr 2.03e-07 | 4308.77 ms | 31.3% bf16 MFU | 121174 tok/s step 19374/19560 | loss 3.274141 (+0.00z)| norm 0.1978 (-0.48z)| lr 2.00e-07 | 5183.61 ms | 26.0% bf16 MFU | 120173 tok/s step 19375/19560 | loss 3.315973 (+1.07z)| norm 0.1975 (-0.52z)| lr 1.98e-07 | 4748.97 ms | 28.4% bf16 MFU | 119684 tok/s step 19376/19560 | loss 3.284685 (+0.27z)| norm 0.2052 (+0.41z)| lr 1.96e-07 | 4445.55 ms | 30.4% bf16 MFU | 119597 tok/s step 19377/19560 | loss 3.256407 (-0.44z)| norm 0.1953 (-0.78z)| lr 1.94e-07 | 4153.40 ms | 32.5% bf16 MFU | 119928 tok/s step 19378/19560 | loss 3.296430 (+0.59z)| norm 0.1963 (-0.64z)| lr 1.92e-07 | 4190.40 ms | 32.2% bf16 MFU | 120188 tok/s step 19379/19560 | loss 3.343399 (+1.76z)| norm 0.1999 (-0.21z)| lr 1.90e-07 | 4383.37 ms | 30.8% bf16 MFU | 120159 tok/s step 19380/19560 | loss 3.263801 (-0.26z)| norm 0.2052 (+0.44z)| lr 1.88e-07 | 4180.33 ms | 32.3% bf16 MFU | 120422 tok/s step 19381/19560 | loss 3.251701 (-0.57z)| norm 0.2012 (-0.05z)| lr 1.86e-07 | 4168.66 ms | 
32.4% bf16 MFU | 120689 tok/s step 19382/19560 | loss 3.284979 (+0.28z)| norm 0.1991 (-0.30z)| lr 1.84e-07 | 4229.16 ms | 31.9% bf16 MFU | 120853 tok/s step 19383/19560 | loss 3.278527 (+0.10z)| norm 0.2165 (+1.77z)| lr 1.82e-07 | 4343.23 ms | 31.1% bf16 MFU | 120846 tok/s step 19384/19560 | loss 3.283162 (+0.22z)| norm 0.1945 (-0.85z)| lr 1.80e-07 | 4267.43 ms | 31.6% bf16 MFU | 120947 tok/s step 19385/19560 | loss 3.321876 (+1.20z)| norm 0.2123 (+1.28z)| lr 1.77e-07 | 4179.54 ms | 32.3% bf16 MFU | 121172 tok/s step 19386/19560 | loss 3.210130 (-1.63z)| norm 0.1971 (-0.55z)| lr 1.75e-07 | 4178.09 ms | 32.3% bf16 MFU | 121387 tok/s step 19387/19560 | loss 3.257458 (-0.43z)| norm 0.1899 (-1.40z)| lr 1.73e-07 | 4181.16 ms | 32.3% bf16 MFU | 121588 tok/s step 19388/19560 | loss 3.305831 (+0.79z)| norm 0.1949 (-0.79z)| lr 1.71e-07 | 4190.77 ms | 32.2% bf16 MFU | 121763 tok/s step 19389/19560 | loss 3.282254 (+0.19z)| norm 0.1989 (-0.32z)| lr 1.70e-07 | 4386.17 ms | 30.8% bf16 MFU | 121652 tok/s step 19390/19560 | loss 3.248271 (-0.67z)| norm 0.1968 (-0.56z)| lr 1.68e-07 | 4653.16 ms | 29.0% bf16 MFU | 121203 tok/s step 19391/19560 | loss 3.312492 (+0.94z)| norm 0.1975 (-0.48z)| lr 1.66e-07 | 4693.06 ms | 28.8% bf16 MFU | 120729 tok/s step 19392/19560 | loss 3.293673 (+0.45z)| norm 0.2167 (+1.76z)| lr 1.64e-07 | 4651.59 ms | 29.0% bf16 MFU | 120328 tok/s step 19393/19560 | loss 3.325271 (+1.25z)| norm 0.1933 (-0.98z)| lr 1.62e-07 | 4166.57 ms | 32.4% bf16 MFU | 120603 tok/s step 19394/19560 | loss 3.314104 (+0.95z)| norm 0.1993 (-0.28z)| lr 1.60e-07 | 4198.89 ms | 32.2% bf16 MFU | 120816 tok/s step 19395/19560 | loss 3.327080 (+1.26z)| norm 0.1970 (-0.56z)| lr 1.58e-07 | 4242.82 ms | 31.8% bf16 MFU | 120954 tok/s step 19396/19560 | loss 3.324378 (+1.17z)| norm 0.2010 (-0.08z)| lr 1.56e-07 | 4172.53 ms | 32.4% bf16 MFU | 121189 tok/s step 19397/19560 | loss 3.310868 (+0.82z)| norm 0.1990 (-0.32z)| lr 1.54e-07 | 4194.92 ms | 32.2% bf16 MFU | 121378 tok/s step 19398/19560 
| loss 3.249155 (-0.71z)| norm 0.1985 (-0.37z)| lr 1.52e-07 | 4185.64 ms | 32.3% bf16 MFU | 121572 tok/s step 19399/19560 | loss 3.290242 (+0.34z)| norm 0.1986 (-0.36z)| lr 1.50e-07 | 4341.99 ms | 31.1% bf16 MFU | 121531 tok/s step 19400/19560 | loss 3.318512 (+1.06z)| norm 0.1981 (-0.41z)| lr 1.49e-07 | 4173.76 ms | 32.3% bf16 MFU | 121735 tok/s step 19401/19560 | loss 3.325892 (+1.23z)| norm 0.1976 (-0.45z)| lr 1.47e-07 | 4171.94 ms | 32.4% bf16 MFU | 121932 tok/s step 19402/19560 | loss 3.321137 (+1.09z)| norm 0.2022 (+0.09z)| lr 1.45e-07 | 4197.43 ms | 32.2% bf16 MFU | 122081 tok/s step 19403/19560 | loss 3.275792 (-0.07z)| norm 0.1993 (-0.25z)| lr 1.43e-07 | 4194.51 ms | 32.2% bf16 MFU | 122226 tok/s step 19404/19560 | loss 3.287326 (+0.21z)| norm 0.2026 (+0.14z)| lr 1.41e-07 | 4178.42 ms | 32.3% bf16 MFU | 122389 tok/s step 19405/19560 | loss 3.351490 (+1.82z)| norm 0.4608 (+10.59z)| lr 1.39e-07 | 4170.43 ms | 32.4% bf16 MFU | 122555 tok/s step 19406/19560 | loss 3.262940 (-0.44z)| norm 0.1992 (-0.18z)| lr 1.38e-07 | 4177.82 ms | 32.3% bf16 MFU | 122702 tok/s step 19407/19560 | loss 3.275795 (-0.12z)| norm 0.2018 (-0.08z)| lr 1.36e-07 | 4321.81 ms | 31.2% bf16 MFU | 122633 tok/s step 19408/19560 | loss 3.302809 (+0.58z)| norm 0.2000 (-0.15z)| lr 1.34e-07 | 4195.10 ms | 32.2% bf16 MFU | 122750 tok/s step 19409/19560 | loss 3.367616 (+2.19z)| norm 0.3404 (+5.01z)| lr 1.32e-07 | 4521.33 ms | 29.9% bf16 MFU | 122410 tok/s step 19410/19560 | loss 3.334047 (+1.32z)| norm 0.2074 (+0.10z)| lr 1.31e-07 | 4189.72 ms | 32.2% bf16 MFU | 122547 tok/s step 19411/19560 | loss 3.316556 (+0.86z)| norm 0.2053 (+0.02z)| lr 1.29e-07 | 4185.33 ms | 32.3% bf16 MFU | 122683 tok/s step 19412/19560 | loss 3.306039 (+0.59z)| norm 0.2023 (-0.09z)| lr 1.27e-07 | 4172.02 ms | 32.4% bf16 MFU | 122832 tok/s step 19413/19560 | loss 3.222103 (-1.54z)| norm 0.2011 (-0.14z)| lr 1.26e-07 | 4306.85 ms | 31.3% bf16 MFU | 122777 tok/s step 19414/19560 | loss 3.262960 (-0.50z)| norm 0.1970 
(-0.29z)| lr 1.24e-07 | 4398.57 ms | 30.7% bf16 MFU | 122598 tok/s step 19415/19560 | loss 3.286662 (+0.10z)| norm 0.2001 (-0.18z)| lr 1.22e-07 | 4202.76 ms | 32.1% bf16 MFU | 122705 tok/s step 19416/19560 | loss 3.274908 (-0.19z)| norm 0.2031 (-0.06z)| lr 1.20e-07 | 4179.90 ms | 32.3% bf16 MFU | 122842 tok/s step 19417/19560 | loss 3.351985 (+1.74z)| norm 0.2083 (+0.13z)| lr 1.19e-07 | 4224.11 ms | 32.0% bf16 MFU | 122905 tok/s step 19418/19560 | loss 3.333238 (+1.25z)| norm 0.2087 (+0.14z)| lr 1.17e-07 | 4173.59 ms | 32.4% bf16 MFU | 123041 tok/s step 19419/19560 | loss 3.264769 (-0.45z)| norm 0.1996 (-0.20z)| lr 1.16e-07 | 4196.47 ms | 32.2% bf16 MFU | 123136 tok/s step 19420/19560 | loss 3.334640 (+1.28z)| norm 0.2027 (-0.08z)| lr 1.14e-07 | 4184.65 ms | 32.3% bf16 MFU | 123244 tok/s step 19421/19560 | loss 3.341392 (+1.42z)| norm 0.2026 (-0.09z)| lr 1.12e-07 | 4197.22 ms | 32.2% bf16 MFU | 123327 tok/s step 19422/19560 | loss 3.319215 (+0.86z)| norm 0.2062 (+0.04z)| lr 1.11e-07 | 4195.92 ms | 32.2% bf16 MFU | 123408 tok/s step 19423/19560 | loss 3.258245 (-0.63z)| norm 0.2013 (-0.13z)| lr 1.09e-07 | 4191.57 ms | 32.2% bf16 MFU | 123492 tok/s step 19424/19560 | loss 3.280291 (-0.08z)| norm 0.1984 (-0.24z)| lr 1.08e-07 | 4203.26 ms | 32.1% bf16 MFU | 123554 tok/s step 19425/19560 | loss 3.281698 (-0.04z)| norm 0.1963 (-0.32z)| lr 1.06e-07 | 4191.49 ms | 32.2% bf16 MFU | 123631 tok/s step 19426/19560 | loss 3.283552 (+0.02z)| norm 0.2077 (+0.12z)| lr 1.04e-07 | 4198.37 ms | 32.2% bf16 MFU | 123693 tok/s step 19427/19560 | loss 3.281352 (-0.03z)| norm 0.2172 (+0.46z)| lr 1.03e-07 | 4172.46 ms | 32.4% bf16 MFU | 123791 tok/s step 19428/19560 | loss 3.322731 (+1.00z)| norm 0.2015 (-0.12z)| lr 1.01e-07 | 4188.95 ms | 32.2% bf16 MFU | 123859 tok/s step 19429/19560 | loss 3.288525 (+0.13z)| norm 0.1990 (-0.22z)| lr 9.98e-08 | 4180.82 ms | 32.3% bf16 MFU | 123937 tok/s step 19430/19560 | loss 3.381315 (+2.40z)| norm 0.2048 (+0.00z)| lr 9.83e-08 | 4180.06 ms | 32.3% bf16 
MFU | 124011 tok/s step 19431/19560 | loss 3.289831 (+0.15z)| norm 0.1973 (-0.28z)| lr 9.68e-08 | 4351.04 ms | 31.0% bf16 MFU | 123835 tok/s step 19432/19560 | loss 3.344200 (+1.46z)| norm 0.2092 (+0.16z)| lr 9.53e-08 | 4187.14 ms | 32.2% bf16 MFU | 123904 tok/s step 19433/19560 | loss 3.306895 (+0.54z)| norm 0.1973 (-0.28z)| lr 9.39e-08 | 4208.78 ms | 32.1% bf16 MFU | 123938 tok/s step 19434/19560 | loss 3.271060 (-0.34z)| norm 0.1993 (-0.21z)| lr 9.24e-08 | 4288.67 ms | 31.5% bf16 MFU | 123853 tok/s step 19435/19560 | loss 3.220320 (-1.56z)| norm 0.1989 (-0.22z)| lr 9.10e-08 | 4491.33 ms | 30.1% bf16 MFU | 123497 tok/s step 19436/19560 | loss 3.344118 (+1.42z)| norm 0.2070 (+0.08z)| lr 8.95e-08 | 4179.35 ms | 32.3% bf16 MFU | 123595 tok/s step 19437/19560 | loss 3.277836 (-0.18z)| norm 0.1994 (-0.20z)| lr 8.81e-08 | 4517.56 ms | 29.9% bf16 MFU | 123218 tok/s step 19438/19560 | loss 3.307659 (+0.53z)| norm 0.2000 (-0.18z)| lr 8.67e-08 | 4164.87 ms | 32.4% bf16 MFU | 123351 tok/s step 19439/19560 | loss 3.289431 (+0.10z)| norm 0.2002 (-0.17z)| lr 8.53e-08 | 4510.61 ms | 29.9% bf16 MFU | 122995 tok/s step 19440/19560 | loss 3.372551 (+2.09z)| norm 0.2107 (+0.22z)| lr 8.39e-08 | 4185.53 ms | 32.3% bf16 MFU | 123109 tok/s step 19441/19560 | loss 3.267883 (-0.43z)| norm 0.2026 (-0.08z)| lr 8.25e-08 | 4179.74 ms | 32.3% bf16 MFU | 123225 tok/s step 19442/19560 | loss 3.353812 (+1.61z)| norm 0.2028 (-0.08z)| lr 8.11e-08 | 4487.42 ms | 30.1% bf16 MFU | 122905 tok/s step 19443/19560 | loss 3.279745 (-0.19z)| norm 0.2018 (-0.12z)| lr 7.98e-08 | 4170.99 ms | 32.4% bf16 MFU | 123045 tok/s step 19444/19560 | loss 3.299469 (+0.29z)| norm 0.2465 (+1.53z)| lr 7.84e-08 | 4430.09 ms | 30.5% bf16 MFU | 122810 tok/s step 19445/19560 | loss 3.341562 (+1.29z)| norm 0.2038 (-0.06z)| lr 7.71e-08 | 4180.31 ms | 32.3% bf16 MFU | 122941 tok/s step 19446/19560 | loss 3.394304 (+2.49z)| norm 0.2001 (-0.19z)| lr 7.58e-08 | 4183.07 ms | 32.3% bf16 MFU | 123060 tok/s step 19447/19560 | loss 
3.272473 (-0.39z)| norm 0.1962 (-0.33z)| lr 7.45e-08 | 4182.95 ms | 32.3% bf16 MFU | 123174 tok/s step 19448/19560 | loss 3.281609 (-0.19z)| norm 0.1967 (-0.31z)| lr 7.32e-08 | 4160.58 ms | 32.5% bf16 MFU | 123316 tok/s step 19449/19560 | loss 3.263649 (-0.63z)| norm 0.2224 (+0.63z)| lr 7.19e-08 | 4477.39 ms | 30.2% bf16 MFU | 123005 tok/s step 19450/19560 | loss 3.271200 (-0.44z)| norm 0.1926 (-0.47z)| lr 7.06e-08 | 4173.12 ms | 32.4% bf16 MFU | 123137 tok/s step 19451/19560 | loss 3.360778 (+1.69z)| norm 0.2240 (+0.69z)| lr 6.93e-08 | 4197.88 ms | 32.2% bf16 MFU | 123225 tok/s step 19452/19560 | loss 3.235656 (-1.30z)| norm 0.2025 (-0.11z)| lr 6.81e-08 | 4269.38 ms | 31.6% bf16 MFU | 123203 tok/s step 19453/19560 | loss 3.227011 (-1.49z)| norm 0.2116 (+0.23z)| lr 6.68e-08 | 4208.17 ms | 32.1% bf16 MFU | 123273 tok/s step 19454/19560 | loss 3.231174 (-1.37z)| norm 0.1972 (-0.30z)| lr 6.56e-08 | 4178.60 ms | 32.3% bf16 MFU | 123382 tok/s step 19455/19560 | loss 3.291061 (+0.05z)| norm 0.1951 (-0.38z)| lr 6.44e-08 | 4189.60 ms | 32.2% bf16 MFU | 123470 tok/s step 19456/19560 | loss 3.264053 (-0.59z)| norm 0.1969 (-0.31z)| lr 6.32e-08 | 4194.07 ms | 32.2% bf16 MFU | 123547 tok/s step 19457/19560 | loss 3.267136 (-0.51z)| norm 0.1942 (-0.41z)| lr 6.20e-08 | 4191.06 ms | 32.2% bf16 MFU | 123625 tok/s step 19458/19560 | loss 3.403416 (+2.63z)| norm 0.2134 (+0.30z)| lr 6.08e-08 | 4171.53 ms | 32.4% bf16 MFU | 123728 tok/s step 19459/19560 | loss 3.229235 (-1.40z)| norm 0.1947 (-0.39z)| lr 5.96e-08 | 4378.00 ms | 30.8% bf16 MFU | 123529 tok/s step 19460/19560 | loss 3.263230 (-0.62z)| norm 0.2034 (-0.07z)| lr 5.85e-08 | 4200.18 ms | 32.1% bf16 MFU | 123594 tok/s step 19461/19560 | loss 3.310810 (+0.48z)| norm 0.2006 (-0.18z)| lr 5.73e-08 | 4571.42 ms | 29.5% bf16 MFU | 123148 tok/s step 19462/19560 | loss 3.334538 (+1.01z)| norm 0.2044 (-0.04z)| lr 5.62e-08 | 4177.23 ms | 32.3% bf16 MFU | 123267 tok/s step 19463/19560 | loss 3.254762 (-0.85z)| norm 0.2017 (-0.14z)| lr 
5.50e-08 | 4171.96 ms | 32.4% bf16 MFU | 123387 tok/s step 19464/19560 | loss 3.286955 (-0.09z)| norm 0.2056 (+0.01z)| lr 5.39e-08 | 4230.07 ms | 31.9% bf16 MFU | 123415 tok/s step 19465/19560 | loss 3.377913 (+1.98z)| norm 0.2845 (+2.81z)| lr 5.28e-08 | 4156.33 ms | 32.5% bf16 MFU | 123551 tok/s step 19466/19560 | loss 3.398158 (+2.37z)| norm 0.2163 (+0.36z)| lr 5.17e-08 | 4192.01 ms | 32.2% bf16 MFU | 123627 tok/s step 19467/19560 | loss 3.323476 (+0.74z)| norm 0.2295 (+0.83z)| lr 5.06e-08 | 4153.49 ms | 32.5% bf16 MFU | 123757 tok/s step 19468/19560 | loss 3.287289 (-0.12z)| norm 0.1961 (-0.37z)| lr 4.96e-08 | 4792.06 ms | 28.2% bf16 MFU | 123039 tok/s step 19469/19560 | loss 3.380841 (+2.06z)| norm 0.1918 (-0.52z)| lr 4.85e-08 | 4186.87 ms | 32.2% bf16 MFU | 123149 tok/s step 19470/19560 | loss 3.315835 (+0.52z)| norm 0.2000 (-0.23z)| lr 4.74e-08 | 4167.66 ms | 32.4% bf16 MFU | 123281 tok/s step 19471/19560 | loss 3.294960 (+0.03z)| norm 0.1935 (-0.46z)| lr 4.64e-08 | 4170.04 ms | 32.4% bf16 MFU | 123403 tok/s step 19472/19560 | loss 3.295490 (+0.03z)| norm 0.2108 (+0.16z)| lr 4.54e-08 | 4544.71 ms | 29.7% bf16 MFU | 123001 tok/s step 19473/19560 | loss 3.325449 (+0.73z)| norm 0.1963 (-0.36z)| lr 4.44e-08 | 4372.44 ms | 30.9% bf16 MFU | 122847 tok/s step 19474/19560 | loss 3.300137 (+0.12z)| norm 0.2116 (+0.18z)| lr 4.34e-08 | 4421.95 ms | 30.5% bf16 MFU | 122633 tok/s step 19475/19560 | loss 3.283693 (-0.29z)| norm 0.1964 (-0.36z)| lr 4.24e-08 | 4162.36 ms | 32.4% bf16 MFU | 122799 tok/s step 19476/19560 | loss 3.268867 (-0.65z)| norm 0.2037 (-0.10z)| lr 4.14e-08 | 4295.26 ms | 31.4% bf16 MFU | 122762 tok/s step 19477/19560 | loss 3.283875 (-0.28z)| norm 0.1981 (-0.30z)| lr 4.04e-08 | 4181.42 ms | 32.3% bf16 MFU | 122893 tok/s step 19478/19560 | loss 3.309421 (+0.34z)| norm 0.2023 (-0.15z)| lr 3.95e-08 | 4220.91 ms | 32.0% bf16 MFU | 122959 tok/s step 19479/19560 | loss 3.314240 (+0.45z)| norm 0.1997 (-0.24z)| lr 3.85e-08 | 4195.64 ms | 32.2% bf16 MFU | 123059 
tok/s step 19480/19560 | loss 3.242842 (-1.30z)| norm 0.1929 (-0.49z)| lr 3.76e-08 | 4173.33 ms | 32.4% bf16 MFU | 123188 tok/s step 19481/19560 | loss 3.287680 (-0.21z)| norm 0.2063 (-0.01z)| lr 3.67e-08 | 4433.24 ms | 30.5% bf16 MFU | 122941 tok/s step 19482/19560 | loss 3.248793 (-1.18z)| norm 0.2083 (+0.06z)| lr 3.58e-08 | 4156.41 ms | 32.5% bf16 MFU | 123101 tok/s step 19483/19560 | loss 3.256804 (-0.96z)| norm 0.1982 (-0.30z)| lr 3.49e-08 | 4262.28 ms | 31.7% bf16 MFU | 123097 tok/s step 19484/19560 | loss 3.296511 (+0.00z)| norm 0.2062 (-0.02z)| lr 3.40e-08 | 4168.85 ms | 32.4% bf16 MFU | 123230 tok/s step 19485/19560 | loss 3.285469 (-0.26z)| norm 0.2052 (-0.04z)| lr 3.31e-08 | 4162.40 ms | 32.4% bf16 MFU | 123366 tok/s step 19486/19560 | loss 3.284856 (-0.27z)| norm 0.1931 (-0.48z)| lr 3.22e-08 | 4185.93 ms | 32.3% bf16 MFU | 123460 tok/s step 19487/19560 | loss 3.309377 (+0.33z)| norm 0.1933 (-0.47z)| lr 3.14e-08 | 4175.42 ms | 32.3% bf16 MFU | 123566 tok/s step 19488/19560 | loss 3.278198 (-0.45z)| norm 0.2050 (-0.05z)| lr 3.05e-08 | 4811.48 ms | 28.1% bf16 MFU | 122836 tok/s step 19489/19560 | loss 3.276966 (-0.49z)| norm 0.1941 (-0.44z)| lr 2.97e-08 | 4240.23 ms | 31.8% bf16 MFU | 122876 tok/s step 19490/19560 | loss 3.234763 (-1.53z)| norm 0.2042 (-0.07z)| lr 2.89e-08 | 4158.25 ms | 32.5% bf16 MFU | 123037 tok/s step 19491/19560 | loss 3.215635 (-2.00z)| norm 0.2256 (+0.69z)| lr 2.81e-08 | 4178.37 ms | 32.3% bf16 MFU | 123159 tok/s step 19492/19560 | loss 3.261755 (-0.87z)| norm 0.1969 (-0.34z)| lr 2.73e-08 | 4219.59 ms | 32.0% bf16 MFU | 123213 tok/s step 19493/19560 | loss 3.271051 (-0.62z)| norm 0.1931 (-0.47z)| lr 2.65e-08 | 4173.61 ms | 32.4% bf16 MFU | 123334 tok/s step 19494/19560 | loss 3.358530 (+1.57z)| norm 0.2020 (-0.16z)| lr 2.57e-08 | 4178.12 ms | 32.3% bf16 MFU | 123441 tok/s step 19495/19560 | loss 3.279254 (-0.43z)| norm 0.1976 (-0.31z)| lr 2.50e-08 | 4182.24 ms | 32.3% bf16 MFU | 123537 tok/s step 19496/19560 | loss 3.272172 
(-0.61z)| norm 0.2032 (-0.11z)| lr 2.42e-08 | 4167.56 ms | 32.4% bf16 MFU | 123650 tok/s step 19497/19560 | loss 3.321355 (+0.64z)| norm 0.2266 (+0.72z)| lr 2.35e-08 | 4163.13 ms | 32.4% bf16 MFU | 123765 tok/s step 19498/19560 | loss 3.262769 (-0.83z)| norm 0.1972 (-0.32z)| lr 2.27e-08 | 4400.24 ms | 30.7% bf16 MFU | 123534 tok/s step 19499/19560 | loss 3.319058 (+0.60z)| norm 0.1984 (-0.27z)| lr 2.20e-08 | 4170.72 ms | 32.4% bf16 MFU | 123643 tok/s step 19500/19560 | loss 3.210598 (-2.10z)| norm 0.2001 (-0.21z)| lr 2.13e-08 | 4166.80 ms | 32.4% bf16 MFU | 123752 tok/s val loss 3.251232 evaluating
HellaSwag: 3062/10042 = 0.304919 step 19501/19560 | loss 3.313899 (+0.47z)| norm 0.1970 (-0.33z)| lr 2.06e-08 | 4356.66 ms | 31.0% bf16 MFU | 123581 tok/s step 19502/19560 | loss 3.317192 (+0.54z)| norm 0.1979 (-0.29z)| lr 2.00e-08 | 4269.98 ms | 31.6% bf16 MFU | 123541 tok/s step 19503/19560 | loss 3.304688 (+0.23z)| norm 0.1955 (-0.38z)| lr 1.93e-08 | 4643.69 ms | 29.1% bf16 MFU | 123009 tok/s step 19504/19560 | loss 3.329735 (+0.85z)| norm 0.2000 (-0.22z)| lr 1.86e-08 | 4277.02 ms | 31.6% bf16 MFU | 122988 tok/s step 19505/19560 | loss 3.335673 (+0.98z)| norm 0.1955 (-0.38z)| lr 1.80e-08 | 4384.55 ms | 30.8% bf16 MFU | 122817 tok/s step 19506/19560 | loss 3.335935 (+0.98z)| norm 0.2111 (+0.18z)| lr 1.73e-08 | 7144.81 ms | 18.9% bf16 MFU | 120346 tok/s step 19507/19560 | loss 3.199407 (-2.35z)| norm 0.2067 (+0.02z)| lr 1.67e-08 | 4163.43 ms | 32.4% bf16 MFU | 120625 tok/s step 19508/19560 | loss 3.319319 (+0.57z)| norm 0.2036 (-0.09z)| lr 1.61e-08 | 4234.55 ms | 31.9% bf16 MFU | 120784 tok/s step 19509/19560 | loss 3.288441 (-0.19z)| norm 0.1972 (-0.32z)| lr 1.55e-08 | 4162.42 ms | 32.4% bf16 MFU | 121043 tok/s step 19510/19560 | loss 3.255720 (-0.99z)| norm 0.2216 (+0.55z)| lr 1.49e-08 | 4352.98 ms | 31.0% bf16 MFU | 121013 tok/s step 19511/19560 | loss 3.340640 (+1.08z)| norm 0.1978 (-0.30z)| lr 1.43e-08 | 4376.96 ms | 30.8% bf16 MFU | 120951 tok/s step 19512/19560 | loss 3.297168 (+0.01z)| norm 0.2057 (-0.02z)| lr 1.38e-08 | 4176.18 ms |
32.3% bf16 MFU | 121181 tok/s step 19513/19560 | loss 3.328109 (+0.77z)| norm 0.2159 (+0.34z)| lr 1.32e-08 | 4151.91 ms | 32.5% bf16 MFU | 121436 tok/s step 19514/19560 | loss 3.215303 (-1.99z)| norm 0.1947 (-0.42z)| lr 1.27e-08 | 4309.15 ms | 31.3% bf16 MFU | 121447 tok/s step 19515/19560 | loss 3.320034 (+0.56z)| norm 0.1977 (-0.31z)| lr 1.21e-08 | 4223.41 ms | 32.0% bf16 MFU | 121582 tok/s step 19516/19560 | loss 3.317335 (+0.49z)| norm 0.1990 (-0.26z)| lr 1.16e-08 | 4621.50 ms | 29.2% bf16 MFU | 121175 tok/s step 19517/19560 | loss 3.271970 (-0.62z)| norm 0.2034 (-0.11z)| lr 1.11e-08 | 5290.92 ms | 25.5% bf16 MFU | 120071 tok/s step 19518/19560 | loss 3.331641 (+0.83z)| norm 0.1961 (-0.37z)| lr 1.06e-08 | 4295.82 ms | 31.4% bf16 MFU | 120170 tok/s step 19519/19560 | loss 3.281169 (-0.40z)| norm 0.2037 (-0.10z)| lr 1.01e-08 | 4360.76 ms | 31.0% bf16 MFU | 120173 tok/s step 19520/19560 | loss 3.325060 (+0.67z)| norm 0.1939 (-0.44z)| lr 9.63e-09 | 4229.99 ms | 31.9% bf16 MFU | 120361 tok/s step 19521/19560 | loss 3.293123 (-0.11z)| norm 0.2168 (+0.37z)| lr 9.18e-09 | 4267.96 ms | 31.6% bf16 MFU | 120485 tok/s step 19522/19560 | loss 3.325170 (+0.67z)| norm 0.1991 (-0.27z)| lr 8.73e-09 | 5980.25 ms | 22.6% bf16 MFU | 118844 tok/s step 19523/19560 | loss 3.227540 (-1.68z)| norm 0.2110 (+0.16z)| lr 8.27e-09 | 5687.03 ms | 23.7% bf16 MFU | 117512 tok/s step 19524/19560 | loss 3.246906 (-1.19z)| norm 0.1981 (-0.30z)| lr 7.84e-09 | 4142.74 ms | 32.6% bf16 MFU | 117964 tok/s step 19525/19560 | loss 3.247264 (-1.17z)| norm 0.2069 (+0.01z)| lr 7.41e-09 | 4155.70 ms | 32.5% bf16 MFU | 118374 tok/s step 19526/19560 | loss 3.289895 (-0.15z)| norm 0.1977 (-0.32z)| lr 7.01e-09 | 4283.34 ms | 31.5% bf16 MFU | 118575 tok/s step 19527/19560 | loss 3.200017 (-2.26z)| norm 0.1965 (-0.36z)| lr 6.63e-09 | 4567.24 ms | 29.6% bf16 MFU | 118386 tok/s step 19528/19560 | loss 3.316048 (+0.49z)| norm 0.1993 (-0.26z)| lr 6.25e-09 | 4506.95 ms | 30.0% bf16 MFU | 118283 tok/s step 19529/19560 
| loss 3.280232 (-0.35z)| norm 0.2008 (-0.21z)| lr 5.87e-09 | 4569.88 ms | 29.5% bf16 MFU | 118105 tok/s step 19530/19560 | loss 3.276835 (-0.42z)| norm 0.1956 (-0.39z)| lr 5.51e-09 | 4152.57 ms | 32.5% bf16 MFU | 118513 tok/s step 19531/19560 | loss 3.315094 (+0.48z)| norm 0.2056 (-0.04z)| lr 5.15e-09 | 4166.63 ms | 32.4% bf16 MFU | 118879 tok/s step 19532/19560 | loss 3.291578 (-0.08z)| norm 0.1940 (-0.45z)| lr 4.82e-09 | 4706.62 ms | 28.7% bf16 MFU | 118505 tok/s step 19533/19560 | loss 3.298278 (+0.09z)| norm 0.1979 (-0.40z)| lr 4.48e-09 | 4169.06 ms | 32.4% bf16 MFU | 118867 tok/s step 19534/19560 | loss 3.310124 (+0.36z)| norm 0.2043 (-0.01z)| lr 4.17e-09 | 4236.14 ms | 31.9% bf16 MFU | 119112 tok/s step 19535/19560 | loss 3.315853 (+0.49z)| norm 0.1967 (-0.48z)| lr 3.86e-09 | 4163.29 ms | 32.4% bf16 MFU | 119453 tok/s step 19536/19560 | loss 3.329041 (+0.80z)| norm 0.1970 (-0.45z)| lr 3.58e-09 | 4228.63 ms | 31.9% bf16 MFU | 119680 tok/s step 19537/19560 | loss 3.376697 (+1.93z)| norm 0.1945 (-0.79z)| lr 3.29e-09 | 4333.05 ms | 31.2% bf16 MFU | 119746 tok/s step 19538/19560 | loss 3.298517 (+0.08z)| norm 0.2022 (-0.10z)| lr 3.03e-09 | 4244.96 ms | 31.8% bf16 MFU | 119934 tok/s step 19539/19560 | loss 3.225399 (-1.64z)| norm 0.1947 (-0.76z)| lr 2.77e-09 | 4161.34 ms | 32.4% bf16 MFU | 120237 tok/s step 19540/19560 | loss 3.273483 (-0.49z)| norm 0.1960 (-0.64z)| lr 2.53e-09 | 4177.82 ms | 32.3% bf16 MFU | 120499 tok/s step 19541/19560 | loss 3.232302 (-1.47z)| norm 0.1958 (-0.65z)| lr 2.29e-09 | 4741.71 ms | 28.5% bf16 MFU | 120003 tok/s step 19542/19560 | loss 3.280896 (-0.33z)| norm 0.1979 (-0.47z)| lr 2.07e-09 | 4620.09 ms | 29.2% bf16 MFU | 119677 tok/s step 19543/19560 | loss 3.232732 (-1.45z)| norm 0.2019 (-0.11z)| lr 1.86e-09 | 4225.37 ms | 32.0% bf16 MFU | 119897 tok/s step 19544/19560 | loss 3.248371 (-1.07z)| norm 0.1948 (-0.74z)| lr 1.65e-09 | 4844.23 ms | 27.9% bf16 MFU | 119314 tok/s step 19545/19560 | loss 3.229438 (-1.49z)| norm 0.2050 (+0.17z)| 
lr 1.48e-09 | 4174.85 ms | 32.3% bf16 MFU | 119627 tok/s step 19546/19560 | loss 3.380672 (+2.02z)| norm 0.2058 (+0.25z)| lr 1.29e-09 | 5386.77 ms | 25.1% bf16 MFU | 118512 tok/s step 19547/19560 | loss 3.272699 (-0.48z)| norm 0.2022 (-0.07z)| lr 1.12e-09 | 4393.93 ms | 30.7% bf16 MFU | 118553 tok/s step 19548/19560 | loss 3.308559 (+0.36z)| norm 0.2132 (+0.90z)| lr 9.78e-10 | 4343.61 ms | 31.1% bf16 MFU | 118660 tok/s step 19549/19560 | loss 3.359111 (+1.52z)| norm 0.2039 (+0.07z)| lr 8.34e-10 | 4165.04 ms | 32.4% bf16 MFU | 119021 tok/s step 19550/19560 | loss 3.273802 (-0.44z)| norm 0.2084 (+0.47z)| lr 6.91e-10 | 4246.71 ms | 31.8% bf16 MFU | 119243 tok/s step 19551/19560 | loss 3.340824 (+1.09z)| norm 0.1998 (-0.30z)| lr 5.72e-10 | 4187.43 ms | 32.2% bf16 MFU | 119541 tok/s step 19552/19560 | loss 3.245020 (-1.11z)| norm 0.2032 (+0.00z)| lr 4.53e-10 | 4175.94 ms | 32.3% bf16 MFU | 119841 tok/s step 19553/19560 | loss 3.253066 (-0.92z)| norm 0.1969 (-0.56z)| lr 3.58e-10 | 4400.64 ms | 30.7% bf16 MFU | 119806 tok/s step 19554/19560 | loss 3.251013 (-0.96z)| norm 0.1961 (-0.63z)| lr 2.86e-10 | 4179.37 ms | 32.3% bf16 MFU | 120088 tok/s step 19555/19560 | loss 3.292035 (-0.02z)| norm 0.1966 (-0.57z)| lr 2.15e-10 | 4218.05 ms | 32.0% bf16 MFU | 120299 tok/s step 19556/19560 | loss 3.247669 (-1.02z)| norm 0.1989 (-0.36z)| lr 1.43e-10 | 4753.17 ms | 28.4% bf16 MFU | 119799 tok/s step 19557/19560 | loss 3.291830 (-0.01z)| norm 0.2070 (+0.36z)| lr 9.54e-11 | 5020.26 ms | 26.9% bf16 MFU | 119031 tok/s step 19558/19560 | loss 3.270460 (-0.49z)| norm 0.2099 (+0.62z)| lr 4.77e-11 | 4730.16 ms | 28.5% bf16 MFU | 118621 tok/s step 19559/19560 | loss 3.297637 (+0.14z)| norm 0.2466 (+3.68z)| lr 2.38e-11 | 4726.99 ms | 28.6% bf16 MFU | 118236 tok/s step 19560/19560 | loss 3.238509 (-1.21z)| norm 0.2009 (-0.20z)| lr 0.00e+00 | 4988.27 ms | 27.1% bf16 MFU | 117579 tok/s val loss 3.251305 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 
evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 
670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3057/10042 = 0.304421 generating: --- The four alpine ski sports debuts are included in the Round Up and Team draw within the Strategy pack, along with a 
choice of pricing and entry to FUN. Finally, a three-night accommodation package with accommodation in ski resort Napier-Sloan Alps - Milwaukee (3L); Lanzarote --- Writing checkpoint at step 19560 Writing model to log124M/model_00019560.bin Writing state to log124M/state_00019560_00000.bin total average iteration time: 4212.055771 ms